描述
之前在工作中碰到的问题,在容器环境中运行夜莺的categref采集器会出现心跳失败的情况,如下:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39
| root@localhost:/apps/data# /usr/local/bin/categraf --configs /apps/data/.gitrce/categraf/conf 2024/12/10 22:06:57 main.go:149: I! runner.binarydir: /usr/local/bin 2024/12/10 22:06:57 main.go:150: I! runner.hostname: localhost.localdomain 2024/12/10 22:06:57 main.go:151: I! runner.fd_limits: (soft=1048576, hard=1048576) 2024/12/10 22:06:57 main.go:152: I! runner.vm_limits: (soft=unlimited, hard=unlimited) 2024/12/10 22:06:57 provider_manager.go:60: I! use input provider: [local] 2024/12/10 22:06:57 prometheus_agent.go:19: I! prometheus scraping disabled! 2024/12/10 22:06:57 ibex_agent.go:19: I! ibex agent disabled! 2024/12/10 22:06:57 agent.go:38: I! agent starting 2024/12/10 22:06:57 metrics_agent.go:319: I! input: local.cpu started 2024/12/10 22:06:57 diskio.go:64: E! failed to get disk io: open /hostfs/proc/diskstats: no such file or directory 2024/12/10 22:06:57 metrics_agent.go:319: I! input: local.diskio started 2024/12/10 22:06:57 metrics_agent.go:319: I! input: local.ipmi started 2024/12/10 22:06:57 metrics_agent.go:319: I! input: local.mem started 2024/12/10 22:06:57 net.go:65: E! failed to get net io metrics: open /hostfs/proc/net/dev: no such file or directory 2024/12/10 22:06:57 metrics_agent.go:319: I! input: local.net started 2024/12/10 22:06:57 metrics_agent.go:319: I! input: local.netstat started 2024/12/10 22:06:57 netstat.go:69: E! failed to read sockstat /hostfs/proc/net/sockstat open /hostfs/proc/net/sockstat: no such file or directory 2024/12/10 22:06:57 diskstats_common.go:99: I! Parsing flag --collector.diskstats.device-exclude flag ^(z?ram|loop|fd|(h|s|v|xv)d[a-z]|nvme\d+n\d+p)\d+$ 2024/12/10 22:06:57 metrics_agent.go:319: I! input: local.node_exporter started 2024/12/10 22:06:57 collector.go:189: I! collector succeeded name filenotify duration_seconds 0.000130781 2024/12/10 22:06:57 collector.go:189: I! collector succeeded name netdev duration_seconds 0.000965317 2024/12/10 22:06:57 collector.go:189: I! collector succeeded name diskstats duration_seconds 0.001339931 2024/12/10 22:06:57 collector.go:189: I! collector succeeded name netstat duration_seconds 0.001483027 2024/12/10 22:06:57 metrics_agent.go:319: I! input: local.smart started 2024/12/10 22:06:57 metrics_agent.go:319: I! input: local.system started 2024/12/10 22:06:57 system.go:46: E! failed to gather cpu number: open /hostfs/proc/stat: no such file or directory 2024/12/10 22:06:57 agent.go:46: I! [*agent.MetricsAgent] started 2024/12/10 22:06:57 agent.go:49: I! agent started
2024/12/10 22:07:00 heartbeat.go:150: E! failed to marshal heartbeat request: json: unsupported value: NaN // [!code error]
^C2024/12/10 22:07:01 main.go:131: I! received signal: interrupt 2024/12/10 22:07:01 agent.go:53: I! agent stopping 2024/12/10 22:07:01 agent.go:61: I! [*agent.MetricsAgent] stopped 2024/12/10 22:07:01 agent.go:64: I! agent stopped 2024/12/10 22:07:01 main.go:144: I! exited root@localhost:/apps/data#
|
![image]()
在这里出现了一个NaN的报错,然后我在 GitHub 上提交了一个issue,并且得到了解决方案
解决方案
有位大佬提供了两个解决方案,我大概看了下,应该是我启动容器时的env环境的问题
![image]()
按照大佬所说的,我在容器启动配置中将HOST_PROC这个参数删掉了,然后重建容器,再次查看,心跳正常,问题成功解决!
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22
| { cat >$_compose_file <<EOF services: master: container_name: $_container_name image: $_image2 restart: always network_mode: host ipc: host uts: host privileged: true security_opt: - apparmor:unconfined environment: - TINI_SUBREAPER=1 - GIT_REMOTE_REPO=$_gre_remote_repo - APPS_DATA=$_apps_data - CONTAINER_NAME=$_container_name - HOST_PROC=/hostfs/proc // [!code --] - HOST_MOUNT_PREFIX=/hostfs EOF }
|