Categraf heartbeat error
Reverse Lv4

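Description

I ran into this problem at work: when Nightingale's categraf collector runs in a container environment, its heartbeat fails, as shown below: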

root@localhost:/apps/data# /usr/local/bin/categraf --configs /apps/data/.gitrce/categraf/conf
2024/12/10 22:06:57 main.go:149: I! runner.binarydir: /usr/local/bin
2024/12/10 22:06:57 main.go:150: I! runner.hostname: localhost.localdomain
2024/12/10 22:06:57 main.go:151: I! runner.fd_limits: (soft=1048576, hard=1048576)
2024/12/10 22:06:57 main.go:152: I! runner.vm_limits: (soft=unlimited, hard=unlimited)
2024/12/10 22:06:57 provider_manager.go:60: I! use input provider: [local]
2024/12/10 22:06:57 prometheus_agent.go:19: I! prometheus scraping disabled!
2024/12/10 22:06:57 ibex_agent.go:19: I! ibex agent disabled!
2024/12/10 22:06:57 agent.go:38: I! agent starting
2024/12/10 22:06:57 metrics_agent.go:319: I! input: local.cpu started
2024/12/10 22:06:57 diskio.go:64: E! failed to get disk io: open /hostfs/proc/diskstats: no such file or directory
2024/12/10 22:06:57 metrics_agent.go:319: I! input: local.diskio started
2024/12/10 22:06:57 metrics_agent.go:319: I! input: local.ipmi started
2024/12/10 22:06:57 metrics_agent.go:319: I! input: local.mem started
2024/12/10 22:06:57 net.go:65: E! failed to get net io metrics: open /hostfs/proc/net/dev: no such file or directory
2024/12/10 22:06:57 metrics_agent.go:319: I! input: local.net started
2024/12/10 22:06:57 metrics_agent.go:319: I! input: local.netstat started
2024/12/10 22:06:57 netstat.go:69: E! failed to read sockstat /hostfs/proc/net/sockstat open /hostfs/proc/net/sockstat: no such file or directory
2024/12/10 22:06:57 diskstats_common.go:99: I! Parsing flag --collector.diskstats.device-exclude flag ^(z?ram|loop|fd|(h|s|v|xv)d[a-z]|nvme\d+n\d+p)\d+$
2024/12/10 22:06:57 metrics_agent.go:319: I! input: local.node_exporter started
2024/12/10 22:06:57 collector.go:189: I! collector succeeded name filenotify duration_seconds 0.000130781
2024/12/10 22:06:57 collector.go:189: I! collector succeeded name netdev duration_seconds 0.000965317
2024/12/10 22:06:57 collector.go:189: I! collector succeeded name diskstats duration_seconds 0.001339931
2024/12/10 22:06:57 collector.go:189: I! collector succeeded name netstat duration_seconds 0.001483027
2024/12/10 22:06:57 metrics_agent.go:319: I! input: local.smart started
2024/12/10 22:06:57 metrics_agent.go:319: I! input: local.system started
2024/12/10 22:06:57 system.go:46: E! failed to gather cpu number: open /hostfs/proc/stat: no such file or directory
2024/12/10 22:06:57 agent.go:46: I! [*agent.MetricsAgent] started
2024/12/10 22:06:57 agent.go:49: I! agent started


2024/12/10 22:07:00 heartbeat.go:150: E! failed to marshal heartbeat request: json: unsupported value: NaN

^C2024/12/10 22:07:01 main.go:131: I! received signal: interrupt
2024/12/10 22:07:01 agent.go:53: I! agent stopping
2024/12/10 22:07:01 agent.go:61: I! [*agent.MetricsAgent] stopped
2024/12/10 22:07:01 agent.go:64: I! agent stopped
2024/12/10 22:07:01 main.go:144: I! exited
root@localhost:/apps/data#


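The log shows a NaN error when the heartbeat request is marshaled. I opened an issue on GitHub and got a solution there.

Solution

An experienced user offered two possible fixes. After a quick look, the cause appeared to be the environment variables I set when starting the container.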


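Following that advice, I removed the HOST_PROC variable from the container's startup configuration and recreated the container. When I checked again, the heartbeat was normal and the problem was solved!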

{
    cat >"$_compose_file" <<EOF
services:
  master:
    container_name: $_container_name
    image: $_image2
    restart: always
    network_mode: host
    ipc: host
    uts: host
    privileged: true
    security_opt:
      - apparmor:unconfined
    environment:
      - TINI_SUBREAPER=1
      - GIT_REMOTE_REPO=$_gre_remote_repo
      - APPS_DATA=$_apps_data
      - CONTAINER_NAME=$_container_name
      # - HOST_PROC=/hostfs/proc   (removed: this setting led to the NaN heartbeat error)
      - HOST_MOUNT_PREFIX=/hostfs
EOF
}
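
The startup log shows categraf trying to open files under /hostfs/proc, the value that HOST_PROC had been set to, and failing with "no such file or directory". Before setting HOST_PROC in a container, it may be worth checking that the path really contains those files. The following is only a minimal sketch: the function name check_host_proc is hypothetical, and the file list is taken from the errors in the log above, not from categraf's source.

```shell
#!/bin/sh
# Sketch: report which procfs files are missing under a given proc root.
# check_host_proc is a hypothetical helper, not part of categraf.
check_host_proc() {
    # Use the first argument, else $HOST_PROC, else the default /proc.
    proc_root="${1:-${HOST_PROC:-/proc}}"
    missing=0
    # Files named in the "no such file or directory" errors in the log.
    for f in stat diskstats net/dev net/sockstat; do
        if [ ! -r "$proc_root/$f" ]; then
            echo "missing: $proc_root/$f"
            missing=$((missing + 1))
        fi
    done
    # Succeed only if every file was readable.
    [ "$missing" -eq 0 ]
}
```

Inside the container, running `check_host_proc /hostfs/proc || echo "HOST_PROC is not usable here"` should then indicate whether the variable points at a populated procfs. If it does not, removing the variable, as done above, lets the agent fall back to its default.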