Categref心跳报错

描述

之前在工作中碰到的问题，在容器环境中运行夜莺的categref采集器会出现心跳失败的情况，如下：

root@localhost:/apps/data# /usr/local/bin/categraf --configs /apps/data/.gitrce/categraf/conf
2024/12/10 22:06:57 main.go:149: I! runner.binarydir: /usr/local/bin
2024/12/10 22:06:57 main.go:150: I! runner.hostname: localhost.localdomain
2024/12/10 22:06:57 main.go:151: I! runner.fd_limits: (soft=1048576, hard=1048576)
2024/12/10 22:06:57 main.go:152: I! runner.vm_limits: (soft=unlimited, hard=unlimited)
2024/12/10 22:06:57 provider_manager.go:60: I! use input provider: [local]
2024/12/10 22:06:57 prometheus_agent.go:19: I! prometheus scraping disabled!
2024/12/10 22:06:57 ibex_agent.go:19: I! ibex agent disabled!
2024/12/10 22:06:57 agent.go:38: I! agent starting
2024/12/10 22:06:57 metrics_agent.go:319: I! input: local.cpu started
2024/12/10 22:06:57 diskio.go:64: E! failed to get disk io: open /hostfs/proc/diskstats: no such file or directory
2024/12/10 22:06:57 metrics_agent.go:319: I! input: local.diskio started
2024/12/10 22:06:57 metrics_agent.go:319: I! input: local.ipmi started
2024/12/10 22:06:57 metrics_agent.go:319: I! input: local.mem started
2024/12/10 22:06:57 net.go:65: E! failed to get net io metrics: open /hostfs/proc/net/dev: no such file or directory
2024/12/10 22:06:57 metrics_agent.go:319: I! input: local.net started
2024/12/10 22:06:57 metrics_agent.go:319: I! input: local.netstat started
2024/12/10 22:06:57 netstat.go:69: E! failed to read sockstat /hostfs/proc/net/sockstat open /hostfs/proc/net/sockstat: no such file or directory
2024/12/10 22:06:57 diskstats_common.go:99: I! Parsing flag --collector.diskstats.device-exclude flag ^(z?ram|loop|fd|(h|s|v|xv)d[a-z]|nvme\d+n\d+p)\d+$
2024/12/10 22:06:57 metrics_agent.go:319: I! input: local.node_exporter started
2024/12/10 22:06:57 collector.go:189: I! collector succeeded name filenotify duration_seconds 0.000130781
2024/12/10 22:06:57 collector.go:189: I! collector succeeded name netdev duration_seconds 0.000965317
2024/12/10 22:06:57 collector.go:189: I! collector succeeded name diskstats duration_seconds 0.001339931
2024/12/10 22:06:57 collector.go:189: I! collector succeeded name netstat duration_seconds 0.001483027
2024/12/10 22:06:57 metrics_agent.go:319: I! input: local.smart started
2024/12/10 22:06:57 metrics_agent.go:319: I! input: local.system started
2024/12/10 22:06:57 system.go:46: E! failed to gather cpu number: open /hostfs/proc/stat: no such file or directory
2024/12/10 22:06:57 agent.go:46: I! [*agent.MetricsAgent] started
2024/12/10 22:06:57 agent.go:49: I! agent started


2024/12/10 22:07:00 heartbeat.go:150: E! failed to marshal heartbeat request: json: unsupported value: NaN // [!code error]

^C2024/12/10 22:07:01 main.go:131: I! received signal: interrupt
2024/12/10 22:07:01 agent.go:53: I! agent stopping
2024/12/10 22:07:01 agent.go:61: I! [*agent.MetricsAgent] stopped
2024/12/10 22:07:01 agent.go:64: I! agent stopped
2024/12/10 22:07:01 main.go:144: I! exited
root@localhost:/apps/data#

在这里出现了一个NaN的报错，然后我在 GitHub 上提交了一个issue，并且得到了解决方案

解决方案

有位大佬提供了两个解决方案，我大概看了下，应该是我启动容器时的env环境的问题

按照大佬所说的，我在容器启动配置中将HOST_PROC这个参数删掉了，然后重建容器，再次查看，心跳正常，问题成功解决！

{
    cat >$_compose_file <<EOF
services:
  master:
    container_name: $_container_name
    image: $_image2
    restart: always
    network_mode: host
    ipc: host
    uts: host
    privileged: true
    security_opt:
      - apparmor:unconfined
    environment:
      - TINI_SUBREAPER=1
      - GIT_REMOTE_REPO=$_gre_remote_repo
      - APPS_DATA=$_apps_data
      - CONTAINER_NAME=$_container_name
      - HOST_PROC=/hostfs/proc // [!code --]
      - HOST_MOUNT_PREFIX=/hostfs
EOF
}