
thermal_zone collector stuck on Jetson Orin Nano

Open gouthamve opened this issue 1 year ago • 4 comments

Host operating system: output of uname -a

Linux jetson 5.15.136-tegra #1 SMP PREEMPT Wed Apr 24 19:36:48 PDT 2024 aarch64 aarch64 aarch64 GNU/Linux

node_exporter version: output of node_exporter --version

node_exporter, version 1.8.1 (branch: HEAD, revision: 400c3979931613db930ea035f39ce7b377cdbb5b)
  build user:       root@7afbff271a3f
  build date:       20240521-18:36:53
  go version:       go1.22.3
  platform:         linux/arm64
  tags:             unknown

node_exporter command line flags

./node_exporter

node_exporter log output

ts=2024-07-07T12:37:02.195Z caller=node_exporter.go:193 level=info msg="Starting node_exporter" version="(version=1.8.1, branch=HEAD, revision=400c3979931613db930ea035f39ce7b377cdbb5b)"
ts=2024-07-07T12:37:02.195Z caller=node_exporter.go:194 level=info msg="Build context" build_context="(go=go1.22.3, platform=linux/arm64, user=root@7afbff271a3f, date=20240521-18:36:53, tags=unknown)"
ts=2024-07-07T12:37:02.196Z caller=filesystem_common.go:111 level=info collector=filesystem msg="Parsed flag --collector.filesystem.mount-points-exclude" flag=^/(dev|proc|run/credentials/.+|sys|var/lib/docker/.+|var/lib/containers/storage/.+)($|/)
ts=2024-07-07T12:37:02.197Z caller=filesystem_common.go:113 level=info collector=filesystem msg="Parsed flag --collector.filesystem.fs-types-exclude" flag=^(autofs|binfmt_misc|bpf|cgroup2?|configfs|debugfs|devpts|devtmpfs|fusectl|hugetlbfs|iso9660|mqueue|nsfs|overlay|proc|procfs|pstore|rpc_pipefs|securityfs|selinuxfs|squashfs|sysfs|tracefs)$
ts=2024-07-07T12:37:02.197Z caller=diskstats_common.go:111 level=info collector=diskstats msg="Parsed flag --collector.diskstats.device-exclude" flag=^(z?ram|loop|fd|(h|s|v|xv)d[a-z]|nvme\d+n\d+p)\d+$
ts=2024-07-07T12:37:02.198Z caller=node_exporter.go:111 level=info msg="Enabled collectors"
ts=2024-07-07T12:37:02.198Z caller=node_exporter.go:118 level=info collector=arp
ts=2024-07-07T12:37:02.198Z caller=node_exporter.go:118 level=info collector=bcache
ts=2024-07-07T12:37:02.198Z caller=node_exporter.go:118 level=info collector=bonding
ts=2024-07-07T12:37:02.198Z caller=node_exporter.go:118 level=info collector=btrfs
ts=2024-07-07T12:37:02.198Z caller=node_exporter.go:118 level=info collector=conntrack
ts=2024-07-07T12:37:02.198Z caller=node_exporter.go:118 level=info collector=cpu
ts=2024-07-07T12:37:02.198Z caller=node_exporter.go:118 level=info collector=cpufreq
ts=2024-07-07T12:37:02.198Z caller=node_exporter.go:118 level=info collector=diskstats
ts=2024-07-07T12:37:02.198Z caller=node_exporter.go:118 level=info collector=dmi
ts=2024-07-07T12:37:02.198Z caller=node_exporter.go:118 level=info collector=edac
ts=2024-07-07T12:37:02.198Z caller=node_exporter.go:118 level=info collector=entropy
ts=2024-07-07T12:37:02.198Z caller=node_exporter.go:118 level=info collector=fibrechannel
ts=2024-07-07T12:37:02.198Z caller=node_exporter.go:118 level=info collector=filefd
ts=2024-07-07T12:37:02.198Z caller=node_exporter.go:118 level=info collector=filesystem
ts=2024-07-07T12:37:02.198Z caller=node_exporter.go:118 level=info collector=hwmon
ts=2024-07-07T12:37:02.198Z caller=node_exporter.go:118 level=info collector=infiniband
ts=2024-07-07T12:37:02.198Z caller=node_exporter.go:118 level=info collector=ipvs
ts=2024-07-07T12:37:02.198Z caller=node_exporter.go:118 level=info collector=loadavg
ts=2024-07-07T12:37:02.198Z caller=node_exporter.go:118 level=info collector=mdadm
ts=2024-07-07T12:37:02.198Z caller=node_exporter.go:118 level=info collector=meminfo
ts=2024-07-07T12:37:02.198Z caller=node_exporter.go:118 level=info collector=netclass
ts=2024-07-07T12:37:02.198Z caller=node_exporter.go:118 level=info collector=netdev
ts=2024-07-07T12:37:02.198Z caller=node_exporter.go:118 level=info collector=netstat
ts=2024-07-07T12:37:02.198Z caller=node_exporter.go:118 level=info collector=nfs
ts=2024-07-07T12:37:02.198Z caller=node_exporter.go:118 level=info collector=nfsd
ts=2024-07-07T12:37:02.198Z caller=node_exporter.go:118 level=info collector=nvme
ts=2024-07-07T12:37:02.198Z caller=node_exporter.go:118 level=info collector=os
ts=2024-07-07T12:37:02.198Z caller=node_exporter.go:118 level=info collector=powersupplyclass
ts=2024-07-07T12:37:02.198Z caller=node_exporter.go:118 level=info collector=pressure
ts=2024-07-07T12:37:02.198Z caller=node_exporter.go:118 level=info collector=rapl
ts=2024-07-07T12:37:02.198Z caller=node_exporter.go:118 level=info collector=schedstat
ts=2024-07-07T12:37:02.198Z caller=node_exporter.go:118 level=info collector=selinux
ts=2024-07-07T12:37:02.198Z caller=node_exporter.go:118 level=info collector=sockstat
ts=2024-07-07T12:37:02.198Z caller=node_exporter.go:118 level=info collector=softnet
ts=2024-07-07T12:37:02.198Z caller=node_exporter.go:118 level=info collector=stat
ts=2024-07-07T12:37:02.198Z caller=node_exporter.go:118 level=info collector=tapestats
ts=2024-07-07T12:37:02.198Z caller=node_exporter.go:118 level=info collector=textfile
ts=2024-07-07T12:37:02.198Z caller=node_exporter.go:118 level=info collector=thermal_zone
ts=2024-07-07T12:37:02.198Z caller=node_exporter.go:118 level=info collector=time
ts=2024-07-07T12:37:02.198Z caller=node_exporter.go:118 level=info collector=timex
ts=2024-07-07T12:37:02.198Z caller=node_exporter.go:118 level=info collector=udp_queues
ts=2024-07-07T12:37:02.198Z caller=node_exporter.go:118 level=info collector=uname
ts=2024-07-07T12:37:02.198Z caller=node_exporter.go:118 level=info collector=vmstat
ts=2024-07-07T12:37:02.198Z caller=node_exporter.go:118 level=info collector=watchdog
ts=2024-07-07T12:37:02.198Z caller=node_exporter.go:118 level=info collector=xfs
ts=2024-07-07T12:37:02.198Z caller=node_exporter.go:118 level=info collector=zfs
ts=2024-07-07T12:37:02.199Z caller=tls_config.go:313 level=info msg="Listening on" address=[::]:9100
ts=2024-07-07T12:37:02.199Z caller=tls_config.go:316 level=info msg="TLS is disabled." http2=false address=[::]:9100

Are you running node_exporter in Docker?

No

What did you do that produced an error?

I scraped it, and the scrape hung. Even after a couple of minutes the scrape did not complete. I narrowed it down to the thermal_zone collector.

Running ./node_exporter --no-collector.thermal_zone makes the scrapes work again.

How can I debug this further?

gouthamve avatar Jul 07 '24 12:07 gouthamve

thermal_zone comes from prometheus/procfs. It walks files in /sys/class/thermal.
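
For reference, here is a minimal sketch of what that walk amounts to (simplified, not the actual prometheus/procfs code; the paths are the standard sysfs layout). The hang reported above would happen inside a read like the one on the temp file:

package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strconv"
	"strings"
)

func main() {
	// Enumerate the thermal zones exposed by the kernel.
	zones, err := filepath.Glob("/sys/class/thermal/thermal_zone[0-9]*")
	if err != nil {
		panic(err)
	}
	for _, zone := range zones {
		// Each zone exposes a "type" (sensor name) and a "temp" in millidegrees Celsius.
		typ, _ := os.ReadFile(filepath.Join(zone, "type"))
		raw, err := os.ReadFile(filepath.Join(zone, "temp"))
		if err != nil {
			// Offline zones on the Orin Nano fail here (see the comments below).
			fmt.Printf("%s (%s): %v\n", zone, strings.TrimSpace(string(typ)), err)
			continue
		}
		milli, _ := strconv.ParseInt(strings.TrimSpace(string(raw)), 10, 64)
		fmt.Printf("%s (%s): %.1f C\n", zone, strings.TrimSpace(string(typ)), float64(milli)/1000)
	}
}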

SuperQ avatar Jul 07 '24 13:07 SuperQ

The same issue appears when using the Docker image. Trying to curl the metrics hangs unless you add the --no-collector.thermal_zone flag to the command.

Using go version go1.22.5

erik-fauna avatar Nov 12 '24 21:11 erik-fauna

This issue arises because thermal zones 2, 3, and 4 are offline on the Jetson Orin Nano.

  • CV0 (zone 2) and CV1 (zone 3) correspond to the DLA0 and DLA1 temperature sensors, respectively, while CV2 (zone 4) monitors the PVA temperature. Since the Orin Nano does not have DLA or PVA hardware, these thermal zones are non-functional on the platform. For additional context, refer to this forum post
  • The device tree is shared across multiple Jetson modules, which is why these zones are still defined and exposed, even though they do not correspond to actual hardware on the Orin Nano.
$ cat /sys/class/thermal/thermal_zone2/temp 
cat: /sys/class/thermal/thermal_zone2/temp: Resource temporarily unavailable

When node_exporter attempts to read these non-functional thermal zones, the read of the temp file fails (as shown above), the collector gets stuck, and the scrape never completes, so no thermal zone data is collected.
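
For what it's worth, the failing read can also be reproduced from Go, which shows the error a collector would have to tolerate (a sketch; thermal_zone2 is assumed to be one of the offline zones, as in the cat output above):

package main

import (
	"errors"
	"fmt"
	"os"
	"syscall"
)

func main() {
	// Read the temp file of a zone that is offline on the Orin Nano.
	_, err := os.ReadFile("/sys/class/thermal/thermal_zone2/temp")
	if errors.Is(err, syscall.EAGAIN) {
		// "Resource temporarily unavailable": the zone is offline, so a
		// collector could skip it instead of failing or hanging.
		fmt.Println("zone offline, skipping:", err)
		return
	}
	if err != nil {
		fmt.Println("unexpected error:", err)
		return
	}
	fmt.Println("read succeeded")
}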

Here is an ugly workaround that does not require changing the device tree: skip these zones in procfs/sysfs/class_thermal.go

@@ -47,6 +47,14 @@ func (fs FS) ClassThermalZoneStats() ([]ClassThermalZoneStats, error) {
                return nil, err
        }
 
+       for i := 0; i < len(zones); i++ {
+               zoneNum := zones[i][len(zones[i])-1] - '0'
+               if zoneNum == 2 || zoneNum == 3 || zoneNum == 4 {
+                       zones = append(zones[:i], zones[i+1:]...)
+                       i--
+               }
+       }
+       
        stats := make([]ClassThermalZoneStats, 0, len(zones))
        for _, zone := range zones {
                zoneStats, err := parseClassThermalZone(zone)
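
A less device-specific variant of the same idea (just a sketch, not the fix that later landed in procfs; it assumes the os and path/filepath packages are imported in that file) would be to drop any zone whose temp file cannot be read, instead of hardcoding zone numbers 2, 3 and 4:

// Hypothetical helper for procfs/sysfs/class_thermal.go: keep only the
// zones whose temp file is readable; offline zones (EAGAIN) are skipped.
func filterReadableZones(zones []string) []string {
	readable := zones[:0] // filter in place, preserving order
	for _, zone := range zones {
		if _, err := os.ReadFile(filepath.Join(zone, "temp")); err != nil {
			continue // offline or otherwise unreadable zone
		}
		readable = append(readable, zone)
	}
	return readable
}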

If you are okay with recompiling the DTB, these zones can be disabled as shown below:

Modifying the device tree

The thermal zones are enabled in the common file nv-platform/tegra234-p3768-0000+p3767-xxxx-nv-common.dtsi.

In the same directory you can find the .dts file for your Jetson module (check the SKU number).

You can disable the zones by adding the following to the root node:

 	thermal-zones {

		cv0-thermal {
			status = "disabled";
		};

		cv1-thermal {
			status = "disabled";
		};

		cv2-thermal {
			status = "disabled";
		};

	};

I attempted to trace the root cause of the issue, and here is the backtrace for reference:

[screenshot of the backtrace attached to the original comment]

It would be great if prometheus/procfs could handle this error more gracefully. Maybe we should open an issue in prometheus/procfs? @gouthamve

Update: opened an issue in the procfs repo.

bnbhat avatar Feb 19 '25 13:02 bnbhat

Also having this problem. I see that the fix has been merged in procfs, so I compiled against the latest commit and it works for me!
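
For anyone who wants to do the same before a node_exporter release picks this up, one way (a sketch; vX.Y.Z is a placeholder for whatever procfs release or pseudo-version actually contains the fix) is to bump the dependency in node_exporter's go.mod and rebuild:

// in node_exporter's go.mod — hypothetical pin; substitute the procfs
// release (or pseudo-version of the commit) that contains the fix
require github.com/prometheus/procfs vX.Y.Z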

mari-rv avatar Jul 07 '25 13:07 mari-rv