DCGM
NVIDIA Data Center GPU Manager (DCGM) is a project for gathering telemetry and measuring the health of NVIDIA GPUs.
The metric "DCGM_FI_PROF_PIPE_FP64_ACTIVE" is defined as the "Ratio of cycles the fp64 pipe is active". I suppose the unit here is time. How do we equate this to a FLOPS count....
If I register an XID through DCGM's policy and listen, when a certain XID (for example, 79) occurs, will the policy keep reporting that XID until it recovers, or will...
When running extended-level diagnostics on 8 cards simultaneously, 8 H20s may occupy approximately 8GB of memory at most, while 8 H800s may occupy up to 16GB of memory. What causes...
Hi there, CUDA v10 has been EOL for some time; however, it appears there are still several references to it [in various places in the project](https://github.com/search?q=repo%3ANVIDIA%2FDCGM+cuda10&type=code). Are there any plans...
Running a [`3.3.5-3.4.0` exporter](https://github.com/NVIDIA/dcgm-exporter/releases/tag/3.3.5-3.4.0) on a 3.3.5 host engine as shipped via nvidia-ubuntu-repos segfaults the host engine. Is there something I can do? Should that be reported to the exporter instead?...
We use dcgm-exporter as described in https://docs.nvidia.com/datacenter/cloud-native/gpu-telemetry/latest/dcgm-exporter.html#connecting-to-an-existing-dcgm-agent. The `nv-hostengine` is version 3.1.8, the `dcgm-exporter` container is `nvcr.io/nvidia/k8s/dcgm-exporter:3.3.0-3.2.0-ubuntu22.04`. We use a custom metrics file with the following metrics: ```bash # Clocks,,...
Using: ``` cuda-dcgm-libs-3.1.3.1-198_cm9.2.x86_64 cuda-dcgm-nvvs-3.1.3.1-198_cm9.2.x86_64 cuda-dcgm-3.1.3.1-198_cm9.2.x86_64 ``` The '`cm`' stands for "Cluster Manager", as in NVIDIA Bright Computing (now called Base Command). The /var/log/nv-hostengine.log is filling up with these entries every...
We are running the DCGM diagnostic command below on an EC2 instance through a Docker container. The command runs for some time (~30 mins) and exits with status code 226. No other details...
Hello! When I run the command below `dcgmi dmon -e 1001` the result is > Error setting watches. Result: -33: This request is serviced by a module of DCGM...
Hi, I'm seeing some strange behavior of the DCGM_FI_PROF_GR_ENGINE_ACTIVE metric with MIG instances. Namely, the maximum values vary by instance type and don't seem to make sense. Here's the maximum...