DCGM issues

Metrics around capturing gpu FLOPS

4

The metric "DCGM_FI_PROF_PIPE_FP64_ACTIVE" is defined as the "Ratio of cycles the fp64 pipe is active". I suppose the unit here is time. How do we equate this to a FLOPS count....

krishh85
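
One rough way to relate the pipe-activity ratio in the question above to a FLOPS figure is to scale the GPU's peak FP64 throughput by the sampled ratio. The sketch below is an illustration under stated assumptions, not an official DCGM formula: the function name `estimate_flops` and the peak value passed to it are hypothetical, and the result is better read as an upper bound, since an active pipe cycle does not guarantee that every lane is retiring a floating-point operation. Note that the field is a dimensionless fraction of cycles, not a time.

```python
def estimate_flops(pipe_active_ratio: float, peak_flops: float) -> float:
    """Scale an assumed peak throughput by a sampled pipe-activity ratio.

    pipe_active_ratio: a sampled value of a field such as
        DCGM_FI_PROF_PIPE_FP64_ACTIVE, expected in the range [0.0, 1.0].
    peak_flops: the peak FLOP/s for the matching precision, taken from the
        GPU's datasheet (an input assumed by this sketch, not read from DCGM).
    """
    if not 0.0 <= pipe_active_ratio <= 1.0:
        raise ValueError("pipe_active_ratio must be between 0.0 and 1.0")
    if peak_flops < 0:
        raise ValueError("peak_flops must be non-negative")
    # Fraction of cycles the pipe was busy times the peak rate gives a
    # rough ceiling on the achieved rate over the sampling window.
    return pipe_active_ratio * peak_flops


if __name__ == "__main__":
    # Hypothetical numbers: a 25% active ratio on a GPU with an assumed
    # FP64 peak of 8e12 FLOP/s.
    print(estimate_flops(0.25, 8e12))
```

Averaging the ratio over the same window as the workload being measured keeps the estimate comparable between runs.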

a question about dcgm policy listening for xid

2

If I register an XID through DCGM's policy and listen, when a certain XID (for example, 79) occurs, will the policy keep reporting that XID until it recovers, or will...

BetaZYN

Memory usage by dcgm during runtime diagnostics

2

When running extended-level diagnostics on 8 cards simultaneously, 8 H20s may occupy approximately 8GB of memory at most, while 8 H800s may occupy up to 16GB of memory. What causes...

BetaZYN

Removal of dependencies on cuda v10

7

Hi there, cuda v10 has been EOL for some time, however it appears there are still several references to it [in various places in the project](https://github.com/search?q=repo%3ANVIDIA%2FDCGM+cuda10&type=code). Are there any plans...

mamccorm

dcgm-exporter crashes hostengine.

25

Running a [`3.3.5-3.4.0` exporter](https://github.com/NVIDIA/dcgm-exporter/releases/tag/3.3.5-3.4.0) on a 3.3.5 host engine as shipped via nvidia-ubuntu-repos SEGFAULTs the host engine. Is there something I can do? Should that be reported to the exporter instead?...

krono

Errors in nv-hostengine log

7

We use dcgm-exporter as described in https://docs.nvidia.com/datacenter/cloud-native/gpu-telemetry/latest/dcgm-exporter.html#connecting-to-an-existing-dcgm-agent. The `nv-hostengine` is version 3.1.8, the `dcgm-exporter` container is `nvcr.io/nvidia/k8s/dcgm-exporter:3.3.0-3.2.0-ubuntu22.04`. We use a custom metrics file with the following metrics: ```bash # Clocks,,...

itzsimpl

log spam of [[NvSwitch]] Not attached to NvSwitches. Aborting in cuda-dcgm-3.1.3.1 via Bright Cluster, RHEL 8

8

Using:
```
cuda-dcgm-libs-3.1.3.1-198_cm9.2.x86_64
cuda-dcgm-nvvs-3.1.3.1-198_cm9.2.x86_64
cuda-dcgm-3.1.3.1-198_cm9.2.x86_64
```
The `cm` stands for "Cluster Manager" as in Nvidia Bright Computing (now called Base Command). The /var/log/nv-hostengine.log is filling up with these entries every...

LinuxPersonEC

dcgm diagnostic command exits with status 226

1

We are running the dcgm diagnostic command below on an EC2 instance through a Docker container. The command runs for some time (~30 mins) and exits with status code 226. No other details...

rajeshvenkata

How to get the module profile loaded?

8

Hello! When I run the command below `dcgmi dmon -e 1001` the result is > Error setting watches. Result: -33: This request is serviced by a module of DCGM...

jxh314

DCGM_FI_PROF_GR_ENGINE_ACTIVE and MIG

10

Hi, I'm seeing some strange behavior of the DCGM_FI_PROF_GR_ENGINE_ACTIVE metric with MIG instances. Namely, the maximum values vary by instance type and don't seem to make sense. Here's the maximum...

neggert
