gpu-operator validator: should validate GPU healthiness by using DCGM

Feature Description

AFAIU, the validator app currently validates only the driver installation.

It would be great to have additional validation steps for the DCGM installation, enabled through an optional flag. A healthy DCGM is important for exporting metrics via dcgm-exporter.

The idea is similar to this cookbook script: https://github.com/aws/aws-parallelcluster-cookbook/blob/v3.8.0/cookbooks/aws-parallelcluster-slurm/files/default/config_slurm/scripts/health_checks/gpu_health_check.sh - cc @shivamerla for awareness

GPU Health Check

Check GPU healthiness by executing NVIDIA DCGM diagnostic tool.

If GPU_DEVICE_ORDINAL is set, the diagnostic check targets the GPU listed in the variable.

Prerequisites for the diagnostic check are:

node has NVIDIA GPU

DCGM service is running

fabric manager service is running (if node is NVSwitch enabled)

persistence mode is enabled for the target GPU
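A minimal sketch of how the prerequisites above could gate the diagnostic. The `NodeState` type and `unmet_prerequisites` function are hypothetical names for illustration, not part of the validator or the cookbook, and how each fact is collected on the node is deliberately left out:

```python
from dataclasses import dataclass


@dataclass
class NodeState:
    """Facts about the node (hypothetical type; collection is out of scope)."""
    has_nvidia_gpu: bool
    dcgm_running: bool
    nvswitch_enabled: bool
    fabric_manager_running: bool
    persistence_mode_enabled: bool


def unmet_prerequisites(state: NodeState) -> list[str]:
    """Return the listed prerequisites that are not satisfied."""
    unmet = []
    if not state.has_nvidia_gpu:
        unmet.append("node has no NVIDIA GPU")
    if not state.dcgm_running:
        unmet.append("DCGM service is not running")
    # Fabric manager only matters on NVSwitch-enabled nodes.
    if state.nvswitch_enabled and not state.fabric_manager_running:
        unmet.append("fabric manager service is not running")
    if not state.persistence_mode_enabled:
        unmet.append("persistence mode is not enabled for the target GPU")
    return unmet
```

An empty list would mean the diagnostic can run; a non-empty list gives the reasons to skip it rather than report a failure.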

root@nvidia-dcgm-exporter-8kvn6:/# dcgmi diag -i 0 -r 2
Successfully ran diagnostic for group.
+---------------------------+------------------------------------------------+
| Diagnostic                | Result                                         |
+===========================+================================================+
|-----  Metadata  ----------+------------------------------------------------|
| DCGM Version              | 3.1.8                                          |
| Driver Version Detected   | 535.154.05                                     |
| GPU Device IDs Detected   | 20b5                                           |
|-----  Deployment  --------+------------------------------------------------|
| Denylist                  | Pass                                           |
| NVML Library              | Pass                                           |
| CUDA Main Library         | Pass                                           |
| Permissions and OS Blocks | Pass                                           |
| Persistence Mode          | Pass                                           |
| Environment Variables     | Pass                                           |
| Page Retirement/Row Remap | Pass                                           |
| Graphics Processes        | Pass                                           |
| Inforom                   | Skip                                           |
+-----  Integration  -------+------------------------------------------------+
| PCIe                      | Pass - All                                     |
+-----  Hardware  ----------+------------------------------------------------+
| GPU Memory                | Pass - All                                     |
+-----  Stress  ------------+------------------------------------------------+
+---------------------------+------------------------------------------------+
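For a validator to act on this output, the table would need to be turned into a pass/fail decision. Below is a rough sketch that parses the text table shown above. The function names are hypothetical, and the "Fail" and "Warn" status words are assumptions, since only "Pass" and "Skip" appear in this sample:

```python
# "Pass" and "Skip" appear in the sample output; "Fail" and "Warn" are assumed.
STATUS_WORDS = {"Pass", "Fail", "Skip", "Warn"}


def parse_diag_results(output: str) -> dict[str, str]:
    """Map each test name in the dcgmi diag table to its result string."""
    results = {}
    for line in output.splitlines():
        line = line.strip()
        # Keep only "| name | value |" rows; borders start with "+" and
        # section headers such as "|-----  Metadata" start with "|-".
        if not line.startswith("| "):
            continue
        cells = [c.strip() for c in line.strip("|").split("|")]
        if len(cells) != 2 or cells[0] == "Diagnostic":
            continue
        name, value = cells
        # Metadata rows (versions, device IDs) are not test results.
        if value.split(" ")[0] in STATUS_WORDS:
            results[name] = value
    return results


def failed_tests(results: dict[str, str]) -> list[str]:
    """Names of tests whose result starts with the assumed 'Fail' status."""
    return [name for name, value in results.items() if value.startswith("Fail")]
```

For the run above, `failed_tests(parse_diag_results(output))` would be empty, with `Inforom` recorded as skipped rather than failed.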

It should be run for each GPU listed by nvidia-smi -L.
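A sketch of the per-GPU loop, assuming `nvidia-smi -L` prints one `GPU <index>: ...` line per device and that this index can be passed to `dcgmi diag -i` as in the run above. The function names are hypothetical, and the commands are only built here, not executed:

```python
import re

# Matches lines such as "GPU 0: <name> (UUID: ...)"; other lines are ignored.
GPU_LINE = re.compile(r"^GPU (\d+):")


def parse_gpu_indices(listing: str) -> list[int]:
    """Extract GPU indices from the text printed by `nvidia-smi -L`."""
    indices = []
    for line in listing.splitlines():
        match = GPU_LINE.match(line.strip())
        if match:
            indices.append(int(match.group(1)))
    return indices


def diag_commands(indices: list[int], run_level: int = 2) -> list[list[str]]:
    """Build one `dcgmi diag -i <index> -r <level>` command per GPU."""
    return [
        ["dcgmi", "diag", "-i", str(i), "-r", str(run_level)]
        for i in indices
    ]
```

Each command list could then be handed to `subprocess.run` inside a container that has `dcgmi` available.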

Feb 16 '24 13:02 Dentrax

@cdesiniotis @Dentrax I'm interested in this feature. If you don't mind, can I write a PR?

Aug 30 '24 06:08 changhyuni

@changhyuni We also have to consider cases where users disable DCGM and dcgm-exporter in their gpu-operator deployments, so we wouldn't always be able to rely on this validation.

However, we have also begun looking into integrating GPU health monitoring components into gpu-operator. We are still in the exploratory phase, but we will keep you posted as we have more updates.

Aug 31 '24 11:08 tariq1890

Any updates on this? We would love to be able to run diagnostics, but dcgmi diag raises this error:

Error when executing the diagnostic: The NVVS binary was not found in the specified location; please install it to /usr/share/nvidia-validation-suite/ or set environment variable NVVS_BIN_PATH to the directory containing nvvs. If you set NVVS_BIN_PATH, please restart the DCGM service (nv-hostengine) if it is active.
Nvvs stderr:
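A quick way to check the two locations named in the error before running the diagnostic might look like the sketch below. The `find_nvvs` helper is hypothetical, and the lookup order (`NVVS_BIN_PATH` first, then the default directory) is an assumption based only on the error text:

```python
import os
from pathlib import Path
from typing import Mapping, Optional

# Default directory quoted in the error message.
DEFAULT_NVVS_DIR = "/usr/share/nvidia-validation-suite/"


def find_nvvs(env: Optional[Mapping[str, str]] = None) -> Optional[Path]:
    """Return the path to the nvvs binary if it exists, otherwise None."""
    env = os.environ if env is None else env
    # Assumed order: NVVS_BIN_PATH overrides the default directory.
    directory = Path(env.get("NVVS_BIN_PATH") or DEFAULT_NVVS_DIR)
    candidate = directory / "nvvs"
    return candidate if candidate.is_file() else None
```

If this returns `None` inside the container where `dcgmi` runs, the image likely does not ship the binary at either location.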

Jan 07 '25 20:01 hiramf

This issue is stale because it has been open 90 days with no activity. This issue will be closed in 30 days unless new comments are made or the stale label is removed. To skip these checks, apply the "lifecycle/frozen" label.

Nov 05 '25 00:11 github-actions[bot]