
Diagnostics fail expecting NvLink on non-NvLink systems

Open plopresti opened this issue 3 years ago • 3 comments

I am running datacenter-gpu-manager-2.4.6 from https://developer.download.nvidia.com/compute/cuda/repos/rhel8/x86_64/

My OS is AlmaLinux 8.6 with CUDA 11.7.1. The GPUs are 2x A100.

When I start the nvidia-dcgm service and run "dcgmi diag -r 2", the PCIe test always fails with a bunch of warnings like:

+-----  Integration  -------+------------------------------------------------+
| PCIe                      | Fail - All                                     |
| Warning                   | GPU 0GPU 0's NvLink link 0 is currently down   |
|                           | Run a field diagnostic on the GPU., GPU 0GPU   |
|                           | 0's NvLink link 1 is currently down Run a fie  |
|                           | ld diagnostic on the GPU., GPU 0GPU 0's NvLin  |
|                           | k link 2 is currently down Run a field diagno  |
|                           | stic on the GPU., GPU 0GPU 0's NvLink link 3   |
|                           | is currently down Run a field diagnostic on t  |
|                           | he GPU., GPU 0GPU 0's NvLink link 4 is curren  |
|                           | tly down Run a field diagnostic on the GPU.,   
...

This happens on every GPU system without NvLink that I have tried, and I have tried several.

How can I get DCGM's diagnostics not to try to use NvLink when it is not present?

plopresti, Sep 19 '22 20:09

Hi, as of DCGM 2.4.7, you can turn off this check by adding '-p pcie.test_nvlink_status=false' to your dcgmi diag command line. For now, can you paste the output of nvidia-smi -q?

dbeer, Sep 20 '22 13:09
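
For readers who script their diagnostics, the suggestion above can be sketched as a small Python helper that assembles the dcgmi command line. The function name `build_diag_command` and its arguments are hypothetical and only for illustration. The `-r` and `-p pcie.test_nvlink_status=false` arguments are taken verbatim from this thread, and per the reply above the parameter is only expected to work on DCGM 2.4.7 or later.

```python
import shlex


def build_diag_command(run_level=2, skip_nvlink_check=True):
    """Return the argv list for a `dcgmi diag` run (hypothetical helper)."""
    argv = ["dcgmi", "diag", "-r", str(run_level)]
    if skip_nvlink_check:
        # Parameter quoted from the reply in this thread; it is described
        # there as available starting with DCGM 2.4.7.
        argv += ["-p", "pcie.test_nvlink_status=false"]
    return argv


def as_shell_string(argv):
    """Render the argv list as a copy-pasteable shell command."""
    return shlex.join(argv)
```

The returned list can be passed directly to `subprocess.run(argv, check=False)` on a host where dcgmi is installed, and `as_shell_string(build_diag_command())` gives the equivalent one-line command.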

@dbeer Thank you for your reply. Here is the output of nvidia-smi -q:

nvidia-smi-q.txt

I have a hunch that this call to nvmlDeviceGetLinkState is showing the link as disabled instead of unsupported. This theory is consistent with the output of "dcgmi nvlink -s" on my systems.

As an aside, I have found that I can add "NVreg_NvLinkDisable=1" to the nvidia kernel module parameters to convince these tests to pass. This is perhaps not the best solution, though.

Out of curiosity, are you able to reproduce this?

plopresti, Sep 20 '22 16:09
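
The disabled-versus-unsupported distinction described above can be checked outside DCGM. The following is a minimal sketch, assuming the `pynvml` bindings (package nvidia-ml-py) are installed, where the NVML link-state query is exposed as `nvmlDeviceGetNvLinkState`. The helper names `classify_links` and `probe_gpu` are hypothetical. A link that reports a state of "disabled" is labeled `down`, while a link for which NVML raises a not-supported error is labeled `unsupported`. Other NVML errors are deliberately not caught.

```python
def classify_links(get_state, max_links, not_supported_exc):
    """Map each link index to 'up', 'down', or 'unsupported'.

    get_state(link) must return True when the link is enabled and
    raise not_supported_exc when the query is unsupported.
    """
    result = {}
    for link in range(max_links):
        try:
            result[link] = "up" if get_state(link) else "down"
        except not_supported_exc:
            result[link] = "unsupported"
    return result


def probe_gpu(index=0):
    """Query NVLink state for one GPU (assumes pynvml and a driver are present)."""
    import pynvml  # imported lazily so classify_links works without it

    pynvml.nvmlInit()
    try:
        handle = pynvml.nvmlDeviceGetHandleByIndex(index)
        return classify_links(
            lambda link: pynvml.nvmlDeviceGetNvLinkState(handle, link)
            == pynvml.NVML_FEATURE_ENABLED,
            pynvml.NVML_NVLINK_MAX_LINKS,
            pynvml.NVMLError_NotSupported,
        )
    finally:
        pynvml.nvmlShutdown()
```

On a system showing the behavior reported in this issue, `probe_gpu(0)` would be expected to return `down` for the links rather than `unsupported`, which is consistent with the warnings in the diagnostic output.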

Adding that kernel module parameter is a good workaround if you do not have NVLinks on the board. If it works for you, it should be fine. As I said before, 2.4.7 also has the option to simply skip this test.

We have been able to reproduce this on certain systems that are meant to have NVLinks but do not. As you noticed, the links are reported as disabled instead of unsupported, and DCGM relies on NVML for this information. Internally, we use the option to skip this test.

dbeer, Sep 20 '22 18:09