DCGM diagnostics fail in a container with fewer than 8 GPUs
Setup: DCGM checks fail for a Google Kubernetes pod that requests fewer than 8 GPUs, with the test executed in the init container.
As part of the preflight health checks, running DCGM diagnostics in a container with fewer than 8 GPUs fails with the following error:
| Permissions and OS Blocks | Fail |
| Error | File /dev/nvidia7 could not be accessed directly: Operation not permitted Check relevant permissions, access, and existence of the file. |
| | File /dev/nvidia2 could not be accessed directly: Operation not permitted Check relevant permissions, access, and existence of the file. |
| | File /dev/nvidia1 could not be accessed directly: Operation not permitted Check relevant permissions, access, and existence of the file. |
| | File /dev/nvidia0 could not be accessed directly: Operation not permitted Check relevant permissions, access, and existence of the file. |
| | The number of devices NVML returns is different than the number of devices in /dev. Check for the presence of cgroups, operating system blocks, and or unsupported / older cards |
It seems that the user container has all of /dev/nvidia0 through /dev/nvidia7 mounted, while NVML sees only 4 GPU devices. This discrepancy between the number of devices in /dev and the number of devices seen by NVML causes the DCGM diagnostics to fail.
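To confirm the mismatch from inside the container, something like the following rough sketch may help. It is not how DCGM performs its check. The helper names (`list_gpu_nodes`, `find_unopenable_nodes`, `node_count_mismatch`) are made up for illustration, and it assumes that a read-only open is enough to reveal a blocked device node. The NVML device count has to be supplied separately.

```python
import os
import re

# Matches per-GPU device nodes such as nvidia0, but not nvidiactl or nvidia-uvm.
_GPU_NODE = re.compile(r"^nvidia(\d+)$")


def list_gpu_nodes(dev_dir="/dev"):
    """Return the sorted GPU indices that have a device node in dev_dir."""
    indices = []
    for name in os.listdir(dev_dir):
        match = _GPU_NODE.match(name)
        if match:
            indices.append(int(match.group(1)))
    return sorted(indices)


def find_unopenable_nodes(indices, dev_dir="/dev"):
    """Return the indices whose device node cannot be opened by this process.

    Assumption: a read-only open is a reasonable stand-in for "accessed
    directly"; the check DCGM actually performs may differ.
    """
    blocked = []
    for index in indices:
        path = os.path.join(dev_dir, f"nvidia{index}")
        try:
            fd = os.open(path, os.O_RDONLY)
        except OSError:
            blocked.append(index)
        else:
            os.close(fd)
    return blocked


def node_count_mismatch(indices, nvml_count):
    """True when dev_dir exposes a different number of GPU nodes than NVML reports."""
    return len(indices) != nvml_count
```

In the situation described above, `list_gpu_nodes()` would be expected to return eight indices while NVML reports 4, so `node_count_mismatch(list_gpu_nodes(), 4)` would be `True`. The NVML count could be read by counting the lines printed by `nvidia-smi -L`, for example.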
@sanghvimanan I'm working on a fix here. Can you share details on how you created this container?
This issue has now been fixed and will be released with DCGM 3.2.6.
Awesome! Thanks, David.