GPU resources are not recovered even after the XID error is resolved
Hello, NVIDIA team.
I recently ran into an issue where the GPU resources (nvidia.com/gpu) reported by the kubelet are not recovered (e.g. 7 -> 8) even after the XID error is resolved.
The nvidia-device-plugin-daemonset comes from gpu-operator, and I'm using gpu-operator v23.9.2.
Here are more details:
I found that only 7 GPU cards were shown from Kubernetes (in the node's Allocatable), even though the H100 node has 8 GPU cards:
Capacity:
cpu: 128
ephemeral-storage: 7441183616Ki
hugepages-1Gi: 0
hugepages-2Mi: 8448Mi
memory: 2113276288Ki
nvidia.com/gpu: 8
pods: 110
Allocatable:
cpu: 128
ephemeral-storage: 6857794809152
hugepages-1Gi: 0
hugepages-2Mi: 8448Mi
memory: 2062682496Ki
nvidia.com/gpu: 7 <=========== here
pods: 110
nvidia-device-plugin-daemonset reported that an XID 94 error occurred on one of the GPU cards:
I1025 02:19:08.002792 1 health.go:151] Skipping non-nvmlEventTypeXidCriticalError event: {Device:{Handle:0x7f0dcf40bdf8} EventType:2 EventData:0 GpuInstanceId:4294967295 ComputeInstanceId:4294967295}
I1025 02:19:08.048144 1 health.go:160] Processing event {Device:{Handle:0x7f0dcf40bdf8} EventType:8 EventData:94 GpuInstanceId:4294967295 ComputeInstanceId:4294967295}
I1025 02:19:08.048185 1 health.go:186] XidCriticalError: Xid=94 on Device=GPU-d23d91a3-fb44-fdaa-7e44-52396b1b7e41; marking device as unhealthy.
I1025 02:19:08.048239 1 server.go:245] 'nvidia.com/gpu' device marked unhealthy: GPU-d23d91a3-fb44-fdaa-7e44-52396b1b7e41
I1025 02:19:08.049436 1 health.go:160] Processing event {Device:{Handle:0x7f0dcf40bdf8} EventType:8 EventData:94 GpuInstanceId:4294967295 ComputeInstanceId:4294967295}
I1025 02:19:08.049451 1 health.go:186] XidCriticalError: Xid=94 on Device=GPU-d23d91a3-fb44-fdaa-7e44-52396b1b7e41; marking device as unhealthy.
I1025 02:19:08.049483 1 server.go:245] 'nvidia.com/gpu' device marked unhealthy: GPU-d23d91a3-fb44-fdaa-7e44-52396b1b7e41
I1025 02:19:08.059938 1 health.go:160] Processing event {Device:{Handle:0x7f0dcf40bdf8} EventType:8 EventData:94 GpuInstanceId:4294967295 ComputeInstanceId:4294967295}
I1025 02:19:08.059948 1 health.go:186] XidCriticalError: Xid=94 on Device=GPU-d23d91a3-fb44-fdaa-7e44-52396b1b7e41; marking device as unhealthy.
I1025 02:19:08.059980 1 server.go:245] 'nvidia.com/gpu' device marked unhealthy: GPU-d23d91a3-fb44-fdaa-7e44-52396b1b7e41
I1025 02:19:08.074343 1 health.go:160] Processing event {Device:{Handle:0x7f0dcf40bdf8} EventType:8 EventData:94 GpuInstanceId:4294967295 ComputeInstanceId:4294967295}
I1025 02:19:08.074366 1 health.go:186] XidCriticalError: Xid=94 on Device=GPU-d23d91a3-fb44-fdaa-7e44-52396b1b7e41; marking device as unhealthy.
I1025 02:19:08.074389 1 server.go:245] 'nvidia.com/gpu' device marked unhealthy: GPU-d23d91a3-fb44-fdaa-7e44-52396b1b7e41
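For reference, the UUID in the log above can be mapped back to a GPU index with something like:
$ nvidia-smi --query-gpu=index,uuid --format=csv,noheader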
But after some time elapsed, the XID error appears to have been resolved (I think the offending application was restarted or removed). I can't find the XID error in nvidia-smi anymore:
$ nvidia-smi
Fri Oct 25 11:35:11 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.129.03 Driver Version: 535.129.03 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA H100 80GB HBM3 On | 00000000:1A:00.0 Off | 2 |
| N/A 33C P0 71W / 700W | 4MiB / 81559MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
| 1 NVIDIA H100 80GB HBM3 On | 00000000:40:00.0 Off | 0 |
| N/A 31C P0 70W / 700W | 4MiB / 81559MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
| 2 NVIDIA H100 80GB HBM3 On | 00000000:53:00.0 Off | 0 |
| N/A 31C P0 74W / 700W | 4MiB / 81559MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
| 3 NVIDIA H100 80GB HBM3 On | 00000000:66:00.0 Off | 0 |
| N/A 33C P0 69W / 700W | 4MiB / 81559MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
| 4 NVIDIA H100 80GB HBM3 On | 00000000:9C:00.0 Off | 0 |
| N/A 35C P0 71W / 700W | 4MiB / 81559MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
| 5 NVIDIA H100 80GB HBM3 On | 00000000:C0:00.0 Off | 0 |
| N/A 32C P0 68W / 700W | 4MiB / 81559MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
| 6 NVIDIA H100 80GB HBM3 On | 00000000:D2:00.0 Off | 0 |
| N/A 34C P0 70W / 700W | 4MiB / 81559MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
| 7 NVIDIA H100 80GB HBM3 On | 00000000:E4:00.0 Off | 0 |
| N/A 31C P0 71W / 700W | 4MiB / 81559MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| No running processes found |
+---------------------------------------------------------------------------------------+
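The driver also logs Xid events to the kernel log, so as an additional check they can be looked up with something like (the exact log prefix may vary by driver version):
$ dmesg -T | grep -i 'NVRM: Xid'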
But even though the XID error is resolved, nvidia-device-plugin-daemonset does not re-check the status of the GPU cards and report it to the kubelet, so the kubelet still thinks that only some of the GPU cards can be used.
After I restarted the nvidia-device-plugin-daemonset pod, it reported to the kubelet that all 8 GPU cards can be used (the number of nvidia.com/gpu in Allocatable changed):
Capacity:
cpu: 128
ephemeral-storage: 7441183616Ki
hugepages-1Gi: 0
hugepages-2Mi: 8448Mi
memory: 2113276288Ki
nvidia.com/gpu: 8
pods: 110
Allocatable:
cpu: 128
ephemeral-storage: 6857794809152
hugepages-1Gi: 0
hugepages-2Mi: 8448Mi
memory: 2062682496Ki
nvidia.com/gpu: 8 <=========== here is changed
pods: 110
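For completeness, the restart was just deleting the device-plugin pod on that node so the DaemonSet recreates it, roughly like this (the namespace and pod label are assumptions that depend on how gpu-operator was installed):
$ # namespace and label may differ depending on the gpu-operator install
$ kubectl delete pod -n gpu-operator -l app=nvidia-device-plugin-daemonset \
    --field-selector spec.nodeName=<node-name>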
I think nvidia-device-plugin-daemonset should re-fetch the GPU status and report it to the kubelet correctly.
Could you please take a look at this issue?
Thanks.
I also filed the same issue against NVIDIA/k8s-device-plugin:
https://github.com/NVIDIA/k8s-device-plugin/issues/1014
I agree that it seems like Xid 94 is essentially an application error and should not disable the device.
But as a workaround, you can tell the device plugin to ignore this Xid by setting its DP_DISABLE_HEALTHCHECKS environment variable to 94.
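If the device plugin is deployed by gpu-operator, as in this report, one way to do that is through the ClusterPolicy's devicePlugin.env field. A minimal sketch, assuming the default ClusterPolicy instance name cluster-policy:
$ # resource and instance names are assumptions; adjust to your install
$ kubectl patch clusterpolicy cluster-policy --type merge \
    -p '{"spec":{"devicePlugin":{"env":[{"name":"DP_DISABLE_HEALTHCHECKS","value":"94"}]}}}'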
I am closing this issue as https://github.com/NVIDIA/k8s-device-plugin/issues/1014 is a better place to track this. As suggested in https://github.com/NVIDIA/gpu-operator/issues/1065#issuecomment-2471652615 you can configure the device-plugin to ignore this XID error.