
GPU resources are not recovered even after the XID error is resolved

jslouisyou opened this issue on Oct 25, 2024

Hello, NVIDIA team.

I recently faced an issue where the GPU resources (nvidia.com/gpu) reported by kubelet are not recovered (e.g. 7 -> 8) even after the XID error is resolved.

The nvidia-device-plugin-daemonset is deployed by the gpu-operator, and I'm using gpu-operator v23.9.2.

Here are more details:

I found that only 7 GPU cards were shown in Kubernetes, even though I'm using 8 GPU cards on the H100 node:

Capacity:
  cpu:                        128
  ephemeral-storage:          7441183616Ki
  hugepages-1Gi:              0
  hugepages-2Mi:              8448Mi
  memory:                     2113276288Ki
  nvidia.com/gpu:             8
  pods:                       110
Allocatable:
  cpu:                        128
  ephemeral-storage:          6857794809152
  hugepages-1Gi:              0
  hugepages-2Mi:              8448Mi
  memory:                     2062682496Ki
  nvidia.com/gpu:             7       <=========== here
  pods:                       110
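
For reference, the output above is from kubectl describe node; to check just the GPU counts I use something like the following, where "h100-node-1" is a placeholder for the affected node's name:

$ # Prints the nvidia.com/gpu lines from the node's Capacity, Allocatable, and
$ # Allocated resources sections; the node name is hypothetical.
$ kubectl describe node h100-node-1 | grep 'nvidia.com/gpu'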

nvidia-device-plugin-daemonset reports an XID 94 error on one of the GPU cards:

I1025 02:19:08.002792       1 health.go:151] Skipping non-nvmlEventTypeXidCriticalError event: {Device:{Handle:0x7f0dcf40bdf8} EventType:2 EventData:0 GpuInstanceId:4294967295 ComputeInstanceId:4294967295}
I1025 02:19:08.048144       1 health.go:160] Processing event {Device:{Handle:0x7f0dcf40bdf8} EventType:8 EventData:94 GpuInstanceId:4294967295 ComputeInstanceId:4294967295}
I1025 02:19:08.048185       1 health.go:186] XidCriticalError: Xid=94 on Device=GPU-d23d91a3-fb44-fdaa-7e44-52396b1b7e41; marking device as unhealthy.
I1025 02:19:08.048239       1 server.go:245] 'nvidia.com/gpu' device marked unhealthy: GPU-d23d91a3-fb44-fdaa-7e44-52396b1b7e41
I1025 02:19:08.049436       1 health.go:160] Processing event {Device:{Handle:0x7f0dcf40bdf8} EventType:8 EventData:94 GpuInstanceId:4294967295 ComputeInstanceId:4294967295}
I1025 02:19:08.049451       1 health.go:186] XidCriticalError: Xid=94 on Device=GPU-d23d91a3-fb44-fdaa-7e44-52396b1b7e41; marking device as unhealthy.
I1025 02:19:08.049483       1 server.go:245] 'nvidia.com/gpu' device marked unhealthy: GPU-d23d91a3-fb44-fdaa-7e44-52396b1b7e41
I1025 02:19:08.059938       1 health.go:160] Processing event {Device:{Handle:0x7f0dcf40bdf8} EventType:8 EventData:94 GpuInstanceId:4294967295 ComputeInstanceId:4294967295}
I1025 02:19:08.059948       1 health.go:186] XidCriticalError: Xid=94 on Device=GPU-d23d91a3-fb44-fdaa-7e44-52396b1b7e41; marking device as unhealthy.
I1025 02:19:08.059980       1 server.go:245] 'nvidia.com/gpu' device marked unhealthy: GPU-d23d91a3-fb44-fdaa-7e44-52396b1b7e41
I1025 02:19:08.074343       1 health.go:160] Processing event {Device:{Handle:0x7f0dcf40bdf8} EventType:8 EventData:94 GpuInstanceId:4294967295 ComputeInstanceId:4294967295}
I1025 02:19:08.074366       1 health.go:186] XidCriticalError: Xid=94 on Device=GPU-d23d91a3-fb44-fdaa-7e44-52396b1b7e41; marking device as unhealthy.
I1025 02:19:08.074389       1 server.go:245] 'nvidia.com/gpu' device marked unhealthy: GPU-d23d91a3-fb44-fdaa-7e44-52396b1b7e41
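
These lines come from the device plugin pod running on the affected node. Roughly how I pull them, assuming a default gpu-operator install (adjust the namespace, and substitute the pod that is scheduled on the affected node):

$ # Namespace assumes a default gpu-operator install; pick the pod on the affected node.
$ kubectl get pods -n gpu-operator -o wide | grep nvidia-device-plugin-daemonset
$ kubectl logs -n gpu-operator <device-plugin-pod-on-affected-node> | grep -i xid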

But after some time, the XID error seems to have been resolved (I think the application was restarted or removed). I can't find the XID error in nvidia-smi:

$ nvidia-smi
Fri Oct 25 11:35:11 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.129.03             Driver Version: 535.129.03   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA H100 80GB HBM3          On  | 00000000:1A:00.0 Off |                    2 |
| N/A   33C    P0              71W / 700W |      4MiB / 81559MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA H100 80GB HBM3          On  | 00000000:40:00.0 Off |                    0 |
| N/A   31C    P0              70W / 700W |      4MiB / 81559MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   2  NVIDIA H100 80GB HBM3          On  | 00000000:53:00.0 Off |                    0 |
| N/A   31C    P0              74W / 700W |      4MiB / 81559MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   3  NVIDIA H100 80GB HBM3          On  | 00000000:66:00.0 Off |                    0 |
| N/A   33C    P0              69W / 700W |      4MiB / 81559MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   4  NVIDIA H100 80GB HBM3          On  | 00000000:9C:00.0 Off |                    0 |
| N/A   35C    P0              71W / 700W |      4MiB / 81559MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   5  NVIDIA H100 80GB HBM3          On  | 00000000:C0:00.0 Off |                    0 |
| N/A   32C    P0              68W / 700W |      4MiB / 81559MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   6  NVIDIA H100 80GB HBM3          On  | 00000000:D2:00.0 Off |                    0 |
| N/A   34C    P0              70W / 700W |      4MiB / 81559MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   7  NVIDIA H100 80GB HBM3          On  | 00000000:E4:00.0 Off |                    0 |
| N/A   31C    P0              71W / 700W |      4MiB / 81559MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
                                                                                         
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+
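
For what it's worth, XID events are also written to the kernel log by the NVIDIA driver, so another way to check (on the host itself, outside nvidia-smi) is roughly:

$ # Run on the host, or from a shell with access to the host's kernel log; the driver
$ # logs XID events as "NVRM: Xid" messages.
$ dmesg -T | grep -i 'NVRM: Xid'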

But even though the XID error is resolved, nvidia-device-plugin-daemonset doesn't re-check the GPU cards and report the new status to kubelet, so kubelet still thinks that only some of the GPU cards can be used.

After I restarted the nvidia-device-plugin-daemonset pod, it reported to kubelet that all 8 GPU cards can be used (the nvidia.com/gpu count under Allocatable changed):

Capacity:
  cpu:                        128
  ephemeral-storage:          7441183616Ki
  hugepages-1Gi:              0
  hugepages-2Mi:              8448Mi
  memory:                     2113276288Ki
  nvidia.com/gpu:             8
  pods:                       110
Allocatable:
  cpu:                        128
  ephemeral-storage:          6857794809152
  hugepages-1Gi:              0
  hugepages-2Mi:              8448Mi
  memory:                     2062682496Ki
  nvidia.com/gpu:             8       <=========== here is changed
  pods:                       110
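
For reference, the "restart" was simply deleting the device plugin pod on that node so the DaemonSet recreates it; the namespace and pod label below assume a default gpu-operator install, and the node name is a placeholder:

$ # The recreated pod re-registers all GPUs with kubelet. Namespace, label, and node
$ # name are assumptions; adjust them to your deployment.
$ kubectl delete pod -n gpu-operator \
    -l app=nvidia-device-plugin-daemonset \
    --field-selector spec.nodeName=h100-node-1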

I think nvidia-device-plugin-daemonset should re-check the GPU status and report it to kubelet correctly. Could you please take a look at this issue?

Thanks.

jslouisyou commented on Oct 25, 2024

I also filed the same issue against NVIDIA/k8s-device-plugin:

https://github.com/NVIDIA/k8s-device-plugin/issues/1014

jslouisyou commented on Oct 25, 2024

I agree that it seems like Xid 94 is essentially an application error and should not disable the device. As a workaround, you can tell the device plugin to ignore it by setting its DP_DISABLE_HEALTHCHECKS environment variable to 94.
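
Since the plugin here is managed by the gpu-operator, one way to set that variable is through the ClusterPolicy's devicePlugin.env field. A rough sketch, assuming the default ClusterPolicy name cluster-policy (double-check the CR name, and note that a merge patch replaces the whole env list, so include any env vars you already set there):

$ # Hedged sketch: adds DP_DISABLE_HEALTHCHECKS=94 to the device plugin via the
$ # gpu-operator ClusterPolicy. "cluster-policy" is the default CR name; verify yours.
$ kubectl patch clusterpolicies.nvidia.com cluster-policy --type merge \
    -p '{"spec":{"devicePlugin":{"env":[{"name":"DP_DISABLE_HEALTHCHECKS","value":"94"}]}}}'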

nwilliams-bdai commented on Nov 12, 2024

This issue is stale because it has been open 90 days with no activity. This issue will be closed in 30 days unless new comments are made or the stale label is removed. To skip these checks, apply the "lifecycle/frozen" label.

github-actions[bot] commented on Nov 4, 2025

I am closing this issue since https://github.com/NVIDIA/k8s-device-plugin/issues/1014 is a better place to track it. As suggested in https://github.com/NVIDIA/gpu-operator/issues/1065#issuecomment-2471652615, you can configure the device plugin to ignore this XID error.

cdesiniotis commented on Nov 15, 2025