Unable to run pods on g5.48xlarge instance; other g5 instances work well
1. Quick Debug Information
- Container Runtime Type/Version (e.g. Containerd, CRI-O, Docker): CRI-O
- K8s Flavor/Version (e.g. K8s, OCP, Rancher, GKE, EKS): OpenShift v4.13.23
- GPU Operator Version: 23.9.1, gpu-operator-certified.v1.11.1
2. Issue or feature description
We have an OpenShift cluster with the NVIDIA GPU Operator installed. When we run any GPU pod on a g5.48xlarge node, it fails with the following error:
Allocate failed due to device plugin GetPreferredAllocation rpc failed with err: rpc error: code = Unknown desc = error getting list of preferred allocation devices: unable to retrieve list of available devices: error creating nvml.Device 0: nvml: Unknown Error, which is unexpected
The same pod works fine on other instance types such as g5.4xlarge and g5.12xlarge. This behaviour started only recently; earlier the same pod also ran on g5.48xlarge instances.
We also see that the pod from nvidia-dcgm-exporter is failing with the following errors:
(combined from similar events): Error: container create failed: time="2023-12-13T11:17:30Z" level=error msg="runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: nvidia-container-cli.real: detection error: nvml error: unknown error\n"
Error: container create failed: time="2023-12-13T10:29:12Z" level=error msg="runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: nvidia-container-cli.real: initialization error: driver rpc error: timed out\n"
3. Steps to reproduce the issue
Schedule any GPU pod on a g5.48xlarge node: the pod gets assigned to the node, but it fails to start with the error above. A minimal pod spec that reproduces this is shown below.
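For reference, the failure can be reproduced with a minimal test pod like the one below (the pod name and CUDA image are only illustrative; any container requesting nvidia.com/gpu hits the same allocation error on that node):

# Minimal GPU test pod (illustrative): requesting a single GPU exercises the
# device plugin allocation path that fails on g5.48xlarge.
cat <<'EOF' | oc apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test
spec:
  restartPolicy: Never
  containers:
  - name: cuda
    image: nvcr.io/nvidia/cuda:12.2.0-base-ubi8
    command: ["nvidia-smi"]
    resources:
      limits:
        nvidia.com/gpu: 1
EOF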
4. Information to attach (optional if deemed irrelevant)
- [ ] kubernetes pods status:
      kubectl get pods -n OPERATOR_NAMESPACE
- [ ] kubernetes daemonset status:
      kubectl get ds -n OPERATOR_NAMESPACE
- [ ] If a pod/ds is in an error state or pending state:
      kubectl describe pod -n OPERATOR_NAMESPACE POD_NAME
- [ ] If a pod/ds is in an error state or pending state:
      kubectl logs -n OPERATOR_NAMESPACE POD_NAME --all-containers
- [ ] Output from running nvidia-smi from the driver container:
      kubectl exec DRIVER_POD_NAME -n OPERATOR_NAMESPACE -c nvidia-driver-ctr -- nvidia-smi
- [ ] containerd logs:
      journalctl -u containerd > containerd.log
Logs from the nvidia-dcgm-exporter pod:
time="2023-12-13T10:25:04Z" level=info msg="Starting dcgm-exporter"
time="2023-12-13T10:25:04Z" level=info msg="Attemping to connect to remote hostengine at localhost:5555"
time="2023-12-13T10:25:04Z" level=info msg="DCGM successfully initialized!"
time="2023-12-13T10:25:05Z" level=info msg="Collecting DCP Metrics"
time="2023-12-13T10:25:05Z" level=info msg="No configmap data specified, falling back to metric file /etc/dcgm-exporter/dcgm-metrics.csv"
time="2023-12-13T10:25:05Z" level=info msg="Initializing system entities of type: GPU"
time="2023-12-13T10:25:30Z" level=fatal msg="Failed to watch metrics: Error watching fields: The third-party Profiling module returned an unrecoverable error"
Logs from GPU feature discovery pod:
I1213 10:24:42.754239 1 main.go:122] Starting OS watcher.
I1213 10:24:42.754459 1 main.go:127] Loading configuration.
I1213 10:24:42.754781 1 main.go:139]
Running with config:
{
"version": "v1",
"flags": {
"migStrategy": "single",
"failOnInitError": true,
"gdsEnabled": null,
"mofedEnabled": null,
"gfd": {
"oneshot": false,
"noTimestamp": false,
"sleepInterval": "1m0s",
"outputFile": "/etc/kubernetes/node-feature-discovery/features.d/gfd",
"machineTypeFile": "/sys/class/dmi/id/product_name"
}
},
"resources": {
"gpus": null
},
"sharing": {
"timeSlicing": {}
}
}
I1213 10:24:42.755227 1 factory.go:48] Detected NVML platform: found NVML library
I1213 10:24:42.755282 1 factory.go:48] Detected non-Tegra platform: /sys/devices/soc0/family file not found
I1213 10:24:42.755294 1 factory.go:64] Using NVML manager
I1213 10:24:42.755301 1 main.go:144] Start running
I1213 10:24:43.018503 1 main.go:187] Creating Labels
2023/12/13 10:24:43 Writing labels to output file /etc/kubernetes/node-feature-discovery/features.d/gfd
I1213 10:24:43.018687 1 main.go:197] Sleeping for 60000000000
I1213 10:29:12.978389 1 main.go:119] Exiting
E1213 10:29:12.978748 1 main.go:95] error creating NVML labeler: error creating mig capability labeler: error getting mig capability: error getting MIG mode: Unknown Error
GPU ClusterPolicy:
apiVersion: nvidia.com/v1
kind: ClusterPolicy
metadata:
creationTimestamp: '2023-10-02T12:22:11Z'
generation: 1
managedFields:
- apiVersion: nvidia.com/v1
fieldsType: FieldsV1
fieldsV1:
'f:spec':
'f:gds':
.: {}
'f:enabled': {}
'f:vgpuManager':
.: {}
'f:enabled': {}
'f:vfioManager':
.: {}
'f:enabled': {}
'f:daemonsets':
.: {}
'f:rollingUpdate':
.: {}
'f:maxUnavailable': {}
'f:updateStrategy': {}
'f:sandboxWorkloads':
.: {}
'f:defaultWorkload': {}
'f:enabled': {}
'f:nodeStatusExporter':
.: {}
'f:enabled': {}
'f:toolkit':
.: {}
'f:enabled': {}
'f:installDir': {}
'f:vgpuDeviceManager':
.: {}
'f:enabled': {}
.: {}
'f:gfd':
.: {}
'f:enabled': {}
'f:migManager':
.: {}
'f:enabled': {}
'f:mig':
.: {}
'f:strategy': {}
'f:operator':
.: {}
'f:defaultRuntime': {}
'f:initContainer': {}
'f:runtimeClass': {}
'f:use_ocp_driver_toolkit': {}
'f:dcgm':
.: {}
'f:enabled': {}
'f:dcgmExporter':
.: {}
'f:config':
.: {}
'f:name': {}
'f:enabled': {}
'f:serviceMonitor':
.: {}
'f:enabled': {}
'f:sandboxDevicePlugin':
.: {}
'f:enabled': {}
'f:driver':
.: {}
'f:certConfig':
.: {}
'f:name': {}
'f:enabled': {}
'f:kernelModuleConfig':
.: {}
'f:name': {}
'f:licensingConfig':
.: {}
'f:configMapName': {}
'f:nlsEnabled': {}
'f:repoConfig':
.: {}
'f:configMapName': {}
'f:upgradePolicy':
.: {}
'f:autoUpgrade': {}
'f:drain':
.: {}
'f:deleteEmptyDir': {}
'f:enable': {}
'f:force': {}
'f:timeoutSeconds': {}
'f:maxParallelUpgrades': {}
'f:maxUnavailable': {}
'f:podDeletion':
.: {}
'f:deleteEmptyDir': {}
'f:force': {}
'f:timeoutSeconds': {}
'f:waitForCompletion':
.: {}
'f:timeoutSeconds': {}
'f:virtualTopology':
.: {}
'f:config': {}
'f:devicePlugin':
.: {}
'f:config':
.: {}
'f:default': {}
'f:name': {}
'f:enabled': {}
'f:validator':
.: {}
'f:plugin':
.: {}
'f:env': {}
manager: Mozilla
operation: Update
time: '2023-10-02T12:22:11Z'
- apiVersion: nvidia.com/v1
fieldsType: FieldsV1
fieldsV1:
'f:status':
.: {}
'f:namespace': {}
manager: Go-http-client
operation: Update
subresource: status
time: '2023-12-11T14:01:59Z'
- apiVersion: nvidia.com/v1
fieldsType: FieldsV1
fieldsV1:
'f:status':
'f:conditions': {}
'f:state': {}
manager: gpu-operator
operation: Update
subresource: status
time: '2023-12-13T10:29:14Z'
name: gpu-cluster-policy
resourceVersion: '1243373036'
uid: 1e79d1d1-cfc8-493f-bad0-4a94fa0a2da7
spec:
vgpuDeviceManager:
enabled: true
migManager:
enabled: true
operator:
defaultRuntime: crio
initContainer: {}
runtimeClass: nvidia
use_ocp_driver_toolkit: true
dcgm:
enabled: true
gfd:
enabled: true
dcgmExporter:
config:
name: console-plugin-nvidia-gpu
enabled: true
serviceMonitor:
enabled: true
driver:
certConfig:
name: ''
enabled: true
kernelModuleConfig:
name: ''
licensingConfig:
configMapName: ''
nlsEnabled: false
repoConfig:
configMapName: ''
upgradePolicy:
autoUpgrade: true
drain:
deleteEmptyDir: false
enable: false
force: false
timeoutSeconds: 300
maxParallelUpgrades: 1
maxUnavailable: 25%
podDeletion:
deleteEmptyDir: false
force: false
timeoutSeconds: 300
waitForCompletion:
timeoutSeconds: 0
virtualTopology:
config: ''
devicePlugin:
config:
default: ''
name: ''
enabled: true
mig:
strategy: single
sandboxDevicePlugin:
enabled: true
validator:
plugin:
env:
- name: WITH_WORKLOAD
value: 'false'
nodeStatusExporter:
enabled: true
daemonsets:
rollingUpdate:
maxUnavailable: '1'
updateStrategy: RollingUpdate
sandboxWorkloads:
defaultWorkload: container
enabled: false
gds:
enabled: false
vgpuManager:
enabled: false
vfioManager:
enabled: true
toolkit:
enabled: true
installDir: /usr/local/nvidia
status:
conditions:
- lastTransitionTime: '2023-12-13T10:25:37Z'
message: ''
reason: Error
status: 'False'
type: Ready
- lastTransitionTime: '2023-12-13T10:25:37Z'
message: >-
ClusterPolicy is not ready, states not ready: [state-dcgm-exporter
gpu-feature-discovery]
reason: OperandNotReady
status: 'True'
type: Error
namespace: nvidia-gpu-operator
state: notReady
Collecting full debug bundle (optional):
curl -o must-gather.sh -L https://raw.githubusercontent.com/NVIDIA/gpu-operator/master/hack/must-gather.sh
chmod +x must-gather.sh
./must-gather.sh
NOTE: please refer to the must-gather script for debug data collected.
This bundle can be submitted to us via email: [email protected]
Hi @shivamerla, can you help here? Many thanks :)
@arpitsharma-vw can you check dmesg on the node and report any driver errors: dmesg | grep -i nvrm. If you see GSP RM related errors, please try this workaround to disable GSP RM.
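For reference, one way to run that check is through the driver container, reusing the DRIVER_POD_NAME placeholder from the template above (if dmesg is not readable from inside the container, run it directly on the node):

# Look for GSP-RM related errors reported by the NVIDIA kernel module (NVRM)
kubectl exec DRIVER_POD_NAME -n nvidia-gpu-operator -c nvidia-driver-ctr -- dmesg | grep -i nvrm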
@shivamerla What if the driver is already installed (as is the case with the EKS GPU AMI), will the driver component still try to apply the kernel module config?
Many thanks @shivamerla for your input. I can confirm that we see GSP RM related errors here. But regarding the fix: we have installed the GPU Operator via OLM (not Helm), and I am afraid that these changes will get wiped out again on the next upgrade.
Let me explain how this can be done on OpenShift:
First, create a ConfigMap as described in the doc for disabling GSP RM.
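The nvidia.conf referenced in the next command could look like this (a sketch, assuming the NVreg_EnableGpuFirmware module parameter documented by NVIDIA for disabling GSP firmware; please double-check against the docs for your driver branch):

# Kernel module options for the nvidia driver: disable GSP firmware (GSP RM)
cat > nvidia.conf <<'EOF'
options nvidia NVreg_EnableGpuFirmware=0
EOF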
oc create configmap kernel-module-params -n nvidia-gpu-operator --from-file=nvidia.conf=./nvidia.conf
Then add the following to the ClusterPolicy:
driver:
  <...>
  kernelModuleConfig:
    name: kernel-module-params
  <...>
You can do it either via the Web console, or using this command:
oc patch clusterpolicy/gpu-cluster-policy -n nvidia-gpu-operator --type='json' -p='[{"op": "add", "path": "/spec/driver/kernelModuleConfig/name", "value":"kernel-module-params"}]'
Essentially, the outcome should be the same, no matter if done via Helm or using the method I described. That is, the ClusterPolicy resource will have the right section added to it. The oc patch command above assumes that there is already a ClusterPolicy resource, but you can also add the required kernelModuleConfig section right away when creating the ClusterPolicy (via the Web console or from a file).
I believe that the changes will persist as they will be part of the ClusterPolicy. Also, the operator will probably restart the driver to pick up the changes. Please correct me if I'm wrong @shivamerla
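To double-check after the patch, something like the following should confirm that the ClusterPolicy references the ConfigMap and that the parameter was picked up once the driver pods restart (DRIVER_POD_NAME as in the template above; the /proc check assumes the driver exposes its parameters there):

# Confirm the ClusterPolicy now references the ConfigMap
oc get clusterpolicy gpu-cluster-policy -o jsonpath='{.spec.driver.kernelModuleConfig.name}'

# Confirm the module parameter took effect (expect "EnableGpuFirmware: 0")
oc exec -n nvidia-gpu-operator DRIVER_POD_NAME -c nvidia-driver-ctr -- grep EnableGpuFirmware /proc/driver/nvidia/params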
Same here. I'm using EKS 1.29 with the latest AMI that includes "a fix" (https://github.com/awslabs/amazon-eks-ami/issues/1494#issuecomment-1969724714), with GPU Operator v23.9.1.
Even the DCGM exporter fails to start, with this message:
Message: failed to create containerd task: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: Auto-detected mode as 'legacy'
nvidia-container-cli: detection error: nvml error: unknown error: unknown
After applying the suggested fix to disable GSP, I still see:
Warning UnexpectedAdmissionError 11s kubelet Allocate failed due to device plugin GetPreferredAllocation rpc failed with err: rpc error: code = Unknown desc = error getting list of preferred allocation devices: unable to retrieve list of available devices: error creating nvml.Device 0: nvml: Unknown Error, which is unexpected
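In case it helps narrow this down: since on EKS the driver is preinstalled with the AMI rather than deployed by the operator's driver container, it may be worth confirming on the node itself whether GSP is actually disabled after applying the fix (a sketch; the GSP fields only appear on recent driver versions):

# On the EKS GPU node: check whether the GSP-disable parameter is active
grep EnableGpuFirmware /proc/driver/nvidia/params

# Check whether nvidia-smi still reports a GSP firmware version ("N/A" when disabled)
nvidia-smi -q | grep -i 'GSP Firmware'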