Unable to run pods on g5.48xlarge instance; other g5 instances work well
1. Quick Debug Information
- Container Runtime Type/Version (e.g. Containerd, CRI-O, Docker): CRI-O
- K8s Flavor/Version (e.g. K8s, OCP, Rancher, GKE, EKS): OpenShift v4.13.23
- GPU Operator Version: 23.9.1, gpu-operator-certified.v1.11.1
2. Issue or feature description
We have an OpenShift cluster with the NVIDIA GPU Operator installed. When we run any GPU pod on a g5.48xlarge node, it fails with the following error:
Allocate failed due to device plugin GetPreferredAllocation rpc failed with err: rpc error: code = Unknown desc = error getting list of preferred allocation devices: unable to retrieve list of available devices: error creating nvml.Device 0: nvml: Unknown Error, which is unexpected
The same pod works fine on other instance types such as g5.4xlarge and g5.12xlarge. This behaviour started only recently; earlier the same pod also ran on g5.48xlarge instances.
We also see that the pod from nvidia-dcgm-exporter is failing with the following errors:
(combined from similar events): Error: container create failed: time="2023-12-13T11:17:30Z" level=error msg="runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: nvidia-container-cli.real: detection error: nvml error: unknown error\n"
Error: container create failed: time="2023-12-13T10:29:12Z" level=error msg="runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: nvidia-container-cli.real: initialization error: driver rpc error: timed out\n"
3. Steps to reproduce the issue
Schedule any GPU pod on a g5.48xlarge node: the pod gets assigned to the node, but it fails to start with the error above. A minimal pod spec that reproduces this is shown below.
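For reference, the failure can be reproduced with a minimal test pod like the one below (the pod name and CUDA image are only illustrative; any container requesting nvidia.com/gpu hits the same allocation error on that node):

# Minimal GPU test pod (illustrative): requesting a single GPU exercises the
# device plugin allocation path that fails on g5.48xlarge.
cat <<'EOF' | oc apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test
spec:
  restartPolicy: Never
  containers:
  - name: cuda
    image: nvcr.io/nvidia/cuda:12.2.0-base-ubi8
    command: ["nvidia-smi"]
    resources:
      limits:
        nvidia.com/gpu: 1
EOF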
4. Information to attach (optional if deemed irrelevant)
- [ ] kubernetes pods status:
      kubectl get pods -n OPERATOR_NAMESPACE
- [ ] kubernetes daemonset status:
      kubectl get ds -n OPERATOR_NAMESPACE
- [ ] If a pod/ds is in an error state or pending state:
      kubectl describe pod -n OPERATOR_NAMESPACE POD_NAME
- [ ] If a pod/ds is in an error state or pending state:
      kubectl logs -n OPERATOR_NAMESPACE POD_NAME --all-containers
- [ ] Output from running nvidia-smi from the driver container:
      kubectl exec DRIVER_POD_NAME -n OPERATOR_NAMESPACE -c nvidia-driver-ctr -- nvidia-smi
- [ ] containerd logs:
      journalctl -u containerd > containerd.log
Logs from the nvidia-dcgm-exporter pod:
time="2023-12-13T10:25:04Z" level=info msg="Starting dcgm-exporter"
time="2023-12-13T10:25:04Z" level=info msg="Attemping to connect to remote hostengine at localhost:5555"
time="2023-12-13T10:25:04Z" level=info msg="DCGM successfully initialized!"
time="2023-12-13T10:25:05Z" level=info msg="Collecting DCP Metrics"
time="2023-12-13T10:25:05Z" level=info msg="No configmap data specified, falling back to metric file /etc/dcgm-exporter/dcgm-metrics.csv"
time="2023-12-13T10:25:05Z" level=info msg="Initializing system entities of type: GPU"
time="2023-12-13T10:25:30Z" level=fatal msg="Failed to watch metrics: Error watching fields: The third-party Profiling module returned an unrecoverable error"
Logs from GPU feature discovery pod:
I1213 10:24:42.754239 1 main.go:122] Starting OS watcher.
I1213 10:24:42.754459 1 main.go:127] Loading configuration.
I1213 10:24:42.754781 1 main.go:139]
Running with config:
{
"version": "v1",
"flags": {
"migStrategy": "single",
"failOnInitError": true,
"gdsEnabled": null,
"mofedEnabled": null,
"gfd": {
"oneshot": false,
"noTimestamp": false,
"sleepInterval": "1m0s",
"outputFile": "/etc/kubernetes/node-feature-discovery/features.d/gfd",
"machineTypeFile": "/sys/class/dmi/id/product_name"
}
},
"resources": {
"gpus": null
},
"sharing": {
"timeSlicing": {}
}
}
I1213 10:24:42.755227 1 factory.go:48] Detected NVML platform: found NVML library
I1213 10:24:42.755282 1 factory.go:48] Detected non-Tegra platform: /sys/devices/soc0/family file not found
I1213 10:24:42.755294 1 factory.go:64] Using NVML manager
I1213 10:24:42.755301 1 main.go:144] Start running
I1213 10:24:43.018503 1 main.go:187] Creating Labels
2023/12/13 10:24:43 Writing labels to output file /etc/kubernetes/node-feature-discovery/features.d/gfd
I1213 10:24:43.018687 1 main.go:197] Sleeping for 60000000000
I1213 10:29:12.978389 1 main.go:119] Exiting
E1213 10:29:12.978748 1 main.go:95] error creating NVML labeler: error creating mig capability labeler: error getting mig capability: error getting MIG mode: Unknown Error
GPU ClusterPolicy:
apiVersion: nvidia.com/v1
kind: ClusterPolicy
metadata:
creationTimestamp: '2023-10-02T12:22:11Z'
generation: 1
managedFields:
- apiVersion: nvidia.com/v1
fieldsType: FieldsV1
fieldsV1:
'f:spec':
'f:gds':
.: {}
'f:enabled': {}
'f:vgpuManager':
.: {}
'f:enabled': {}
'f:vfioManager':
.: {}
'f:enabled': {}
'f:daemonsets':
.: {}
'f:rollingUpdate':
.: {}
'f:maxUnavailable': {}
'f:updateStrategy': {}
'f:sandboxWorkloads':
.: {}
'f:defaultWorkload': {}
'f:enabled': {}
'f:nodeStatusExporter':
.: {}
'f:enabled': {}
'f:toolkit':
.: {}
'f:enabled': {}
'f:installDir': {}
'f:vgpuDeviceManager':
.: {}
'f:enabled': {}
.: {}
'f:gfd':
.: {}
'f:enabled': {}
'f:migManager':
.: {}
'f:enabled': {}
'f:mig':
.: {}
'f:strategy': {}
'f:operator':
.: {}
'f:defaultRuntime': {}
'f:initContainer': {}
'f:runtimeClass': {}
'f:use_ocp_driver_toolkit': {}
'f:dcgm':
.: {}
'f:enabled': {}
'f:dcgmExporter':
.: {}
'f:config':
.: {}
'f:name': {}
'f:enabled': {}
'f:serviceMonitor':
.: {}
'f:enabled': {}
'f:sandboxDevicePlugin':
.: {}
'f:enabled': {}
'f:driver':
.: {}
'f:certConfig':
.: {}
'f:name': {}
'f:enabled': {}
'f:kernelModuleConfig':
.: {}
'f:name': {}
'f:licensingConfig':
.: {}
'f:configMapName': {}
'f:nlsEnabled': {}
'f:repoConfig':
.: {}
'f:configMapName': {}
'f:upgradePolicy':
.: {}
'f:autoUpgrade': {}
'f:drain':
.: {}
'f:deleteEmptyDir': {}
'f:enable': {}
'f:force': {}
'f:timeoutSeconds': {}
'f:maxParallelUpgrades': {}
'f:maxUnavailable': {}
'f:podDeletion':
.: {}
'f:deleteEmptyDir': {}
'f:force': {}
'f:timeoutSeconds': {}
'f:waitForCompletion':
.: {}
'f:timeoutSeconds': {}
'f:virtualTopology':
.: {}
'f:config': {}
'f:devicePlugin':
.: {}
'f:config':
.: {}
'f:default': {}
'f:name': {}
'f:enabled': {}
'f:validator':
.: {}
'f:plugin':
.: {}
'f:env': {}
manager: Mozilla
operation: Update
time: '2023-10-02T12:22:11Z'
- apiVersion: nvidia.com/v1
fieldsType: FieldsV1
fieldsV1:
'f:status':
.: {}
'f:namespace': {}
manager: Go-http-client
operation: Update
subresource: status
time: '2023-12-11T14:01:59Z'
- apiVersion: nvidia.com/v1
fieldsType: FieldsV1
fieldsV1:
'f:status':
'f:conditions': {}
'f:state': {}
manager: gpu-operator
operation: Update
subresource: status
time: '2023-12-13T10:29:14Z'
name: gpu-cluster-policy
resourceVersion: '1243373036'
uid: 1e79d1d1-cfc8-493f-bad0-4a94fa0a2da7
spec:
vgpuDeviceManager:
enabled: true
migManager:
enabled: true
operator:
defaultRuntime: crio
initContainer: {}
runtimeClass: nvidia
use_ocp_driver_toolkit: true
dcgm:
enabled: true
gfd:
enabled: true
dcgmExporter:
config:
name: console-plugin-nvidia-gpu
enabled: true
serviceMonitor:
enabled: true
driver:
certConfig:
name: ''
enabled: true
kernelModuleConfig:
name: ''
licensingConfig:
configMapName: ''
nlsEnabled: false
repoConfig:
configMapName: ''
upgradePolicy:
autoUpgrade: true
drain:
deleteEmptyDir: false
enable: false
force: false
timeoutSeconds: 300
maxParallelUpgrades: 1
maxUnavailable: 25%
podDeletion:
deleteEmptyDir: false
force: false
timeoutSeconds: 300
waitForCompletion:
timeoutSeconds: 0
virtualTopology:
config: ''
devicePlugin:
config:
default: ''
name: ''
enabled: true
mig:
strategy: single
sandboxDevicePlugin:
enabled: true
validator:
plugin:
env:
- name: WITH_WORKLOAD
value: 'false'
nodeStatusExporter:
enabled: true
daemonsets:
rollingUpdate:
maxUnavailable: '1'
updateStrategy: RollingUpdate
sandboxWorkloads:
defaultWorkload: container
enabled: false
gds:
enabled: false
vgpuManager:
enabled: false
vfioManager:
enabled: true
toolkit:
enabled: true
installDir: /usr/local/nvidia
status:
conditions:
- lastTransitionTime: '2023-12-13T10:25:37Z'
message: ''
reason: Error
status: 'False'
type: Ready
- lastTransitionTime: '2023-12-13T10:25:37Z'
message: >-
ClusterPolicy is not ready, states not ready: [state-dcgm-exporter
gpu-feature-discovery]
reason: OperandNotReady
status: 'True'
type: Error
namespace: nvidia-gpu-operator
state: notReady
Collecting full debug bundle (optional):
curl -o must-gather.sh -L https://raw.githubusercontent.com/NVIDIA/gpu-operator/master/hack/must-gather.sh
chmod +x must-gather.sh
./must-gather.sh
NOTE: please refer to the must-gather script for debug data collected.
This bundle can be submitted to us via email: [email protected]
Hi @shivamerla, can you help here? Many thanks :)
@arpitsharma-vw can you check dmesg on the node and report any driver errors: dmesg | grep -i nvrm. If you see GSP RM related errors, please try this workaround to disable GSP RM.
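For reference, one way to run that check is through the driver container, reusing the DRIVER_POD_NAME placeholder from the template above (if dmesg is not readable from inside the container, run it directly on the node):

# Look for GSP-RM related errors reported by the NVIDIA kernel module (NVRM)
kubectl exec DRIVER_POD_NAME -n nvidia-gpu-operator -c nvidia-driver-ctr -- dmesg | grep -i nvrm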
@shivamerla What if the driver is already installed (as is the case with the EKS GPU AMI), will the driver component still try to apply the kernel module config?
Many thanks @shivamerla for your input. I can confirm that we see GSP RM related errors here. But regarding the fix: we have installed the GPU Operator via OLM (not Helm), and I am afraid that these changes will get wiped out again on the next upgrade.
Let me explain how this can be done on OpenShift:
First, create a ConfigMap as described in the doc for disabling GSP RM.
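The nvidia.conf referenced in the next command could look like this (a sketch, assuming the NVreg_EnableGpuFirmware module parameter documented by NVIDIA for disabling GSP firmware; please double-check against the docs for your driver branch):

# Kernel module options for the nvidia driver: disable GSP firmware (GSP RM)
cat > nvidia.conf <<'EOF'
options nvidia NVreg_EnableGpuFirmware=0
EOF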
oc create configmap kernel-module-params -n nvidia-gpu-operator --from-file=nvidia.conf=./nvidia.conf
Then add the following to the ClusterPolicy:
driver:
  <...>
  kernelModuleConfig:
    name: kernel-module-params
  <...>
You can do it either via the Web console, or using this command:
oc patch clusterpolicy/gpu-cluster-policy -n nvidia-gpu-operator --type='json' -p='[{"op": "add", "path": "/spec/driver/kernelModuleConfig/name", "value":"kernel-module-params"}]'
Essentially, the outcome should be the same, no matter if done via Helm or using the method I described. That is, the ClusterPolicy resource will have the right section added to it. The oc patch command above assumes that there is already a ClusterPolicy resource, but you can also add the required kernelModuleConfig section right away when creating the ClusterPolicy (via the Web console or from a file).
I believe that the changes will persist as they will be part of the ClusterPolicy. Also, the operator will probably restart the driver to pick up the changes. Please correct me if I'm wrong @shivamerla
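To double-check after the patch, something like the following should confirm that the ClusterPolicy references the ConfigMap and that the parameter was picked up once the driver pods restart (DRIVER_POD_NAME as in the template above; the /proc check assumes the driver exposes its parameters there):

# Confirm the ClusterPolicy now references the ConfigMap
oc get clusterpolicy gpu-cluster-policy -o jsonpath='{.spec.driver.kernelModuleConfig.name}'

# Confirm the module parameter took effect (expect "EnableGpuFirmware: 0")
oc exec -n nvidia-gpu-operator DRIVER_POD_NAME -c nvidia-driver-ctr -- grep EnableGpuFirmware /proc/driver/nvidia/params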
Same here. I'm using EKS 1.29 with the latest AMI that includes "a fix" (https://github.com/awslabs/amazon-eks-ami/issues/1494#issuecomment-1969724714), with GPU Operator v23.9.1.
Even the DCGM exporter fails to start, with this message:
Message: failed to create containerd task: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: Auto-detected mode as 'legacy'
nvidia-container-cli: detection error: nvml error: unknown error: unknown
After applying the suggested fix to disable GSP, I still see:
Warning UnexpectedAdmissionError 11s kubelet Allocate failed due to device plugin GetPreferredAllocation rpc failed with err: rpc error: code = Unknown desc = error getting list of preferred allocation devices: unable to retrieve list of available devices: error creating nvml.Device 0: nvml: Unknown Error, which is unexpected
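In case it helps narrow this down: since on EKS the driver is preinstalled with the AMI rather than deployed by the operator's driver container, it may be worth confirming on the node itself whether GSP is actually disabled after applying the fix (a sketch; the GSP fields only appear on recent driver versions):

# On the EKS GPU node: check whether the GSP-disable parameter is active
grep EnableGpuFirmware /proc/driver/nvidia/params

# Check whether nvidia-smi still reports a GSP firmware version ("N/A" when disabled)
nvidia-smi -q | grep -i 'GSP Firmware'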