
BREAKS ON 1.25: Does not work on k8s 1.25 due to node API deprecation

Open sfxworks opened this issue 3 years ago • 16 comments

1. Quick Debug Checklist

  • [ ] Are you running on an Ubuntu 18.04 node?
  • [x] Are you running Kubernetes v1.13+?
  • [x] Are you running Docker (>= 18.06) or CRIO (>= 1.13+)?
  • [ ] Do you have i2c_core and ipmi_msghandler loaded on the nodes?
  • [x] Did you apply the CRD (kubectl describe clusterpolicies --all-namespaces)

2. Issue or feature description

Deployed with Helm, the operator references a deprecated API version, which blocks deployment.

As noted in https://kubernetes.io/docs/reference/using-api/deprecation-guide/#runtimeclass-v125, the v1beta1 APIs in the node.k8s.io group are no longer served in v1.25; only v1 remains:

kubectl get node home-2cf05d8a44a0 -o yaml | head -2                                                                                                                                                                                                                                             
apiVersion: v1
kind: Node
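
For reference, a quick way to confirm what the cluster actually serves for this API group (on 1.25 it should only list node.k8s.io/v1):

kubectl api-versions | grep node.k8s.io
kubectl api-resources --api-group=node.k8s.io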

The operator cannot reconcile, and deployment of any pod requesting a GPU fails as a result.

1.6704367753266153e+09  INFO    controllers.ClusterPolicy       Checking GPU state labels on the node   {"NodeName": "home-2cf05d8a44a0"}
1.6704367753266478e+09  INFO    controllers.ClusterPolicy        -      {"Label=": "nvidia.com/gpu.deploy.node-status-exporter", " value=": "true"}
1.670436775326656e+09   INFO    controllers.ClusterPolicy        -      {"Label=": "nvidia.com/gpu.deploy.operator-validator", " value=": "true"}
1.6704367753266625e+09  INFO    controllers.ClusterPolicy        -      {"Label=": "nvidia.com/gpu.deploy.driver", " value=": "true"}
1.6704367753266687e+09  INFO    controllers.ClusterPolicy        -      {"Label=": "nvidia.com/gpu.deploy.gpu-feature-discovery", " value=": "true"}
1.6704367753266747e+09  INFO    controllers.ClusterPolicy        -      {"Label=": "nvidia.com/gpu.deploy.container-toolkit", " value=": "true"}
1.670436775326681e+09   INFO    controllers.ClusterPolicy        -      {"Label=": "nvidia.com/gpu.deploy.device-plugin", " value=": "true"}
1.6704367753266864e+09  INFO    controllers.ClusterPolicy        -      {"Label=": "nvidia.com/gpu.deploy.dcgm", " value=": "true"}
1.6704367753266923e+09  INFO    controllers.ClusterPolicy        -      {"Label=": "nvidia.com/gpu.deploy.dcgm-exporter", " value=": "true"}
1.67043677532671e+09    INFO    controllers.ClusterPolicy       Number of nodes with GPU label  {"NodeCount": 1}
1.6704367753267498e+09  INFO    controllers.ClusterPolicy       Using container runtime: crio
1.6704367755844975e+09  ERROR   controller.clusterpolicy-controller     Reconciler error        {"name": "cluster-policy", "namespace": "", "error": "no matches for kind \"RuntimeClass\" in version \"node.k8s.io/v1beta1\""}
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
        /workspace/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:266
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2
        /workspace/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:227
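
For context, the object it fails to reconcile is a RuntimeClass; under the API that 1.25 still serves it would look roughly like this (a sketch; the nvidia name and handler are assumed from the default container-toolkit setup):

apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: nvidia
handler: nvidia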

3. Steps to reproduce the issue

  1. Run Kubernetes 1.25
  2. Deploy the operator with Helm (a minimal install is sketched below)
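
A minimal install sketch, assuming the chart from the official NVIDIA Helm repository rather than the in-tree chart:

helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update
helm install --wait --generate-name \
    -n gpu-operator --create-namespace \
    nvidia/gpu-operator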

sfxworks avatar Dec 07 '22 18:12 sfxworks

According to https://github.com/NVIDIA/gpu-operator/issues/401#issuecomment-1245932303 this change has already been applied, but the Helm chart may not reference the latest image by default.

sfxworks avatar Dec 07 '22 18:12 sfxworks

@sfxworks what version of GPU Operator are you using? We migrated to node.k8s.io/v1 in v22.9.0

cdesiniotis avatar Dec 07 '22 20:12 cdesiniotis

devel-ubi8 according to https://github.com/NVIDIA/gpu-operator/blob/master/deployments/gpu-operator/values.yaml#L50
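
If the chart default is still devel-ubi8, pinning the operator image to a released tag at install time should be possible with something like the following (operator.version is assumed to be the relevant values key, per the linked values.yaml line):

helm upgrade --install gpu-operator nvidia/gpu-operator \
    -n gpu-operator \
    --set operator.version=v22.9.0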

sfxworks avatar Dec 07 '22 22:12 sfxworks

nvidia-driver-daemonset-ttzrt 0/1 Init:0/1 0 22s 10.0.7.146 home-2cf05d8a44a0 <none> <none>

The tag you linked worked.

However, other images are now failing with their default tags:

  Normal   Pulling    70s (x4 over 2m39s)  kubelet            Pulling image "nvcr.io/nvidia/driver:525.60.13-"
  Warning  Failed     68s (x4 over 2m37s)  kubelet            Failed to pull image "nvcr.io/nvidia/driver:525.60.13-": rpc error: code = Unknown desc = reading manifest 525.60.13- in nvcr.io/nvidia/driver: manifest unknown: manifest unknown

Is there a publicly viewable way to list your registry's tags to resolve this faster? Requests to browse the registry just time out for me.
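
(One possible avenue, assuming skopeo is installed and the public repos allow anonymous tag listing, which I have not confirmed:)

skopeo list-tags docker://nvcr.io/nvidia/gpu-operator
skopeo list-tags docker://nvcr.io/nvidia/driver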

sfxworks avatar Dec 07 '22 23:12 sfxworks

Changing the driver version to latest in the Helm chart still appends a -, leading to an invalid image reference: nvcr.io/nvidia/driver:latest-

      containers:
      - args:
        - init
        command:
        - nvidia-driver
        image: nvcr.io/nvidia/driver:latest-
        imagePullPolicy: IfNotPresent
        name: nvidia-driver-ctr
        resources: {}
        securityContext:
          privileged: true
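
Pinning an explicit driver version in the chart values, rather than latest, at least produces a sane base tag (a sketch; keys per the chart's driver section, and the operator presumably still appends an OS suffix derived from NFD labels):

driver:
  repository: nvcr.io/nvidia
  image: driver
  version: "525.60.13"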

sfxworks avatar Dec 07 '22 23:12 sfxworks

It doesn't like my kernel anyway, I guess :/

Defaulted container "nvidia-driver-ctr" out of: nvidia-driver-ctr, k8s-driver-manager (init)

========== NVIDIA Software Installer ==========

Starting installation of NVIDIA driver version 450.80.02 for Linux kernel version 6.0.11-hardened1-1-hardened

Stopping NVIDIA persistence daemon...
Unloading NVIDIA driver kernel modules...
Unmounting NVIDIA driver rootfs...
Checking NVIDIA driver packages...
Updating the package cache...
Resolving Linux kernel version...
Could not resolve Linux kernel version
Stopping NVIDIA persistence daemon...
Unloading NVIDIA driver kernel modules...
Unmounting NVIDIA driver rootfs...

sfxworks avatar Dec 07 '22 23:12 sfxworks

Switching the machine from the linux-hardened kernel to the standard linux kernel, with the above adjustments, seems to have worked. Between then and now I did not have to adjust the daemonset either.
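
For anyone trying the same, the kernel switch on an Arch-based node is roughly (a sketch; assumes GRUB, adjust for your bootloader):

pacman -S linux linux-headers         # install the standard kernel and matching headers
grub-mkconfig -o /boot/grub/grub.cfg  # regenerate boot entries
reboot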

    nvidia.com/gpu.compute.major: "7"
    nvidia.com/gpu.compute.minor: "5"
    nvidia.com/gpu.count: "1"
    nvidia.com/gpu.deploy.container-toolkit: "true"
    nvidia.com/gpu.deploy.dcgm: "true"
    nvidia.com/gpu.deploy.dcgm-exporter: "true"
  Resource           Requests       Limits
  --------           --------       ------
  cpu                3300m (30%)    3500m (31%)
  memory             12488Mi (19%)  12638Mi (19%)
  ephemeral-storage  0 (0%)         0 (0%)
  hugepages-1Gi      0 (0%)         0 (0%)
  hugepages-2Mi      0 (0%)         0 (0%)
  nvidia.com/gpu     0              0

sfxworks avatar Dec 08 '22 14:12 sfxworks

@sfxworks for installing the latest helm charts, please refer to: https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/getting-started.html#install-nvidia-gpu-operator.

We append a -<os> suffix (e.g. -ubuntu20.04) to the driver image tag to match the OS of your worker nodes. We depend on labels from NFD (feature.node.kubernetes.io/system-os_release.ID and feature.node.kubernetes.io/system-os_release.VERSION_ID) for this information. If only a bare - was appended, it's possible these labels were missing.

Concerning the kernel version: the driver container requires several kernel packages (e.g. kernel-devel). From your logs, it appears it could not find these packages for 6.0.11-hardened1-1-hardened. A workaround is to pass a custom repository file to the driver pod so it can find packages for that particular kernel. The following page has details on how to do this: https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/appendix.html#local-package-repository
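
You can check whether those NFD labels are present on a node with something like:

kubectl get node <node-name> --show-labels | tr ',' '\n' | grep system-os_release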

cdesiniotis avatar Dec 08 '22 16:12 cdesiniotis

I have feature.node.kubernetes.io/system-os_release.ID: arch, though I do not have feature.node.kubernetes.io/system-os_release.VERSION_ID on any nodes (some Manjaro based, some Arch based). I cannot remember how I had this working before...
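
NFD derives these labels from /etc/os-release, and a rolling release usually has no VERSION_ID entry at all, which would explain the missing label. Checking directly on the node:

grep -E '^(ID|VERSION_ID)=' /etc/os-release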

sfxworks avatar Mar 28 '23 07:03 sfxworks

I just installed GPU Operator with Helm, chart version v23.3.1. This version uses the nvcr.io/nvidia/gpu-operator:devel-ubi8 image, which hits exactly this error:

1.6704367755844975e+09  ERROR   controller.clusterpolicy-controller     Reconciler error        {"name": "cluster-policy", "namespace": "", "error": "no matches for kind \"RuntimeClass\" in version \"node.k8s.io/v1beta1\""}

When I change GPU Operator to version v22.9.2, it uses the nvcr.io/nvidia/gpu-operator:v22.9.0 image and the error disappears. Can you please check this again, @cdesiniotis?

DatCanCode avatar May 12 '23 04:05 DatCanCode

I'm also running into this issue ... on both release-23.03 and master branches. microk8s v1.27.2 on Ubuntu 22.04.2 LTS.

berlincount avatar Jul 30 '23 10:07 berlincount

> I'm also running into this issue ... on both release-23.03 and master branches. microk8s v1.27.2 on Ubuntu 22.04.2 LTS.

Same error for us with EKS 1.27 and Ubuntu 22

acesir avatar Aug 01 '23 11:08 acesir

Same error with release 23.3.1. Any solution ...?

shnigam2 avatar Aug 31 '23 17:08 shnigam2

Also running into this on Amazon Linux 2. Any known solution or workaround, or something missing in the docs? Trying to override the API version or look at the daemonset values next.

release v24.6.1 - nvcr.io/nvidia/gpu-operator:devel-ubi8

ERROR   controller.clusterpolicy-controller     Reconciler error        {"name": "cluster-policy", "namespace": "", "error": "no matches for kind \"RuntimeClass\" in version \"node.k8s.io/v1beta1\""}

kubectl describe node GPU-NODE | grep system
  feature.node.kubernetes.io/system-os_release.ID=amzn
  feature.node.kubernetes.io/system-os_release.VERSION_ID=2
  feature.node.kubernetes.io/system-os_release.VERSION_ID.major=2

node: yum list installed | grep kernel
  kernel.x86_64           5.10.220-209.869.amzn2   @amzn2extra-kernel-5.10
  kernel-devel.x86_64     5.10.220-209.869.amzn2   @amzn2extra-kernel-5.10
  kernel-headers.x86_64   5.10.220-209.869.amzn2   @amzn2extra-kernel-5.10

robjcook avatar Aug 13 '24 00:08 robjcook

@robjcook it looks like you have deployed the local helm chart that is checked into the gpu-operator main branch. We don't recommend using that helm chart.

Please use the helm chart from the official helm repo as instructed here

tariq1890 avatar Aug 14 '24 00:08 tariq1890

A few node toleration tweaks got me past that, and I switched to the Helm chart from the official Helm repo.

I'm running into an issue now where the operator seems to be looking for an image that does not exist and fails to pull:

ImagePullBackOff (Back-off pulling image "nvcr.io/nvidia/driver:550.90.07-amzn2")

Which image do you recommend for an Amazon Linux 2 node, and where do I specify it instead of letting the operator dynamically infer it from the node?
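
One option, assuming the NVIDIA driver is already installed on the AMI (e.g. the EKS GPU-optimized AMI), might be to disable the driver container entirely and let the operator use the preinstalled driver; driver.enabled appears to be the chart value that controls this:

helm upgrade <release-name> nvidia/gpu-operator \
    -n gpu-operator \
    --reuse-values \
    --set driver.enabled=false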

edit: after digging through the documentation, I'm looking into this now:

https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/precompiled-drivers.html

robjcook avatar Oct 11 '24 21:10 robjcook

This issue is stale because it has been open 90 days with no activity. This issue will be closed in 30 days unless new comments are made or the stale label is removed. To skip these checks, apply the "lifecycle/frozen" label.

github-actions[bot] avatar Nov 05 '25 00:11 github-actions[bot]

Closing this stale issue. Please try out the latest release of GPU Operator and create a new issue if necessary.

rajathagasthya avatar Nov 12 '25 23:11 rajathagasthya