Error code "CUDA driver version is insufficient for CUDA runtime version" in v22.9.0
The issue still reproduces in gpu-operator v22.9.0.
kubectl --kubeconfig -n gpu logs cuda-vectoradd
Failed to allocate device vector A (error code CUDA driver version is insufficient for CUDA runtime version)!
[Vector addition of 50000 elements]
Environment information
OS Version: Red Hat Enterprise Linux release 8.4
kernel: 4.18.0-305.el8.x86_64
K3S Version: v1.24.3+k3s1
GPU Operator Version: v22.9.0
CUDA Version: 11.7.1-base-ubi8
Driver Pre-installed: No
Driver Version: 515.65.01-rhel8.4
Container-Toolkit Pre-installed: No
Container-Toolkit Version: v1.11.0-ubi8
GPU Type: Tesla P100
cuda-sample: cuda-sample:vectoradd-cuda11.7.1-ubi8
config.toml content
cat /usr/local/nvidia/toolkit/.config/nvidia-container-runtime/config.toml
accept-nvidia-visible-devices-as-volume-mounts = false
accept-nvidia-visible-devices-envvar-when-unprivileged = true
disable-require = false
[nvidia-container-cli]
environment = []
ldconfig = "@/run/nvidia/driver/sbin/ldconfig"
load-kmods = true
path = "/usr/local/nvidia/toolkit/nvidia-container-cli"
root = "/run/nvidia/driver"
[nvidia-container-runtime]
log-level = "info"
mode = "auto"
runtimes = ["docker-runc", "runc"]
[nvidia-container-runtime.modes]
[nvidia-container-runtime.modes.csv]
mount-spec-path = "/etc/nvidia-container-runtime/host-files-for-container.d"
On the host, /etc/nvidia-container-runtime/host-files-for-container.d does not exist.
cuda-vectoradd pod yaml
cat << EOF | kubectl --kubeconfig /work/k3s.yaml create -n hsc-gpu -f -
apiVersion: v1
kind: Pod
metadata:
  name: cuda-vectoradd
spec:
  restartPolicy: OnFailure
  runtimeClassName: nvidia   # <-- added
  containers:
    - name: cuda-vectoradd
      image: "nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda11.7.1-ubi8"
      resources:
        limits:
          nvidia.com/gpu: 1
EOF
When I add runtimeClassName: nvidia to the Pod spec, it works.
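For context, the GPU operator should register a RuntimeClass named nvidia for the NVIDIA runtime; a quick way to confirm it exists is something like:
kubectl get runtimeclass nvidia -o yaml
If it ever had to be created by hand, a minimal sketch (assuming the containerd runtime handler is named nvidia) would be:
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: nvidia
handler: nvidia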
Related issue: https://github.com/NVIDIA/gpu-operator/issues/408
Does gpu-operator support running on a k3s cluster environment?
@shivamerla @cdesiniotis Could you please help me out? Thank you very much.
Images list
"nvcr.io/nvidia/cloud-native/gpu-operator-validator:v22.9.0",
"nvcr.io/nvidia/gpu-operator:v22.9.0",
"nvcr.io/nvidia/cuda:11.7.1-base-ubi8",
"nvcr.io/nvidia/driver:515.65.01-rhel8.4",
"nvcr.io/nvidia/cloud-native/k8s-driver-manager:v0.4.2",
"nvcr.io/nvidia/k8s/container-toolkit:v1.11.0-ubi8",
"nvcr.io/nvidia/k8s-device-plugin:v0.12.3-ubi8",
"nvcr.io/nvidia/cloud-native/dcgm:3.0.4-1-ubi8",
"nvcr.io/nvidia/k8s/dcgm-exporter:3.0.4-3.0.0-ubi8",
"nvcr.io/nvidia/gpu-feature-discovery:v0.6.2-ubi8",
"nvcr.io/nvidia/cloud-native/k8s-mig-manager:v0.5.0-ubi8",
"nvcr.io/nvidia/cloud-native/vgpu-device-manager:v0.2.0",
"nvcr.io/nvidia/kubevirt-gpu-device-plugin:v1.2.1",
"k8s.gcr.io/nfd/node-feature-discovery:v0.10.1"
@carlwang87 For running the cuda-vectorAdd sample, use this image: nvcr.io/nvidia/cloud-native/gpu-operator-validator:v22.9.0. This is the image we have fixed (and use) with the GPU Operator to run the sample. K3s needs a custom containerd config set with container-toolkit. Please find more details in this section for K3s: https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/getting-started.html#custom-configuration-for-runtime-containerd
I followed the toolkit configuration in the document, but it still fails.
cat /var/lib/rancher/k3s/agent/etc/containerd/config.toml
[plugins.cri.containerd.runtimes."nvidia"]
runtime_type = "io.containerd.runc.v2"
[plugins.cri.containerd.runtimes."nvidia".options]
BinaryName = "/usr/local/nvidia/toolkit/nvidia-container-runtime"
[plugins.cri.containerd.runtimes."nvidia-experimental"]
runtime_type = "io.containerd.runc.v2"
[plugins.cri.containerd.runtimes."nvidia-experimental".options]
BinaryName = "/usr/local/nvidia/toolkit/nvidia-container-runtime-experimental"
The content is not the same as in the documentation at https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/getting-started.html#custom-configuration-for-runtime-containerd, which shows:
version = 2
[plugins]
[plugins."io.containerd.grpc.v1.cri"]
[plugins."io.containerd.grpc.v1.cri".containerd]
default_runtime_name = "nvidia"
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes]
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia]
privileged_without_host_devices = false
runtime_engine = ""
runtime_root = ""
runtime_type = "io.containerd.runc.v2"
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia.options]
BinaryName = "/usr/bin/nvidia-container-runtime"
@shivamerla
With the k3s 1.24 I am using, config.toml is likewise not modified. However, when I change CONTAINERD_CONFIG to point at config.toml.tmpl, it works: that file is created and modified. But after restarting k3s, this template overwrites config.toml, so many of the k3s configurations that config.toml normally contains are lost.
As far as I know, config.toml is generated by k3s itself; to customize it, you need to modify config.toml.tmpl (see the k3s documentation on config.toml.tmpl). A rough sketch of this is shown below.
By the way, this k3s cluster is a single node.
Did I do something wrong?
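For reference, a rough sketch of the config.toml.tmpl approach (paths and the toolkit binary location follow this thread, the default_runtime table matches the one the toolkit writes, and it assumes that table is not already present in the copied config):
cp /var/lib/rancher/k3s/agent/etc/containerd/config.toml \
   /var/lib/rancher/k3s/agent/etc/containerd/config.toml.tmpl
cat >> /var/lib/rancher/k3s/agent/etc/containerd/config.toml.tmpl << 'EOF'
[plugins.cri.containerd.default_runtime]
  runtime_type = "io.containerd.runc.v2"
[plugins.cri.containerd.default_runtime.options]
  BinaryName = "/usr/local/nvidia/toolkit/nvidia-container-runtime"
EOF
systemctl restart k3s   # k3s regenerates config.toml from the .tmpl on startup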
@shivamerla
Without a pre-installed NVIDIA Container Toolkit or GPU driver, I followed the gpu-operator (v22.9.0) installation guide on k3s (v1.24.3+k3s1) and deployed the GPU operator successfully. But when I ran the samples from https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/getting-started.html#running-sample-gpu-applications, they failed. I have to add runtimeClassName: nvidia to the pod spec, so I wonder how these samples can run successfully without runtimeClassName: nvidia on a k3s cluster. Have you tested these samples on a k3s cluster?
@carlwang87 There seems to be a typo in the documentation: the boolean value for CONTAINERD_SET_AS_DEFAULT needs to be quoted here. Can you double-check whether this was done with your install?
helm install -n gpu-operator --create-namespace \
nvidia/gpu-operator $HELM_OPTIONS \
--set toolkit.env[0].name=CONTAINERD_CONFIG \
--set toolkit.env[0].value=/var/lib/rancher/k3s/agent/etc/containerd/config.toml \
--set toolkit.env[1].name=CONTAINERD_SOCKET \
--set toolkit.env[1].value=/run/k3s/containerd/containerd.sock \
--set toolkit.env[2].name=CONTAINERD_RUNTIME_CLASS \
--set toolkit.env[2].value=nvidia \
--set toolkit.env[3].name=CONTAINERD_SET_AS_DEFAULT \
--set-string toolkit.env[3].value="true"
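If helpful, one way to confirm these env settings actually reached the toolkit daemonset after install is something like (daemonset name as it appears later in this thread):
kubectl -n gpu-operator get daemonset nvidia-container-toolkit-daemonset \
  -o jsonpath='{.spec.template.spec.containers[0].env}'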
@shivamerla I set it through values.yaml as below:
toolkit:
  enabled: true
  repository: nvcr.io/nvidia/k8s
  image: container-toolkit
  version: v1.11.0-ubi8
  env:
    - name: CONTAINERD_CONFIG
      value: /var/lib/rancher/k3s/agent/etc/containerd/config.toml
    - name: CONTAINERD_SOCKET
      value: /run/k3s/containerd/containerd.sock
    - name: CONTAINERD_RUNTIME_CLASS
      value: nvidia
    - name: CONTAINERD_SET_AS_DEFAULT
      value: "true"
The same issue is discussed here: https://github.com/k3s-io/k3s/issues/4391
Not sure why you still had to specify runtimeClassName in the sample pod. With CONTAINERD_SET_AS_DEFAULT enabled, we set default_runtime_name=nvidia in /var/lib/rancher/k3s/agent/etc/containerd/config.toml. Can you confirm that is set in config.toml?
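A quick check could be something like:
grep -n default_runtime_name /var/lib/rancher/k3s/agent/etc/containerd/config.toml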
@shivamerla First of all, thank you for still helping me.
With CONTAINERD_SET_AS_DEFAULT enabled, we set default_runtime_name=nvidia in /var/lib/rancher/k3s/agent/etc/containerd/config.toml
Is default_runtime_name=nvidia in /var/lib/rancher/k3s/agent/etc/containerd/config.toml set manually or by the GPU operator? I find it is not set by the GPU operator. Do you mean we need to set it manually in config.toml?
@carlwang87 It is set by the container-toolkit component deployed with gpu-operator. Can you describe the container-toolkit pod to confirm these env vars are applied correctly? Also, please share logs from that pod.
@shivamerla logs from pod nvidia-container-toolkit-daemonset-cwp28:
env:

logs:
time="2022-10-20T12:36:25Z" level=info msg="Starting nvidia-toolkit"
time="2022-10-20T12:36:25Z" level=info msg="Parsing arguments"
time="2022-10-20T12:36:25Z" level=info msg="Verifying Flags"
time="2022-10-20T12:36:25Z" level=info msg=Initializing
time="2022-10-20T12:36:25Z" level=info msg="Installing toolkit"
time="2022-10-20T12:36:25Z" level=info msg="Installing NVIDIA container toolkit to '/usr/local/nvidia/toolkit'"
time="2022-10-20T12:36:25Z" level=info msg="Removing existing NVIDIA container toolkit installation"
time="2022-10-20T12:36:25Z" level=info msg="Creating directory '/usr/local/nvidia/toolkit'"
time="2022-10-20T12:36:25Z" level=info msg="Creating directory '/usr/local/nvidia/toolkit/.config/nvidia-container-runtime'"
time="2022-10-20T12:36:25Z" level=info msg="Installing NVIDIA container library to '/usr/local/nvidia/toolkit'"
time="2022-10-20T12:36:25Z" level=info msg="Finding library libnvidia-container.so.1 (root=)"
time="2022-10-20T12:36:25Z" level=info msg="Checking library candidate '/usr/lib64/libnvidia-container.so.1'"
time="2022-10-20T12:36:25Z" level=info msg="Resolved link: '/usr/lib64/libnvidia-container.so.1' => '/usr/lib64/libnvidia-container.so.1.11.0'"
time="2022-10-20T12:36:25Z" level=info msg="Installing '/usr/lib64/libnvidia-container.so.1.11.0' to '/usr/local/nvidia/toolkit/libnvidia-container.so.1.11.0'"
time="2022-10-20T12:36:25Z" level=info msg="Installed '/usr/lib64/libnvidia-container.so.1.11.0' to '/usr/local/nvidia/toolkit/libnvidia-container.so.1.11.0'"
time="2022-10-20T12:36:25Z" level=info msg="Creating symlink '/usr/local/nvidia/toolkit/libnvidia-container.so.1' -> 'libnvidia-container.so.1.11.0'"
time="2022-10-20T12:36:25Z" level=info msg="Finding library libnvidia-container-go.so.1 (root=)"
time="2022-10-20T12:36:25Z" level=info msg="Checking library candidate '/usr/lib64/libnvidia-container-go.so.1'"
time="2022-10-20T12:36:25Z" level=info msg="Resolved link: '/usr/lib64/libnvidia-container-go.so.1' => '/usr/lib64/libnvidia-container-go.so.1.11.0'"
time="2022-10-20T12:36:25Z" level=info msg="Installing '/usr/lib64/libnvidia-container-go.so.1.11.0' to '/usr/local/nvidia/toolkit/libnvidia-container-go.so.1.11.0'"
time="2022-10-20T12:36:25Z" level=info msg="Installed '/usr/lib64/libnvidia-container-go.so.1.11.0' to '/usr/local/nvidia/toolkit/libnvidia-container-go.so.1.11.0'"
time="2022-10-20T12:36:25Z" level=info msg="Creating symlink '/usr/local/nvidia/toolkit/libnvidia-container-go.so.1' -> 'libnvidia-container-go.so.1.11.0'"
time="2022-10-20T12:36:25Z" level=info msg="Installing executable '/usr/bin/nvidia-container-runtime' to /usr/local/nvidia/toolkit"
time="2022-10-20T12:36:25Z" level=info msg="Installing '/usr/bin/nvidia-container-runtime' to '/usr/local/nvidia/toolkit/nvidia-container-runtime.real'"
time="2022-10-20T12:36:25Z" level=info msg="Installed '/usr/local/nvidia/toolkit/nvidia-container-runtime.real'"
time="2022-10-20T12:36:25Z" level=info msg="Installed wrapper '/usr/local/nvidia/toolkit/nvidia-container-runtime'"
time="2022-10-20T12:36:25Z" level=info msg="Finding library libnvidia-ml.so (root=/run/nvidia/driver)"
time="2022-10-20T12:36:25Z" level=info msg="Checking library candidate '/run/nvidia/driver/usr/lib64/libnvidia-ml.so'"
time="2022-10-20T12:36:25Z" level=info msg="Resolved link: '/run/nvidia/driver/usr/lib64/libnvidia-ml.so' => '/run/nvidia/driver/usr/lib64/libnvidia-ml.so.515.65.01'"
time="2022-10-20T12:36:25Z" level=info msg="Using library root /run/nvidia/driver/usr/lib64"
time="2022-10-20T12:36:25Z" level=info msg="Installing executable 'nvidia-container-runtime.experimental' to /usr/local/nvidia/toolkit"
time="2022-10-20T12:36:25Z" level=info msg="Installing 'nvidia-container-runtime.experimental' to '/usr/local/nvidia/toolkit/nvidia-container-runtime.experimental'"
time="2022-10-20T12:36:25Z" level=info msg="Installed '/usr/local/nvidia/toolkit/nvidia-container-runtime.experimental'"
time="2022-10-20T12:36:25Z" level=info msg="Installed wrapper '/usr/local/nvidia/toolkit/nvidia-container-runtime-experimental'"
time="2022-10-20T12:36:25Z" level=info msg="Installing NVIDIA container CLI from '/usr/bin/nvidia-container-cli'"
time="2022-10-20T12:36:25Z" level=info msg="Installing executable '/usr/bin/nvidia-container-cli' to /usr/local/nvidia/toolkit"
time="2022-10-20T12:36:25Z" level=info msg="Installing '/usr/bin/nvidia-container-cli' to '/usr/local/nvidia/toolkit/nvidia-container-cli.real'"
time="2022-10-20T12:36:25Z" level=info msg="Installed '/usr/local/nvidia/toolkit/nvidia-container-cli.real'"
time="2022-10-20T12:36:25Z" level=info msg="Installed wrapper '/usr/local/nvidia/toolkit/nvidia-container-cli'"
time="2022-10-20T12:36:25Z" level=info msg="Installing NVIDIA container runtime hook from '/usr/bin/nvidia-container-runtime-hook'"
time="2022-10-20T12:36:25Z" level=info msg="Installing executable '/usr/bin/nvidia-container-runtime-hook' to /usr/local/nvidia/toolkit"
time="2022-10-20T12:36:25Z" level=info msg="Installing '/usr/bin/nvidia-container-runtime-hook' to '/usr/local/nvidia/toolkit/nvidia-container-runtime-hook.real'"
time="2022-10-20T12:36:25Z" level=info msg="Installed '/usr/local/nvidia/toolkit/nvidia-container-runtime-hook.real'"
time="2022-10-20T12:36:25Z" level=info msg="Installed wrapper '/usr/local/nvidia/toolkit/nvidia-container-runtime-hook'"
time="2022-10-20T12:36:25Z" level=info msg="Creating symlink '/usr/local/nvidia/toolkit/nvidia-container-toolkit' -> 'nvidia-container-runtime-hook'"
time="2022-10-20T12:36:25Z" level=info msg="Installing NVIDIA container toolkit config '/usr/local/nvidia/toolkit/.config/nvidia-container-runtime/config.toml'"
Using config:
accept-nvidia-visible-devices-as-volume-mounts = false
accept-nvidia-visible-devices-envvar-when-unprivileged = true
disable-require = false
[nvidia-container-cli]
environment = []
ldconfig = "@/run/nvidia/driver/sbin/ldconfig"
load-kmods = true
path = "/usr/local/nvidia/toolkit/nvidia-container-cli"
root = "/run/nvidia/driver"
[nvidia-container-runtime]
log-level = "info"
mode = "auto"
runtimes = ["docker-runc", "runc"]
[nvidia-container-runtime.modes]
[nvidia-container-runtime.modes.csv]
mount-spec-path = "/etc/nvidia-container-runtime/host-files-for-container.d"
time="2022-10-20T12:36:25Z" level=info msg="Setting up runtime"
time="2022-10-20T12:36:25Z" level=info msg="Starting 'setup' for containerd"
time="2022-10-20T12:36:25Z" level=info msg="Parsing arguments: [/usr/local/nvidia/toolkit]"
time="2022-10-20T12:36:25Z" level=info msg="Successfully parsed arguments"
time="2022-10-20T12:36:25Z" level=info msg="Loading config: /runtime/config-dir/config.toml"
time="2022-10-20T12:36:25Z" level=info msg="Successfully loaded config"
time="2022-10-20T12:36:25Z" level=info msg="Config version: 1"
time="2022-10-20T12:36:25Z" level=warning msg="Support for containerd config version 1 is deprecated"
time="2022-10-20T12:36:25Z" level=info msg="Updating config"
time="2022-10-20T12:36:25Z" level=info msg="Successfully updated config"
time="2022-10-20T12:36:25Z" level=info msg="Flushing config"
time="2022-10-20T12:36:25Z" level=info msg="Successfully flushed config"
time="2022-10-20T12:36:25Z" level=info msg="Sending SIGHUP signal to containerd"
time="2022-10-20T12:36:25Z" level=info msg="Successfully signaled containerd"
time="2022-10-20T12:36:25Z" level=info msg="Completed 'setup' for containerd"
time="2022-10-20T12:36:25Z" level=info msg="Waiting for signal"
From the logs, I see a warning: "Support for containerd config version 1 is deprecated".
logs from nfd worker:

What version of k3s do you use when testing this with the GPU operator? The 1.24 version we use has this problem. In addition, when I do not set CONTAINERD_CONFIG, /etc/containerd/config.toml is changed by default. That change is correct, but it does not meet our expectation, because the k3s containerd configuration file is /var/lib/rancher/k3s/agent/etc/containerd/config.toml.
@shivamerla
I have set up a K3s cluster environment with a GPU. The issue reproduces on this cluster: there is no default_runtime_name=nvidia in /var/lib/rancher/k3s/agent/etc/containerd/config.toml.
K3S Version: v1.25.3+k3s1
GPU Operator Version: v22.9.0
So, could you please help me find the root cause in this environment?
ssh: 104.207.150.41
user: root
password: 8gF?cnkLnnh9gLh]
GPU Pod:
kubectl --kubeconfig /etc/rancher/k3s/k3s.yaml -n gpu-operator get pod

You can do anything in this environment. Please help me, thank you very much.
@carlwang87 It looks like the containerd config v1 format is used here. We need to pass the env CONTAINERD_USE_LEGACY_CONFIG as "true"; by default, container-toolkit assumes the v2 format. After this setting, I see that the default runtime is set to nvidia.
[root@vultr ~]# cat /var/lib/rancher/k3s/agent/etc/containerd/config.toml
version = 1
[plugins]
[plugins.cri]
enable_selinux = false
enable_unprivileged_icmp = false
enable_unprivileged_ports = false
sandbox_image = "rancher/mirrored-pause:3.6"
stream_server_address = "127.0.0.1"
stream_server_port = "10010"
[plugins.cri.cni]
bin_dir = "/var/lib/rancher/k3s/data/2ef87ff954adbb390309ce4dc07500f29c319f84feec1719bfb5059c8808ec6a/bin"
conf_dir = "/var/lib/rancher/k3s/agent/etc/cni/net.d"
[plugins.cri.containerd]
disable_snapshot_annotations = true
snapshotter = "overlayfs"
[plugins.cri.containerd.default_runtime]
privileged_without_host_devices = false
runtime_engine = ""
runtime_root = ""
runtime_type = "io.containerd.runc.v2"
[plugins.cri.containerd.default_runtime.options]
BinaryName = "/usr/local/nvidia/toolkit/nvidia-container-runtime"
Runtime = "/usr/local/nvidia/toolkit/nvidia-container-runtime"
[plugins.cri.containerd.runtimes]
[plugins.cri.containerd.runtimes.nvidia]
runtime_type = "io.containerd.runc.v2"
[plugins.cri.containerd.runtimes.nvidia.options]
BinaryName = "/usr/local/nvidia/toolkit/nvidia-container-runtime"
Runtime = "/usr/local/nvidia/toolkit/nvidia-container-runtime"
SystemdCgroup = false
[plugins.cri.containerd.runtimes.nvidia-experimental]
runtime_type = "io.containerd.runc.v2"
[plugins.cri.containerd.runtimes.nvidia-experimental.options]
BinaryName = "/usr/local/nvidia/toolkit/nvidia-container-runtime-experimental"
Runtime = "/usr/local/nvidia/toolkit/nvidia-container-runtime-experimental"
SystemdCgroup = false
[plugins.cri.containerd.runtimes.runc]
runtime_type = "io.containerd.runc.v2"
[plugins.cri.containerd.runtimes.runc.options]
SystemdCgroup = false
[plugins.opt]
path = "/var/lib/rancher/k3s/agent/containerd"
I had the same error with the GPU Operator example, but if I try the following example everything works fine:
apiVersion: v1
kind: Pod
metadata:
  name: cuda-vector-add
spec:
  runtimeClassName: nvidia
  restartPolicy: OnFailure
  containers:
    - name: cuda-vector-add
      env:
        - name: NVIDIA_VISIBLE_DEVICES
          value: all
        - name: NVIDIA_DRIVER_CAPABILITIES
          value: compute,utility
      # https://github.com/kubernetes/kubernetes/blob/v1.7.11/test/images/nvidia-cuda/Dockerfile
      image: "k8s.gcr.io/cuda-vector-add:v0.1"
      resources:
        limits:
          nvidia.com/gpu: 1
With k3s you need to update the config.toml.tmpl with the default runtime, not config.toml directly.
I had the same error with the GPU Operator example, but with the cuda-vector-add pod spec quoted above everything works fine.
Thank you, I was missing runtimeClassName: nvidia in my case.
@shivamerla Are these env vars required when using k3s? I was having the same issue (k3s v1.27.2+k3s1) until I added them:
env:
  - name: NVIDIA_VISIBLE_DEVICES
    value: all
  - name: NVIDIA_DRIVER_CAPABILITIES
    value: all
Same issue with gpu-operator v23.3.2