
After the GPU node is restarted, an error occurs when the nvidia-driver-daemonset pod is started in the offline environment

Open sunwuyan opened this issue 1 year ago • 4 comments

After successfully setting up GPUs with gpu-operator, is it possible to avoid reinstalling the driver when a GPU node restarts? My K8s cluster normally has no access to the public network, but every time the nvidia-driver-daemonset pod restarts it needs network access to finish starting. Otherwise it fails with the following error:

========== NVIDIA Software Installer ==========

Starting installation of NVIDIA driver version 550.54.14 for Linux kernel version 5.15.0-67-generic

Stopping NVIDIA persistence daemon...
Unloading NVIDIA driver kernel modules...
Unmounting NVIDIA driver rootfs...
Checking NVIDIA driver packages...
Updating the package cache...
E: The repository 'http://archive.ubuntu.com/ubuntu focal InRelease' is not signed.
E: Failed to fetch http://archive.ubuntu.com/ubuntu/dists/focal/InRelease Clearsigned file isn't valid, got 'NOSPLIT' (does the network require authentication?)
E: Failed to fetch http://archive.ubuntu.com/ubuntu/dists/focal-updates/InRelease Clearsigned file isn't valid, got 'NOSPLIT' (does the network require authentication?)
E: The repository 'http://archive.ubuntu.com/ubuntu focal-updates InRelease' is not signed.
E: Failed to fetch http://archive.ubuntu.com/ubuntu/dists/focal-security/InRelease Clearsigned file isn't valid, got 'NOSPLIT' (does the network require authentication?)
E: The repository 'http://archive.ubuntu.com/ubuntu focal-security InRelease' is not signed.
Stopping NVIDIA persistence daemon...
Unloading NVIDIA driver kernel modules...
Unmounting NVIDIA driver rootfs...

I also tried setting driver.upgradePolicy.autoUpgrade to false, but that did not help either.

sunwuyan avatar Apr 23 '24 06:04 sunwuyan
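
Editorial note: the installer log in the report stops at the "Updating the package cache..." step, where apt cannot fetch the Ubuntu repository indexes. As a minimal sketch, assuming the driver container log has been saved as text, the failing repository URLs could be pulled out like this. The function name unreachable_repos and the regular expression are illustrative and are not part of gpu-operator.

```python
import re
from typing import List

# Matches apt error lines of the kind shown in the report, e.g.
# "E: Failed to fetch http://archive.ubuntu.com/ubuntu/dists/focal/InRelease ..."
FETCH_ERROR = re.compile(r"E: Failed to fetch (\S+)")


def unreachable_repos(log_text: str) -> List[str]:
    """Return the distinct URLs that apt failed to fetch, in first-seen order."""
    seen: List[str] = []
    for url in FETCH_ERROR.findall(log_text):
        if url not in seen:
            seen.append(url)
    return seen


if __name__ == "__main__":
    # Shortened excerpt of the log from the report.
    sample = (
        "Updating the package cache... "
        "E: Failed to fetch http://archive.ubuntu.com/ubuntu/dists/focal/InRelease "
        "Clearsigned file isn't valid, got 'NOSPLIT' "
        "E: Failed to fetch http://archive.ubuntu.com/ubuntu/dists/focal-updates/InRelease "
        "Clearsigned file isn't valid, got 'NOSPLIT'"
    )
    for url in unreachable_repos(sample):
        print(url)
```

A non-empty result suggests the pod failed because of missing network access to the package repositories rather than a problem with the driver itself.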

@sunwuyan the driver will always be reinstalled after a reboot; this is a current limitation. Please see this comment: https://github.com/NVIDIA/gpu-operator/issues/705#issuecomment-2077761858

cdesiniotis avatar Apr 26 '24 23:04 cdesiniotis

Thanks. I looked at the code, and it seems that if the driver.usePrecompiled property is set to true, the package cache update over the network should not be repeated. I haven't tried it yet, though; my operating system is Ubuntu 20.04.

sunwuyan avatar Apr 28 '24 04:04 sunwuyan

Correct. If precompiled drivers are used, then we do not need network connectivity to update the package cache.

However, we do not have precompiled driver images published for Ubuntu 20.04. We only have tags for Ubuntu 22.04, see https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/precompiled-drivers.html#limitations-and-restrictions

cdesiniotis avatar Apr 29 '24 18:04 cdesiniotis
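
Editorial note: the reply above can be sketched as a small check that only produces precompiled-driver Helm flags for an operating system that has published images. This is a hedged illustration, not the operator's own logic: the helper driver_helm_flags and the PRECOMPILED_OS set are hypothetical, the supported-OS list is taken only from this thread (Ubuntu 22.04), and the use of driver.version to carry a driver branch is an assumption based on the linked precompiled-drivers documentation.

```python
from typing import List

# Assumption from the thread: precompiled driver images are published only for
# Ubuntu 22.04. Check the linked documentation for the current list.
PRECOMPILED_OS = {("ubuntu", "22.04")}


def driver_helm_flags(os_id: str, os_version: str, driver_branch: str) -> List[str]:
    """Build hypothetical `helm --set` flags that enable precompiled drivers.

    Raises ValueError when the thread gives no precompiled images for the OS.
    """
    if (os_id.lower(), os_version) not in PRECOMPILED_OS:
        raise ValueError(f"no precompiled driver images for {os_id} {os_version}")
    return [
        "--set", "driver.usePrecompiled=true",
        # Assumed: precompiled images are selected by driver branch, not full version.
        "--set", f"driver.version={driver_branch}",
    ]


if __name__ == "__main__":
    # The branch value here is only an example; availability is not confirmed.
    print(" ".join(driver_helm_flags("ubuntu", "22.04", "550")))
    try:
        driver_helm_flags("ubuntu", "20.04", "550")
    except ValueError as err:
        print(err)
```

For the reporter's Ubuntu 20.04 nodes the helper raises an error, which mirrors the answer in the thread: without published precompiled images, the driver container still needs to reach a package repository on every restart.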

Thanks.

sunwuyan avatar Jun 03 '24 08:06 sunwuyan

This issue is stale because it has been open 90 days with no activity. This issue will be closed in 30 days unless new comments are made or the stale label is removed. To skip these checks, apply the "lifecycle/frozen" label.

github-actions[bot] avatar Nov 05 '25 00:11 github-actions[bot]

Closing this issue as relevant questions were answered.

cdesiniotis avatar Nov 14 '25 21:11 cdesiniotis