
Upgrading NVIDIA Driver without resetting cluster

Open Heegreis opened this issue 1 year ago • 2 comments

I installed DeepOps v23.08 last year, using K8s and the default settings. But now I need to upgrade the NVIDIA Driver so I can use a newer version of CUDA in the container. How can I upgrade the Driver without resetting the cluster?

Heegreis avatar Jun 07 '24 04:06 Heegreis

I don't think it's necessary to reboot the entire cluster, but you do need to reboot the node. You can taint the nodes individually and run these commands to upgrade the NVIDIA driver and CUDA version:

ansible-playbook playbooks/nvidia-software/nvidia-driver.yml -e nvidia_driver_force_install=True [-l <list-of-nodes>]
ansible-playbook playbooks/nvidia-software/nvidia-cuda.yml [-l <list-of-nodes>]
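For a node-by-node rollout, the steps above could be wrapped roughly as follows. This is only a sketch under several assumptions: it uses `kubectl cordon` and `kubectl drain` in place of tainting by hand, it assumes the Kubernetes node names match the Ansible inventory hostnames, and it reboots with an ad-hoc `reboot` module call, which may be redundant if the playbook already reboots the node. The `run` and `upgrade_node` helpers and the `DRY_RUN` variable are illustrative names, not part of DeepOps.

```shell
#!/usr/bin/env bash
# Sketch: upgrade the NVIDIA driver one node at a time (run from the DeepOps repo root).
set -euo pipefail

# DRY_RUN=1 (the default) only prints each command; set DRY_RUN=0 to execute.
DRY_RUN="${DRY_RUN:-1}"

run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

upgrade_node() {
  local node="$1"
  # Stop new pods from landing on the node, then evict the running ones.
  run kubectl cordon "$node"
  run kubectl drain "$node" --ignore-daemonsets --delete-emptydir-data
  # Driver playbook from the reply above, limited to this node.
  run ansible-playbook playbooks/nvidia-software/nvidia-driver.yml \
    -e nvidia_driver_force_install=True -l "$node"
  # Assumption: an explicit reboot is still needed after the playbook.
  run ansible "$node" -b -m reboot
  # Let the scheduler use the node again.
  run kubectl uncordon "$node"
}

# Usage: ./upgrade-driver.sh gpu01 gpu02   (node names are placeholders)
for node in "$@"; do
  upgrade_node "$node"
done
```

Running it with the default `DRY_RUN=1` first shows the exact commands for each node without touching the cluster. The CUDA playbook could be added as one more `run` line in the same way.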

v-ducnt69 avatar Aug 01 '24 09:08 v-ducnt69

I have studied these two yml files. Will this upgrade directly to the latest NVIDIA driver? Or is the version based on gpu_operator_driver_version in roles/nvidia-gpu-operator/defaults/main.yml? Because I use DeepOps in a production environment, I am afraid of breaking it. Thanks for your reply.
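This does not answer which version the playbook picks, but on a production node it may help to record the installed driver version before and after the run and compare it with what the target CUDA image needs. A minimal sketch, assuming `nvidia-smi` is on the host's PATH and GNU `sort -V` is available; `version_ge`, `current_driver_version`, and the `REQUIRED` value are hypothetical names and numbers for illustration only.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Succeeds when version $1 is greater than or equal to version $2.
version_ge() {
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n1)" = "$2" ]
}

# Prints the driver version reported for the first GPU on this host.
current_driver_version() {
  nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -n1
}

# Example (REQUIRED is a placeholder; use the minimum your CUDA image documents):
# REQUIRED="535.104.05"
# if version_ge "$(current_driver_version)" "$REQUIRED"; then
#   echo "driver is new enough"
# else
#   echo "driver is older than $REQUIRED"
# fi
```

Running the check on one drained node first gives a low-risk way to see what the playbook actually installed before rolling it out further.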

Heegreis avatar Aug 13 '24 02:08 Heegreis

This issue is stale because it has been open for 60 days with no activity. Please update the issue or it will be closed in 7 days.

github-actions[bot] avatar Oct 13 '24 01:10 github-actions[bot]