Upgrading the NVIDIA driver without resetting the cluster
I installed DeepOps v23.08 last year, using K8s and the default settings. But now I need to upgrade the NVIDIA Driver so I can use a newer version of CUDA in the container. How can I upgrade the Driver without resetting the cluster?
I don't think it's necessary to reboot the entire cluster, but you do need to reboot each node you upgrade. You can taint the nodes individually and run these commands to upgrade the NVIDIA driver and CUDA version:
ansible-playbook playbooks/nvidia-software/nvidia-driver.yml -e nvidia_driver_force_install=True [-l <list-of-nodes>]
ansible-playbook playbooks/nvidia-software/nvidia-cuda.yml [-l <list-of-nodes>]
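To make the per-node procedure concrete, here is a rough sketch of how one node at a time could be taken out of scheduling, upgraded, and returned to service. It uses `kubectl cordon` and `kubectl drain` rather than a hand-written taint, which is a substitution on my part, and it assumes the Kubernetes node name matches the Ansible inventory hostname (that may not hold in your cluster). The helper names `run`, `upgrade_node`, and `return_node`, the `DRY_RUN` variable, and the node name `gpu01` are all made up for illustration. By default the script only prints the commands, so nothing is changed until you set `DRY_RUN=0`.

```shell
#!/usr/bin/env bash
# Sketch only: per-node driver upgrade. Run from the DeepOps repository root.
# Assumption: Kubernetes node names equal Ansible inventory hostnames.
set -euo pipefail

# DRY_RUN=1 (default) prints each command instead of executing it.
DRY_RUN="${DRY_RUN:-1}"

run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Stop scheduling on the node, evict its pods, then run the two playbooks
# limited to that node.
upgrade_node() {
  local node="$1"
  run kubectl cordon "$node"
  run kubectl drain "$node" --ignore-daemonsets --delete-emptydir-data
  run ansible-playbook playbooks/nvidia-software/nvidia-driver.yml \
    -e nvidia_driver_force_install=True -l "$node"
  run ansible-playbook playbooks/nvidia-software/nvidia-cuda.yml -l "$node"
}

# Call this only after the node has been rebooted and is Ready again.
return_node() {
  local node="$1"
  run kubectl uncordon "$node"
}
```

Example: `upgrade_node gpu01`, then reboot `gpu01` yourself and confirm the new driver with `nvidia-smi`, then `return_node gpu01`. Doing this on a single non-critical node first limits the damage if the upgrade goes wrong.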
I have studied these two YAML files. Will this upgrade directly to the latest NVIDIA driver? Or is the version determined by gpu_operator_driver_version in roles/nvidia-gpu-operator/defaults/main.yml?
Because I use DeepOps in a production environment, I am afraid of breaking it.
Thanks for your reply.