CUDA 12.8 Multi-GPU Blackwell/A100 Fails
NVIDIA Open GPU Kernel Modules Version
575.51.03, 570.133.20, and 570.124.06
Please confirm this issue does not happen with the proprietary driver (of the same version). This issue tracker is only for bugs specific to the open kernel driver.
- [ ] I confirm that this does not happen with the proprietary driver package.
Operating System and Version
Description: Ubuntu 22.04.3 LTS
Kernel Release
6.8.0-59-generic #61~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Tue Apr 15 17:03:15 UTC 2 x86_64 x86_64 x86_64 GNU/Linux
Please confirm you are running a stable release kernel (e.g. not a -rc). We do not accept bug reports for unreleased kernels.
- [x] I am running on a stable kernel release.
Hardware: GPU
GPU 0: NVIDIA RTX PRO 6000 Blackwell Workstation Edition, GPU 1: NVIDIA A100 80GB PCIe
Describe the bug
I am not able to enumerate the available GPUs from multiple applications.
nvidia-smi shows the correct GPUs, but using both of them in a torch or CUDA application fails when the application queries the available devices.
If I set CUDA_VISIBLE_DEVICES to 0 or 1 and then reload the UVM module:
export CUDA_VISIBLE_DEVICES=0   # or 1
sudo rmmod nvidia_uvm
sudo modprobe nvidia_uvm
it seems to work. I understand that the two GPUs may be mismatched, but even using them independently fails unless I do the steps above.
I tried CUDA 12.8 and 12.9 and neither works. At this point I assume this may be a driver issue. Or is it a CUDA issue?
To Reproduce
Using any compiled library such as vllm or llama.cpp triggers the issue.
Also tried:
export CUDA_VISIBLE_DEVICES=0,1
/usr/local/cuda-12.8/extras/CUPTI/samples/event_multi_gpu$ ./event_multi_gpu
Usage: ./event_multi_gpu [event_name]
Error: event_multi_gpu.cu:63: Function cuInit(0) failed with error(3): initialization error.
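A smaller check that takes torch, vllm, and the CUPTI sample out of the picture might help narrow this down. The following is only a minimal sketch, assuming Python 3 and that the driver library is installed as libcuda.so.1. The function name probe_cuda_driver and the result keys are hypothetical, chosen for illustration. It calls only cuInit and cuDeviceGetCount from the CUDA driver API.

```python
import ctypes
import os


def probe_cuda_driver(lib_name="libcuda.so.1"):
    """Call cuInit(0) and cuDeviceGetCount directly through the driver API.

    Hypothetical helper for illustration; lib_name is an assumption about
    where the driver library is installed.
    """
    result = {
        "visible": os.environ.get("CUDA_VISIBLE_DEVICES"),
        "loaded": False,
        "init_code": None,
        "device_count": None,
    }
    try:
        cuda = ctypes.CDLL(lib_name)
    except OSError:
        # The driver library could not be loaded at all.
        return result
    result["loaded"] = True

    # CUresult cuInit(unsigned int Flags); a return value of 0 is CUDA_SUCCESS.
    cuda.cuInit.argtypes = [ctypes.c_uint]
    cuda.cuInit.restype = ctypes.c_int
    result["init_code"] = cuda.cuInit(0)
    if result["init_code"] != 0:
        # Initialization failed, so there is nothing further to query.
        return result

    # CUresult cuDeviceGetCount(int *count);
    count = ctypes.c_int(0)
    cuda.cuDeviceGetCount.argtypes = [ctypes.POINTER(ctypes.c_int)]
    cuda.cuDeviceGetCount.restype = ctypes.c_int
    if cuda.cuDeviceGetCount(ctypes.byref(count)) == 0:
        result["device_count"] = count.value
    return result


if __name__ == "__main__":
    print(probe_cuda_driver())
```

Running it with CUDA_VISIBLE_DEVICES unset, then set to 0, then to 1, both before and after reloading nvidia_uvm, would show whether cuInit itself returns a nonzero code in each configuration, independent of any framework.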
Bug Incidence
Always
nvidia-bug-report.log.gz
More Info
No response
Hi all, I am back. I see people reporting similar issues on the CUDA forums. I tried CUDA 12.8 and 12.9 with the last 2-3 open driver versions. https://forums.developer.nvidia.com/t/cuda-12-8-with-driver-version-570-124-06-on-b200-hgx-getting-code-3-cudaerrorinitializationerror/331233/3 https://forums.developer.nvidia.com/t/cuda-cant-initialize-after-upgrade/332770
I'm not sure how I can test this further.
Related to #797