open-gpu-kernel-modules

CUDA 12.8 Multi-GPU Blackwell/A100 Fails

Open vladrad opened this issue 8 months ago • 2 comments

NVIDIA Open GPU Kernel Modules Version

575.51.03, 570.133.20, and 570.124.06

Please confirm this issue does not happen with the proprietary driver (of the same version). This issue tracker is only for bugs specific to the open kernel driver.

  • [ ] I confirm that this does not happen with the proprietary driver package.

Operating System and Version

Description: Ubuntu 22.04.3 LTS

Kernel Release

6.8.0-59-generic #61~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Tue Apr 15 17:03:15 UTC 2 x86_64 x86_64 x86_64 GNU/Linux

Please confirm you are running a stable release kernel (e.g. not a -rc). We do not accept bug reports for unreleased kernels.

  • [x] I am running on a stable kernel release.

Hardware: GPU

GPU 0: NVIDIA RTX PRO 6000 Blackwell Workstation Edition, GPU 1: NVIDIA A100 80GB PCIe

Describe the bug

I am not able to enumerate the available GPUs in multiple applications.

nvidia-smi shows the correct GPUs, but using both of them in a PyTorch or CUDA application fails when querying the available devices.

If I export CUDA_VISIBLE_DEVICES=0 or 1 and then reload the nvidia_uvm module:

sudo rmmod nvidia_uvm
sudo modprobe nvidia_uvm

it seems to work. I understand that there may be a mismatch between the GPUs; however, even using them independently has issues unless I do the steps above.
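To narrow down which device selection breaks initialization, the same check can be repeated under each CUDA_VISIBLE_DEVICES setting. The following is only a rough sketch, not something from the original report: the helper names `visible_device_settings` and `probe` are hypothetical, and it assumes the driver library can be loaded as `libcuda.so.1`. Each probe runs in a fresh subprocess because CUDA_VISIBLE_DEVICES has to be set before the driver is initialized in that process.

```python
import os
import subprocess
import sys

# One-liner run in a child process: prints the raw return code of cuInit(0).
# Assumes the driver library is installed under the name libcuda.so.1.
PROBE = "import ctypes; cuda = ctypes.CDLL('libcuda.so.1'); print(cuda.cuInit(0))"


def visible_device_settings(num_gpus):
    """Build the settings to try: each GPU alone, then all GPUs together."""
    singles = [str(i) for i in range(num_gpus)]
    if num_gpus > 1:
        return singles + [",".join(singles)]
    return singles


def probe(setting, command=None):
    """Run a command with CUDA_VISIBLE_DEVICES=setting in a fresh process.

    Returns (exit_code, stdout). With the default command, stdout holds
    the cuInit return code (0 means success).
    """
    env = dict(os.environ, CUDA_VISIBLE_DEVICES=setting)
    cmd = command or [sys.executable, "-c", PROBE]
    proc = subprocess.run(cmd, env=env, capture_output=True, text=True)
    return proc.returncode, proc.stdout.strip()


if __name__ == "__main__":
    # Two GPUs, as in this report (GPU 0 and GPU 1).
    for setting in visible_device_settings(2):
        print(f"CUDA_VISIBLE_DEVICES={setting}: {probe(setting)}")
```

Running it once before and once after reloading nvidia_uvm would show whether the reload changes the result for any of the three settings.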

I tried CUDA 12.8 and 12.9 and neither works. At this point I am assuming this may be a driver issue, or is it a CUDA issue?

To Reproduce

Using any compiled libraries such as vllm/llama.cpp causes the issue.

Also tried:

export CUDA_VISIBLE_DEVICES=0,1
/usr/local/cuda-12.8/extras/CUPTI/samples/event_multi_gpu$ ./event_multi_gpu
Usage: ./event_multi_gpu [event_name]

Error: event_multi_gpu.cu:63: Function cuInit(0) failed with error(3): initialization error.
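The failure above comes from `cuInit(0)` itself, so it may be possible to reproduce it without CUPTI or any compiled sample. The sketch below is an assumption-laden minimal check, not part of the original report: it loads the driver library through `ctypes` (assumed to be available as `libcuda.so.1`) and calls the driver API functions `cuInit`, `cuDeviceGetCount`, and `cuGetErrorName`. The function name `probe_cuda_driver` and the status strings it returns are made up for illustration.

```python
import ctypes


def probe_cuda_driver(lib_name="libcuda.so.1"):
    """Call cuInit(0) and cuDeviceGetCount, returning (status, detail)."""
    try:
        cuda = ctypes.CDLL(lib_name)
    except OSError as exc:
        # The driver library could not be loaded at all.
        return ("no-driver-library", str(exc))

    def error_name(code):
        # cuGetErrorName writes a pointer to a static string for the code.
        name = ctypes.c_char_p()
        if cuda.cuGetErrorName(code, ctypes.byref(name)) == 0 and name.value:
            return name.value.decode()
        return "unknown"

    rc = cuda.cuInit(0)
    if rc != 0:
        return ("cuInit-failed", f"error({rc}): {error_name(rc)}")

    count = ctypes.c_int(0)
    rc = cuda.cuDeviceGetCount(ctypes.byref(count))
    if rc != 0:
        return ("cuDeviceGetCount-failed", f"error({rc}): {error_name(rc)}")

    return ("ok", f"{count.value} device(s) visible")


if __name__ == "__main__":
    print(probe_cuda_driver())
```

If this also fails at `cuInit` with error 3 when both GPUs are visible, that would suggest the problem sits at the driver level, below PyTorch, vllm, or llama.cpp.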

Bug Incidence

Always

nvidia-bug-report.log.gz

nvidia-bug-report.log

More Info

No response

vladrad, May 19 '25 18:05

Hi all, I am back. I am seeing people with similar issues on the CUDA forums. I tried CUDA 12.8 and 12.9 and the last 2-3 open driver versions.

https://forums.developer.nvidia.com/t/cuda-12-8-with-driver-version-570-124-06-on-b200-hgx-getting-code-3-cudaerrorinitializationerror/331233/3
https://forums.developer.nvidia.com/t/cuda-cant-initialize-after-upgrade/332770

I'm not sure how I can test this further.

vladrad, May 20 '25 22:05

Related to #797

Diatrus, Jul 16 '25 01:07