Compatibility between PyTorch and CUDA 12.4
Hello everyone,
I am Yang Ren shu. I recently ran into a version-compatibility problem. Specifically, I hit this error:
RuntimeError: NVML_SUCCESS == DriverAPI::get()->nvmlInit_v2_() INTERNAL ASSERT FAILED at "../c10/cuda/CUDACachingAllocator.cpp":963, please report a bug to PyTorch.
I found related articles indicating that a sudo reboot is required, but I am worried that someone else is currently using the machine or depends on this version. How did you resolve this? Thank you.
I also found that this PyTorch build supports CUDA only up to 12.4, while 12.6 was originally installed. I wanted to switch directly with:
wget https://developer.download.nvidia.com/compute/cuda/12.4.0/local_installers/cuda_12.4.0_550.54.14_linux.run
sudo sh cuda_12.4.0_550.54.14_linux.run
but I am worried about affecting other people. I also tried installing locally with conda install pytorch torchvision torchaudio cudatoolkit=12.4 -c pytorch, but it does not install successfully.
My concern is that I am on a shared workstation, working inside a virtual environment, and I am afraid that changing the system CUDA will upset other users. I only want to run PyTorch with GPU acceleration. I also have a second question: I keep getting this warning:
/home/nthuuser/miniconda3/envs/earthquake/lib/python3.12/site-packages/torch/cuda/__init__.py:716: UserWarning: Can't initialize NVML
warnings.warn("Can't initialize NVML")
Could someone help me? These two problems are really frustrating me. Thanks in advance.
CUDA Version Mismatch: PyTorch in your setup supports CUDA up to 12.4, but the workstation has CUDA 12.6 installed. This mismatch causes the RuntimeError: NVML_SUCCESS == DriverAPI::get()->nvmlInit_v2_() issue, as PyTorch requires a compatible CUDA runtime to utilize the GPU. Directly downgrading the system-wide CUDA version could impact other users on the shared workstation.
NVML Initialization Failure: The warning Can't initialize NVML suggests that PyTorch cannot interface with the NVIDIA driver properly. This can stem from driver-toolkit mismatches or configuration issues, potentially leading to degraded GPU functionality like monitoring or resource allocation.
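As a quick diagnostic for the NVML warning, it can help to check whether the NVIDIA kernel driver is actually loaded before involving PyTorch at all. This is a minimal sketch assuming a Linux host; on Linux the loaded driver exposes its version under /proc:

```python
import os

# On Linux, the loaded NVIDIA kernel driver reports its version here.
# If this file is missing, or shows a different version than nvidia-smi
# after a driver upgrade, NVML cannot initialize until the machine is
# rebooted (or an admin reloads the kernel module).
DRIVER_INFO = "/proc/driver/nvidia/version"

if os.path.exists(DRIVER_INFO):
    with open(DRIVER_INFO) as f:
        driver_status = f.read().strip()
else:
    driver_status = "NVIDIA kernel driver not loaded (or not a Linux host)"

print(driver_status)
```

If the file is present but NVML still fails, that points at the stale-driver situation the reboot advice is about, rather than anything in your conda environment.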
Create an isolated conda environment so nothing system-wide changes:

```
conda create -n pytorch_env python=3.12 -y
conda activate pytorch_env
conda install pytorch torchvision torchaudio pytorch-cuda=12.1 -c pytorch -c nvidia
```

Then verify inside Python:

```python
import torch
print(torch.cuda.is_available())  # True
print(torch.version.cuda)         # Should match installed version, e.g., 12.1
```

Alternatively, install the CUDA 12.4 wheels with pip:

```
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu124
```
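On a shared workstation it is also worth making the verification step degrade gracefully, so the same script runs whether or not the GPU is currently usable. A small sketch (the fallback-to-CPU choice is mine, not required by PyTorch):

```python
# Sanity check for the new environment: pick the GPU when CUDA is
# usable, otherwise fall back to CPU instead of crashing.
try:
    import torch
    device = "cuda" if torch.cuda.is_available() else "cpu"
    status = f"torch {torch.__version__} will use device: {device}"
except ImportError:
    status = "torch is not installed in this environment"

print(status)
```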
Additional tips:

- Use nvidia-smi to ensure the NVIDIA driver supports the desired CUDA version. For instance, driver version 525.60.13 or higher is needed for CUDA 12.x.
- Use CUDA_VISIBLE_DEVICES to allocate specific GPUs to your process if multiple users are sharing the same hardware:

```
export CUDA_VISIBLE_DEVICES=0  # Assigns only GPU 0 to your process
```
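The same restriction can be applied from inside Python, as long as it happens before torch (or any other CUDA library) is imported. A minimal sketch, where the index 0 is just an example; use whichever GPU is assigned to you:

```python
import os

# Must be set before torch (or any CUDA library) is imported:
# the process will then see only GPU 0, renumbered as device 0.
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

print(os.environ["CUDA_VISIBLE_DEVICES"])
```

Setting it in the shell with export (as above) is equivalent and usually safer, since it cannot accidentally run after CUDA has already been initialized.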