
onnxruntime-gpu not working with my gpu / setup

Open jillson opened this issue 1 year ago • 3 comments

Describe the issue

Configuration: RTX 3050 (Laptop). nvidia-smi reports driver 555.85 and CUDA 12.5, which is slightly confusing inasmuch as I uninstalled CUDA 12.5 and installed 12.1 based on the documented version compatibility. cuDNN 8.9.7.29 is installed but doesn't appear to be used. I'm using pytorch (torch 2.1.2+cu121), onnx 1.16.1, and onnxruntime-gpu 1.18.1.

When I try to run the pared-down code (see below), I get an error about not being able to load CUDA:

2024-06-30 15:21:47.1727155 [E:onnxruntime:Default, provider_bridge_ort.cc:1745 onnxruntime::TryGetProviderInfo_CUDA] D:\a\_work\1\s\onnxruntime\core\session\provider_bridge_ort.cc:1426 onnxruntime::ProviderLibrary::Get [ONNXRuntimeError] : 1 : FAIL : LoadLibrary failed with error 126 "" when trying to load "(snip)\venv\lib\site-packages\onnxruntime\capi\onnxruntime_providers_cuda.dll"

To reproduce

Activate my virtual environment (which has the versions listed above) and then run:

import os

import psutil
import torch
import onnxruntime as ort

# Bind the CUDA EP to torch's current device and compute stream.
providers = [("CUDAExecutionProvider", {"device_id": torch.cuda.current_device(),
                                        "user_compute_stream": str(torch.cuda.current_stream().cuda_stream)})]
sess_options = ort.SessionOptions()

# Dump every library mapped into this process so far.
p = psutil.Process(os.getpid())
for lib in p.memory_maps():
    print(lib.path)

model_path = "./venv/Lib/site-packages/onnx/backend/test/data/node/test_simple_rnn_batchwise/model.onnx"
try:
    sess = ort.InferenceSession(model_path, sess_options=sess_options, providers=providers)
except Exception:
    pass

I get back the error above and also the list of loaded DLLs; these include torch's bundled cuDNN (and zlib) but not the NVIDIA CUDA/cuDNN files I installed separately. At least one Stack Overflow post indicated that with the pytorch build I'm using, I shouldn't need those, since pytorch helpfully bakes them in. I've tried unsetting the CUDNN/CUDA environment variables and removing them from $PATH, with the same behavior.
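For reference, the DLL dump can be narrowed to just the CUDA/cuDNN entries, which makes it easier to see whose copies are actually in the process; a small sketch along the lines of the repro above:

import os
import psutil

# Print only the CUDA/cuDNN/zlib-related libraries mapped into this process,
# to see whether they come from torch's lib folder or a system install.
p = psutil.Process(os.getpid())
for lib in p.memory_maps():
    name = os.path.basename(lib.path).lower()
    if "cudnn" in name or "cuda" in name or "zlib" in name:
        print(lib.path)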

Urgency

Very low

Platform

Windows

OS Version

11

ONNX Runtime Installation

Released Package

ONNX Runtime Version or Commit ID

1.18.1

ONNX Runtime API

Python

Architecture

X64

Execution Provider

CUDA

Execution Provider Library Version

CUDA 12.1 (or trying to; may be somehow still using CUDA 12.5)

jillson avatar Jun 30 '24 19:06 jillson

Minor update: https://stackoverflow.com/a/53504578 would seem to indicate my nvidia-smi behavior is expected: I do in fact have the CUDA 12.1 toolkit installed (nvcc --version returns 12.1), but I have the latest (or at least a newer) driver, which supports up to CUDA 12.5, and nvidia-smi reports the driver's maximum. Given I'm using pytorch 2.1.2+cu121, I'm going to assume I'm effectively running CUDA 12.1 for purposes here.

nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Wed_Feb__8_05:53:42_Coordinated_Universal_Time_2023
Cuda compilation tools, release 12.1, V12.1.66
Build cuda_12.1.r12.1/compiler.32415258_0
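A quick cross-check from inside the venv, since the pytorch wheel carries its own CUDA runtime (a sketch; this only confirms what the wheel was built against, not what onnxruntime will load):

import torch

print(torch.version.cuda)         # expect "12.1" for a +cu121 wheel
print(torch.cuda.is_available())  # True means the installed driver can run this build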

jillson avatar Jun 30 '24 19:06 jillson

Could you use Dependency Walker to load onnxruntime_providers_cuda.dll and take a look at which dependent DLL is missing?
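If Dependency Walker isn't handy, the underlying loader failure can often be reproduced from Python with ctypes; a sketch (the path is a placeholder for the venv's actual site-packages location):

import ctypes

# Loading the provider DLL directly raises OSError (WinError 126) when it
# or one of its dependencies (e.g. cuDNN) can't be resolved.
ctypes.WinDLL(r".\venv\Lib\site-packages\onnxruntime\capi\onnxruntime_providers_cuda.dll")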

mszhanyi avatar Jul 01 '24 02:07 mszhanyi

1.18.1 for CUDA 12 requires cuDNN 9.* instead of 8.*. See the release notes: https://github.com/microsoft/onnxruntime/releases/tag/v1.18.1
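A quick way to see which cuDNN a given torch build actually bundles (a sketch; torch reports the version it loaded, encoded so that e.g. 8902 means 8.9.2):

import torch

# Anything below 9000 here is cuDNN 8.x, which won't satisfy
# onnxruntime-gpu 1.18.1 with CUDA 12 per the release notes.
print(torch.backends.cudnn.version())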

tianleiwu avatar Jul 01 '24 03:07 tianleiwu

And Python does not use the PATH environment variable when searching for DLLs
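On Python 3.8+ for Windows, extra DLL directories have to be registered explicitly; a sketch using os.add_dll_directory (the cuDNN path below is a placeholder):

import os

# PATH is ignored for DLL dependency resolution on modern Python;
# register the directory explicitly before importing onnxruntime.
os.add_dll_directory(r"C:\Program Files\NVIDIA\CUDNN\v8.x\bin")  # placeholder

import onnxruntime as ort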

snnn avatar Jul 01 '24 16:07 snnn

1.18.1 for cuda 12 requires cudnn 9.* instead of 8.*. See release note: https://github.com/microsoft/onnxruntime/releases/tag/v1.18.1

Hmm... https://onnxruntime.ai/docs/execution-providers/CUDA-ExecutionProvider.html#requirements apparently needs to be updated to distinguish 1.18.1 (which, as you note, requires cuDNN 9) from 1.18.0, which worked with cuDNN 8.9. Currently trying to download torch's nightly, which I'm hoping will get me cuDNN 9 (my attempt to overwrite torch's "vendored" 8.x DLLs with the 9.x DLLs I had downloaded went about as well as you'd expect). If that doesn't work, I'll likely roll back to the 1.18.0 binary and see if that gets things aligned.

Thanks for the reminder about Python not using PATH for DLL resolution.

jillson avatar Jul 01 '24 22:07 jillson

Hmm... now getting OSError: [WinError 126] The specified module could not be found. Error loading "C:\Users\jills\git\stable-diffusion-webui\venv\lib\site-packages\torch\lib\fbgemm.dll" or one of its dependencies. This happens both when using the latest nightly pytorch (which does have cuDNN 9 DLLs) and when, in a different venv, reverting to onnxruntime-gpu==1.18.0. At this point the best thing to do is likely to reinstall the venv, but I'm going to wait until next week due to much slower internet this week.

jillson avatar Jul 02 '24 00:07 jillson

I'm also unable to use CUDA for reactor. I'm receiving the same error as the OP.

NulliferBones avatar Jul 05 '24 00:07 NulliferBones

Switching to a torch build with cuDNN 8.9 and onnxruntime-gpu==1.18.0, my simple example now works... but I'm still getting FAIL : LoadLibrary failed with error 126 when trying to load onnxruntime_providers_cuda.dll in stable diffusion, which is what I actually care about...

jillson avatar Jul 10 '24 12:07 jillson

And looking more closely at the error, somehow my virtualenv has gotten dorked up under git bash, leading it to find (stale) DLLs in the global Python install over the ones in the venv. Switching to powershell/cmd seems to now finally use CUDA for stable diffusion (or at least not throw errors and fall back to CPU)... but I'm still seeing it take way longer than I'd like to run AND perfmon indicates 0% GPU utilization... sigh
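For what it's worth, one way to check whether a session actually bound to the CUDA EP rather than silently falling back to CPU (a sketch; the model path is a placeholder):

import onnxruntime as ort

sess = ort.InferenceSession("model.onnx",  # placeholder model
                            providers=["CUDAExecutionProvider", "CPUExecutionProvider"])
# If the CUDA EP failed to load, only CPUExecutionProvider appears here.
print(sess.get_providers())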

jillson avatar Jul 10 '24 12:07 jillson