[BUG] Installation error: symbol cublasLtGetStatusString version libcublasLt.so.11 not defined
🐛 Describe the bug
Installation
Installation steps:
# create and activate virtualenv
# install
cd application/ChatGPT
pip install .
# test
cd examples
sh train_dummy.sh
First Error with virtualenv
Then this error message pops up:
OSError: /root/bin/ai/lib/python3.7/site-packages/nvidia/cublas/lib/libcublas.so.11: symbol cublasLtGetStatusString version libcublasLt.so.11 not defined in file libcublasLt.so.11 with link time reference
It seems something is wrong with the CUDA libraries installed inside the virtualenv.
Second Error with CUDA
When I leave the virtualenv with the deactivate command and return to the default Python environment, that error disappears and another one comes up:
RuntimeError: CUDA error: invalid device ordinal
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 146 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 1 (pid: 147) of binary: /usr/bin/python3
Traceback (most recent call last):
File "/usr/local/bin/torchrun", line 8, in <module>
sys.exit(main())
File "/usr/local/lib/python3.7/dist-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
return f(*args, **kwargs)
File "/usr/local/lib/python3.7/dist-packages/torch/distributed/run.py", line 761, in main
run(args)
File "/usr/local/lib/python3.7/dist-packages/torch/distributed/run.py", line 755, in run
)(*cmd_args)
File "/usr/local/lib/python3.7/dist-packages/torch/distributed/launcher/api.py", line 131, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/usr/local/lib/python3.7/dist-packages/torch/distributed/launcher/api.py", line 247, in launch_agent
failures=result.failures,
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
train_dummy.py FAILED
------------------------------------------------------------
Failures:
<NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2023-02-24_17:26:03
host : mlxlab6zo2knwh6360c359-20221101065730-6hyrm1-ejpeww-worker
rank : 1 (local_rank: 1)
exitcode : 1 (pid: 147)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
INFO[0076] Worker 0 Status Failed host=10.22.148.79 message= reason=Error
error: exec command: 0
Why does this happen?
CUDA error: invalid device ordinal
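One possible reading, not confirmed anywhere in this thread: the failing worker reports local_rank 1, while the nvidia-smi output below lists a single GPU (index 0). If the launch script starts more processes per node than there are visible GPUs, the worker with local_rank 1 has no device to bind to. A minimal sketch of a pre-flight check, assuming the script is launched by torchrun (which sets the LOCAL_RANK environment variable); the helper name check_local_rank is hypothetical and not part of the project:

```python
import os


def check_local_rank(local_rank, device_count):
    """Return a problem description if local_rank has no matching GPU, else None."""
    if device_count <= 0:
        return "no CUDA devices are visible"
    if not 0 <= local_rank < device_count:
        return (
            f"local_rank {local_rank} needs a GPU with that index, "
            f"but only {device_count} device(s) are visible"
        )
    return None


if __name__ == "__main__":
    # torch is imported here so the helper above stays usable without it.
    import torch

    # LOCAL_RANK is set by torchrun; default to 0 for a plain python launch.
    local_rank = int(os.environ.get("LOCAL_RANK", "0"))
    problem = check_local_rank(local_rank, torch.cuda.device_count())
    print(problem or "ok")
```

If the check reports a problem, lowering the number of processes per node to match the GPU count is the first thing to try.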
Environment
Environment:
- python 3.7.3
- pytorch: 1.13.1
- cuda: 11.3
- cpu: 1
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.191.01   Driver Version: 450.191.01   CUDA Version: 11.3     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla V100-PCIE...  On   | 00000000:8D:00.0 Off |                    0 |
| N/A   39C    P0    45W / 250W |      0MiB / 32510MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
This might be caused by a mismatch between the torch and CUDA versions. Could you try reinstalling torch with the following command?
conda install pytorch==1.12.1 torchvision==0.13.1 torchaudio==0.12.1 cudatoolkit=11.3 -c pytorch
I removed miniconda3/lib/python3.10/site-packages/nvidia/cublas and miniconda3/envs/cai/lib/python3.8/site-packages/torch/lib/libcublas*, and the error went away.
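Before deleting anything, it may help to see which copies of libcublas are actually present. The sketch below only lists files and removes nothing. It assumes the conflicting libraries live under site-packages or dist-packages directories on sys.path, as in the paths quoted above; the function name find_cublas_copies is hypothetical.

```python
import fnmatch
import os
import sys


def find_cublas_copies(roots):
    """Return a sorted list of libcublas*.so* files found under the given directories."""
    matches = set()
    for root in roots:
        if not os.path.isdir(root):
            continue  # skip missing paths and non-directory sys.path entries
        for dirpath, _dirnames, filenames in os.walk(root):
            for name in filenames:
                # Matches both libcublas.so.* and libcublasLt.so.*
                if fnmatch.fnmatch(name, "libcublas*.so*"):
                    matches.add(os.path.join(dirpath, name))
    return sorted(matches)


if __name__ == "__main__":
    # Assumption: only package directories are worth scanning.
    roots = [p for p in sys.path if p.endswith(("site-packages", "dist-packages"))]
    for path in find_cublas_copies(roots):
        print(path)
```

Run it with the same interpreter that fails. More than one directory in the output (for example nvidia/cublas/lib next to torch/lib) suggests that libcublas and libcublasLt may be loaded from different builds.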
Thank you! Worked for me
This happens because your directory contains extra folders that were not there originally. Delete the folders that do not belong to the project.