
[BUG] Installation error: symbol cublasLtGetStatusString version libcublasLt.so.11 not defined

Open · wqw547243068 opened this issue 2 years ago

🐛 Describe the bug

Installation

Installation steps:

# create and activate virtualenv
# install 
cd application/ChatGPT
pip install .
# test
cd examples
sh train_dummy.sh

First Error with virtualenv

Then this error message popped up:

OSError: /root/bin/ai/lib/python3.7/site-packages/nvidia/cublas/lib/libcublas.so.11: symbol cublasLtGetStatusString version libcublasLt.so.11 not defined in file libcublasLt.so.11 with link time reference

It seems something is wrong with the CUDA libraries ...
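This kind of undefined-symbol error typically means two different copies of the cuBLAS libraries are on the loader path: one from the pip `nvidia-cublas` wheel and one bundled inside torch, and the loader mixes them. A diagnostic sketch (not from the original report; `SITE` and the directory names are assumptions based on the path in the traceback) to list both copies:

```shell
#!/bin/sh
# Sketch: look for duplicate cuBLAS libraries -- one copy from the pip
# "nvidia-cublas" wheel and one bundled inside torch. SITE is a placeholder;
# point it at your virtualenv's site-packages directory.
SITE="${SITE:-/root/bin/ai/lib/python3.7/site-packages}"
for d in "$SITE/nvidia/cublas/lib" "$SITE/torch/lib"; do
  # Print any cuBLAS shared objects found in this directory
  [ -d "$d" ] && ls "$d" | grep -i 'libcublas' || true
done
```

If both directories list `libcublasLt.so.11`, the two copies are candidates for the symbol-version conflict.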

Second Error with CUDA

When I deactivate the virtualenv with the deactivate command and return to the default Python environment, that error disappears, but another one comes up:

RuntimeError: CUDA error: invalid device ordinal
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 146 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 1 (pid: 147) of binary: /usr/bin/python3
Traceback (most recent call last):
  File "/usr/local/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.7/dist-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
    return f(*args, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/torch/distributed/run.py", line 761, in main
    run(args)
  File "/usr/local/lib/python3.7/dist-packages/torch/distributed/run.py", line 755, in run
    )(*cmd_args)
  File "/usr/local/lib/python3.7/dist-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/usr/local/lib/python3.7/dist-packages/torch/distributed/launcher/api.py", line 247, in launch_agent
    failures=result.failures,
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
train_dummy.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2023-02-24_17:26:03
  host      : mlxlab6zo2knwh6360c359-20221101065730-6hyrm1-ejpeww-worker
  rank      : 1 (local_rank: 1)
  exitcode  : 1 (pid: 147)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
INFO[0076] Worker 0 Status Failed                        host=10.22.148.79 message= reason=Error
error: exec command: 0

Why does this happen?

CUDA error: invalid device ordinal
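A likely cause, going by the nvidia-smi output in the Environment section showing a single V100: the launcher started more ranks than there are GPUs, so rank 1 tried to select device index 1 on a one-GPU machine. A hedged sketch (the torchrun flags shown are standard torchrun options, but train_dummy.py's own arguments are not shown in this thread) that sizes the launch to the visible GPU count:

```shell
#!/bin/sh
# Sketch: "invalid device ordinal" is raised when a rank selects a GPU index
# that does not exist. Launch one process per visible GPU instead of using a
# hard-coded process count.
NGPU=$(nvidia-smi --list-gpus | wc -l)
torchrun --standalone --nproc_per_node="$NGPU" train_dummy.py
```

With only one GPU visible, this launches a single rank, so no process asks for device 1.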

Environment

Environment:

  • python 3.7.3
  • pytorch: 1.13.1
  • cuda: 11.3
  • cpu: 1
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.191.01   Driver Version: 450.191.01   CUDA Version: 11.3     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla V100-PCIE...  On   | 00000000:8D:00.0 Off |                    0 |
| N/A   39C    P0    45W / 250W |      0MiB / 32510MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

wqw547243068 avatar Feb 24 '23 09:02 wqw547243068

It might be due to a mismatch between the torch and CUDA versions. Could you try reinstalling torch via conda install pytorch==1.12.1 torchvision==0.13.1 torchaudio==0.12.1 cudatoolkit=11.3 -c pytorch?
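Before reinstalling, it can help to confirm the mismatch. A sketch (assumes torch is importable and nvidia-smi is on PATH; the sed expression just extracts the "CUDA Version" field from nvidia-smi's banner):

```shell
#!/bin/sh
# Sketch: compare the CUDA version torch was compiled against with the one
# the driver reports; a mismatch supports reinstalling a matching build.
TORCH_CUDA=$(python -c "import torch; print(torch.version.cuda)")
DRIVER_CUDA=$(nvidia-smi | sed -n 's/.*CUDA Version: \([0-9.]*\).*/\1/p')
if [ "$TORCH_CUDA" = "$DRIVER_CUDA" ]; then
  echo "CUDA versions match: $TORCH_CUDA"
else
  echo "mismatch: torch built for $TORCH_CUDA, driver supports $DRIVER_CUDA"
fi
```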

JThh avatar Feb 27 '23 04:02 JThh

I removed the miniconda3/lib/python3.10/site-packages/nvidia/cublas and miniconda3/envs/cai/lib/python3.8/site-packages/torch/lib/libcublas*, and the error goes away.

yctam avatar Apr 19 '23 10:04 yctam

> I removed the miniconda3/lib/python3.10/site-packages/nvidia/cublas and miniconda3/envs/cai/lib/python3.8/site-packages/torch/lib/libcublas*, and the error goes away.

Thank you! Worked for me

yustiks avatar Apr 27 '23 15:04 yustiks

Your directory contains extra directories that weren't there originally, so delete the folders that don't belong to the project.

ADongGu avatar Oct 19 '23 09:10 ADongGu