Error on NCCL:

Open itscvenk opened this issue 2 years ago • 0 comments

python -m ochat.serving.openai_api_server --model openchat/openchat_3.5 --dtype float16

(I added --dtype float16 as the README suggests, because without it I got "ValueError: Bfloat16 is only supported on GPUs with compute capability of at least 8.0. Your Tesla T4 GPU has compute capability 7.5.")
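The dtype choice forced by that ValueError can be sketched as a small check. This is only an illustration under assumptions: `pick_dtype` is a hypothetical helper, not part of ochat or vLLM, and it encodes nothing beyond the 8.0 threshold quoted in the error message.

```python
def pick_dtype(capability):
    """Return 'bfloat16' for compute capability >= 8.0, else 'float16'.

    `capability` is a (major, minor) pair, e.g. (7, 5) for a Tesla T4.
    Hypothetical helper; the 8.0 threshold comes from the error text above.
    """
    return "bfloat16" if tuple(capability) >= (8, 0) else "float16"


if __name__ == "__main__":
    try:
        import torch
    except ImportError:
        torch = None

    if torch is not None and torch.cuda.is_available():
        # get_device_capability returns a (major, minor) tuple for the device.
        cap = torch.cuda.get_device_capability(0)
        print(f"compute capability {cap[0]}.{cap[1]} -> --dtype {pick_dtype(cap)}")
    else:
        print("No CUDA device visible; cannot query compute capability")
```

For the T4 reported here (7.5) this selects float16, which matches the flag already passed on the command line.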

I have a single GPU.

Now when I run the above, after a huge bunch of output, the run fails here:

  File "/data/anaconda/envs/ptca/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 2050, in all_reduce
    work = group.allreduce([tensor], opts)
torch.distributed.DistBackendError: NCCL error in: ../torch/csrc/distributed/c10d/NCCLUtils.hpp:219, invalid argument, NCCL version 2.14.3
ncclInvalidArgument: Invalid value for an argument.
Last error:
Invalid config blocking attribute value -2147483648

It seems two NCCL packages are installed :-( Which one should I remove, and how?

~$ pip list | grep nccl
nvidia-nccl-cu11          2.14.3
nvidia-nccl-cu12          2.18.1
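A rough way to see the same information from inside Python, next to the CUDA version the installed PyTorch was built for, is sketched below. The helper name `find_nccl_packages` is made up for illustration. The assumption (not confirmed in this report) is that the `cu11`/`cu12` suffix of the package torch needs usually matches `torch.version.cuda`, so the output may hint at which wheel is the stray one; it does not prove it.

```python
from importlib import metadata


def find_nccl_packages(distributions=None):
    """Return sorted (name, version) pairs for distributions whose name contains 'nccl'.

    Hypothetical helper. By default it scans the current environment;
    an iterable of distribution-like objects can be passed for testing.
    """
    if distributions is None:
        distributions = metadata.distributions()
    found = set()
    for dist in distributions:
        name = dist.metadata["Name"]
        if name and "nccl" in name.lower():
            found.add((name, dist.version))
    return sorted(found)


if __name__ == "__main__":
    for name, version in find_nccl_packages():
        print(f"{name:<25} {version}")

    try:
        import torch
    except ImportError:
        print("torch is not importable in this environment")
    else:
        print("torch", torch.__version__, "built for CUDA", torch.version.cuda)
        try:
            # May fail on builds without NCCL support, hence the broad guard.
            print("NCCL version reported by torch:", torch.cuda.nccl.version())
        except Exception as exc:
            print("could not query NCCL version from torch:", exc)
```

If the CUDA version printed for torch starts with 11, the `nvidia-nccl-cu12` entry would be the first candidate to look at, though that remains a guess until confirmed against the environment.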

itscvenk, Dec 09 '23 17:12