Error on NCCL:
python -m ochat.serving.openai_api_server --model openchat/openchat_3.5 --dtype float16
(This follows the README. Without the --dtype float16 argument, I got "ValueError: Bfloat16 is only supported on GPUs with compute capability of at least 8.0. Your Tesla T4 GPU has compute capability 7.5.")
I have one GPU (a Tesla T4).
When I run the command above, it prints a large amount of output and then fails here:
  File "/data/anaconda/envs/ptca/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 2050, in all_reduce
    work = group.allreduce([tensor], opts)
torch.distributed.DistBackendError: NCCL error in: ../torch/csrc/distributed/c10d/NCCLUtils.hpp:219, invalid argument, NCCL version 2.14.3
ncclInvalidArgument: Invalid value for an argument.
Last error:
Invalid config blocking attribute value -2147483648
It seems two NCCL packages are installed :-( Which one should I remove, and how?
~$ pip list | grep nccl
nvidia-nccl-cu11 2.14.3
nvidia-nccl-cu12 2.18.1
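
For anyone trying to reproduce this, here is a rough diagnostic sketch rather than a confirmed fix. It lists the installed NCCL wheels using only the standard library and, if torch imports, prints the CUDA version torch was built against. The helper names (nccl_packages, expected_nccl_package, installed_distributions) are made up for illustration, and the mapping from CUDA major version to an nvidia-nccl-cuNN wheel name is an assumption based on the two package names shown above. The traceback reports NCCL version 2.14.3, which matches the nvidia-nccl-cu11 entry, but whether removing the other wheel resolves the error is not established here.

```python
from importlib import metadata


def nccl_packages(dists):
    """Return sorted (name, version) pairs whose package name contains 'nccl'."""
    return sorted((n, v) for n, v in dists if n and "nccl" in n.lower())


def expected_nccl_package(cuda_version):
    """Map a CUDA version string such as '11.7' to a wheel name like 'nvidia-nccl-cu11'.

    Assumption: the wheel name follows the nvidia-nccl-cu<major> pattern
    seen in the pip output above.
    """
    major = cuda_version.split(".")[0]
    return f"nvidia-nccl-cu{major}"


def installed_distributions():
    """Collect (name, version) for every distribution visible to this interpreter."""
    return [(d.metadata["Name"], d.version) for d in metadata.distributions()]


if __name__ == "__main__":
    # Roughly equivalent to `pip list | grep nccl`.
    for name, version in nccl_packages(installed_distributions()):
        print(name, version)

    try:
        import torch
    except ImportError:
        print("torch is not importable in this environment")
    else:
        # torch.version.cuda is None for CPU-only builds.
        print("torch CUDA build:", torch.version.cuda)
        if torch.version.cuda:
            print("expected NCCL wheel:", expected_nccl_package(torch.version.cuda))
```

If the printed CUDA build starts with 11, the cu12 wheel would be the likely odd one out, and it could be removed with pip uninstall nvidia-nccl-cu12. Treat that as something to try, not a verified answer.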