ImportError: /home/miniconda3/envs/BMCook/lib/python3.10/site-packages/bmtrain/nccl/_C.cpython-310-x86_64-linux-gnu.so: undefined symbol: ncclBroadcast ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 17198) of binary: /home/miniconda3/envs/BMCook/bin/python

Open wln20 opened this issue 2 years ago • 3 comments

Hi, I encountered the error described in the title of this issue, while trying to run the gpt-2 example. Here is my command:

export CUDA_VISIBLE_DEVICES=7
torchrun --nnodes=1 --nproc_per_node=1 --rdzv_id=1 --rdzv_backend=c10d --rdzv_endpoint=localhost ./gpt2_test.py \
    --model gpt2-base \
    --save-dir results/gpt2-prune \
    --data-path ... \
    --cook-config configs/gpt2-prune.json

It seems that this is an error within the package bmtrain, so could you help figure out what happened or how to avoid it? Thanks a lot!

wln20 avatar May 19 '23 03:05 wln20

Sorry for the delay! This is probably caused by a CUDA version mismatch, so please check which CUDA version your environment is using. In general, CUDA 11 works.
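One rough way to check for such a mismatch is to compare the CUDA version PyTorch was built with against the toolkit that `nvcc` reports. The sketch below is only illustrative: the helper names are made up here, and it assumes the bmtrain extension was compiled with the `nvcc` found on your PATH, which may not be true in every environment.

```python
import re
import shutil
import subprocess


def cuda_major(version):
    """Return the major CUDA version as an int (e.g. '11.7' -> 11), or None."""
    match = re.match(r"\s*(\d+)", version or "")
    return int(match.group(1)) if match else None


def nvcc_cuda_version():
    """Return the toolkit version reported by nvcc (e.g. '11.7'), or None."""
    if shutil.which("nvcc") is None:
        return None  # no CUDA toolkit on PATH
    out = subprocess.run(
        ["nvcc", "--version"], capture_output=True, text=True
    ).stdout
    match = re.search(r"release (\d+\.\d+)", out)
    return match.group(1) if match else None


def torch_cuda_version():
    """Return the CUDA version PyTorch was built with, or None."""
    try:
        import torch
    except ImportError:
        return None  # PyTorch is not installed in this environment
    return torch.version.cuda  # None for CPU-only builds


def same_cuda_major(a, b):
    """True only if both versions are known and share a major version."""
    major_a, major_b = cuda_major(a), cuda_major(b)
    return major_a is not None and major_a == major_b


if __name__ == "__main__":
    torch_cuda = torch_cuda_version()
    toolkit_cuda = nvcc_cuda_version()
    print("PyTorch built with CUDA:", torch_cuda)
    print("nvcc toolkit CUDA:", toolkit_cuda)
    print("Same major version:", same_cuda_major(torch_cuda, toolkit_cuda))
```

Matching major versions do not rule out the problem: the missing symbol comes from NCCL, so the NCCL library that gets loaded at runtime may still differ from the one the extension was built against.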

gongbaitao avatar May 28 '23 03:05 gongbaitao

My CUDA version is 11.7 and I'm still hitting this issue. Why insist on using this annoying package bmtrain?

sjcfr avatar Jun 06 '23 12:06 sjcfr

> my cuda version is 11.7 and I'm still suffering this issue. Why insist on using this annoying package bmtrain?

I also encountered this problem (see https://github.com/OpenBMB/CPM-Bee/issues/18), and I have not been able to resolve it.

ryzn0518 avatar Jun 08 '23 14:06 ryzn0518