[BUG]: exception: Encountered a bad command exit code!
🐛 Describe the bug
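Running the llama example with bash batch8_seq512.sh fails with exception: Encountered a bad command exit code! The gpt example runs successfully on a single machine, but launching it on multiple machines with colossalai run fails with the same error.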
(PyTorch-2.0.0) ma-user@bms-aiserver-pod05-97-200:/home/crq/ColossalAI/examples/language/llama/benchmark_7B/gemini_auto$ bash batch8_seq512.sh
bash: UID: readonly variable
Error: failed to run torchrun --nproc_per_node=8 --nnodes=2 --node_rank=0 --master_addr=10.155.97.200 --master_port=29500 benchmark.py --plugin gemini -l 512 -g -b 8 on 10.155.97.200, is localhost: False, exception: Encountered a bad command exit code!
Command: 'cd /home/xxx/ColossalAI/examples/language/llama && export NVIDIA_VISIBLE_DEVICES="all" CONDA_EXE="/home/ma-user/anaconda3/bin/conda" ENV_NAME="PyTorch-2.0.0" HOSTNAME="xxx" PIP_VERSION="20.3.3" NVIDIA_REQUIRE_CUDA="cuda>=11.7" brand="tesla" NCCL_VERSION="2.13.4-1" NCCL_SOCKET_IFNAME="eth0" PWD="/home/xxx/ColossalAI/examples/language/llama" CONDA_PREFIX="/home/xxx/envs/PyTorch-2.0.0" NVIDIA_DRIVER_CAPABILITIES="compute,utility" CUDA_PKG_VERSION="11-7=11.7.99-1" NCCL_IB_HCA="mlx5_1:1,mlx5_2:1,mlx5_3:1,mlx5_4:1" HOME="/home/xxx" LANG="C.UTF-8" NCCL_IB_GID_INDEX="3" LS_COLORS="rs=0:di=01;34:ln=01;36:mh=00:pi=40;33:so=01;35:do=01;35:bd=40;33;01:cd=40;33;01:or=40;31;01:mi=00:su=37;41:sg=30;43:ca=30;41:tw=30;42:ow=34;42:st=37;44:ex=01;32:.tar=01;31:.tgz=01;31:.arc=01;31:.arj=01;31:.taz=01;31:.lha=01;31:.lz4=01;31:.lzh=01;31:.lzma=01;31:.tlz=01;31:.txz=01;31:.tzo=01;31:.t7z=01;31:.zip=01;31:.z=01;31:.dz=01;31:.gz=01;31:.lrz=01;31:.lz=01;31:.lzo=01;31:.xz=01;31:.zst=01;31:.tzst=01;31:.bz2=01;31:.bz=01;31:.tbz=01;31:.tbz2=01;31:.tz=01;31:.deb=01;31:.rpm=01;31:.jar=01;31:.war=01;31:.ear=01;31:.sar=01;31:.rar=01;31:.alz=01;31:.ace=01;31:.zoo=01;31:.cpio=01;31:.7z=01;31:.rz=01;31:.cab=01;31:.wim=01;31:.swm=01;31:.dwm=01;31:.esd=01;31:.jpg=01;35:.jpeg=01;35:.mjpg=01;35:.mjpeg=01;35:.gif=01;35:.bmp=01;35:.pbm=01;35:.pgm=01;35:.ppm=01;35:.tga=01;35:.xbm=01;35:.xpm=01;35:.tif=01;35:.tiff=01;35:.png=01;35:.svg=01;35:.svgz=01;35:.mng=01;35:.pcx=01;35:.mov=01;35:.mpg=01;35:.mpeg=01;35:.m2v=01;35:.mkv=01;35:.webm=01;35:.ogm=01;35:.mp4=01;35:.m4v=01;35:.mp4v=01;35:.vob=01;35:.qt=01;35:.nuv=01;35:.wmv=01;35:.asf=01;35:.rm=01;35:.rmvb=01;35:.flc=01;35:.avi=01;35:.fli=01;35:.flv=01;35:.gl=01;35:.dl=01;35:.xcf=01;35:.xwd=01;35:.yuv=01;35:.cgm=01;35:.emf=01;35:.ogv=01;35:.ogx=01;35:.aac=00;36:.au=00;36:.flac=00;36:.m4a=00;36:.mid=00;36:.midi=00;36:.mka=00;36:.mp3=00;36:.mpc=00;36:.ogg=00;36:.ra=00;36:.wav=00;36:.oga=00;36:.opus=00;36:.spx=00;36:*.xspf=00;36:" 
ANACONDA_DIR="/home/xxx/anaconda3" IS_NEW_IMAGE_FRAMEWORK="true" CUDA_DEVICE_ORDER="PCI_BUS_ID" CUDA_VERSION="11.7.1" CONDA_PROMPT_MODIFIER="(PyTorch-2.0.0) " SITE_PACKAGES_PATH="/home/xxx/PyTorch-2.0.0/lib/python3.7/site-packages" LESSCLOSE="/usr/bin/lesspipe %s %s" PYTHONPATH="/usr/local/seccomponent/lib:/home/ma-user/infer/model/1" TERM="xterm" LESSOPEN="| /usr/bin/lesspipe %s" LIBRARY_PATH="/usr/local/cuda/lib64/stubs:" CONDA_SHLVL="2" NCCL_IB_TIMEOUT="23" MODELARTS_MODEL_PATH="/home/xxx/infer/model/1" SHLVL="2" MODELARTS_BATCH_PREDICT_URL="http://127.0.0.1:8080" CUDNN_VERSION="8.5.0.96" CONDA_PYTHON_EXE="/home/xxx/anaconda3/bin/python" LD_LIBRARY_PATH="/usr/local/cuda/lib64:/usr/local/cuda/extras/CUPTI/lib64:/usr/local/nvidia/lib:/usr/local/nvidia/lib64:/usr/lib/x86_64-linux-gnu/" NCCL_IB_DISABLE="0" CONDA_DEFAULT_ENV="PyTorch-2.0.0" PATH="/home/xxx/anaconda3/envs/PyTorch-2.0.0/bin:/usr/local/openmpi/bin:/usr/local/nvidia/bin:/usr/local/cuda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin" NCCL_IB_RETRY_CNT="7" CONDA_PREFIX_1="/home/xxx/anaconda3" UID="root" OLDPWD="/home/xxx/ColossalAI/examples/language/llama/benchmark_7B/gemini_auto" _="/home/xxx/anaconda3/envs/PyTorch-2.0.0/bin/colossalai" && torchrun --nproc_per_node=8 --nnodes=2 --node_rank=0 --master_addr=10.155.97.200 --master_port=29500 benchmark.py --plugin gemini -l 512 -g -b 8'
Exit code: 1
Stdout: already printed
Stderr: already printed
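A possible reading of the log above, offered as a hypothesis rather than a confirmed diagnosis: the generated command exports the whole local environment, including UID="root", and chains everything with &&. In bash, UID is a readonly variable, so the export step prints "bash: UID: readonly variable" and returns a non-zero status, which would stop the chain before torchrun ever starts. The sketch below shows one way such a command could be built while skipping bash's readonly variables. The helper name build_remote_command and the BASH_READONLY_VARS set are hypothetical and are not ColossalAI's actual launcher code.

```python
import shlex

# Variables that bash marks readonly by default; assigning to them fails
# with "<name>: readonly variable" and a non-zero exit status.
BASH_READONLY_VARS = {"BASHOPTS", "BASH_VERSINFO", "EUID", "PPID", "SHELLOPTS", "UID"}


def build_remote_command(env, workdir, command):
    """Hypothetical helper: build 'cd ... && export ... && command',
    leaving out variables that bash refuses to reassign."""
    exports = [
        f"{name}={shlex.quote(value)}"
        for name, value in env.items()
        if name not in BASH_READONLY_VARS
    ]
    parts = [f"cd {shlex.quote(workdir)}"]
    if exports:
        parts.append("export " + " ".join(exports))
    parts.append(command)
    # '&&' means any failing step aborts the rest, so every step must succeed.
    return " && ".join(parts)


cmd = build_remote_command(
    {"UID": "root", "NCCL_SOCKET_IFNAME": "eth0"},
    "/tmp",
    "echo ok",
)
print(cmd)
```

With the two-variable environment used here, UID is dropped and only NCCL_SOCKET_IFNAME is exported, so the chained command can reach its final step.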
Environment
pytorch 2.0.0
colossalai 0.3.0
transformers 4.33.1
CUDA Version: 12.2
Same problem