bash ./finetune_codegeex.sh
-------------------- end of arguments ---------------------
setting number of micro-batches to constant 14
building GPT2BPETokenizer tokenizer ...
WARNING: TensorBoard writing requested but is not available (are you using PyTorch 1.1.0 or later?), no TensorBoard logs will be written.
(rank=2) initializing process group: world_size=4 backend=nccl init_method=tcp://127.0.0.1:29500
(rank=1) initializing process group: world_size=4 backend=nccl init_method=tcp://127.0.0.1:29500
(rank=3) initializing process group: world_size=4 backend=nccl init_method=tcp://127.0.0.1:29500
Traceback (most recent call last):
File "/dump/1/where.wy/CodeGeeX/codegeex/megatron/tools/pretrain_codegeex.py", line 203, in <module>
pretrain(
File "/dump/1/where.wy/CodeGeeX/codegeex/megatron/training.py", line 110, in pretrain
initialize_megatron(
File "/dump/1/where.wy/CodeGeeX/codegeex/megatron/initialize.py", line 93, in initialize_megatron
finish_mpu_init()
File "/dump/1/where.wy/CodeGeeX/codegeex/megatron/initialize.py", line 79, in finish_mpu_init
_set_random_seed(args.seed)
File "/dump/1/where.wy/CodeGeeX/codegeex/megatron/initialize.py", line 300, in _set_random_seed
mpu.model_parallel_cuda_manual_seed(seed)
File "/home/where.wy/anaconda3/envs/python310/lib/python3.10/site-packages/deepspeed/runtime/activation_checkpointing/checkpointing.py", line 225, in model_parallel_cuda_manual_seed
if dist.get_rank() == 0:
File "/home/where.wy/anaconda3/envs/python310/lib/python3.10/site-packages/deepspeed/comm/comm.py", line 499, in get_rank
assert cdb is not None and cdb.is_initialized(
AssertionError: DeepSpeed backend not set, please initialize it using init_process_group()
[2023-05-09 21:43:41,156] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 12610
[2023-05-09 21:43:41,165] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 12611
[2023-05-09 21:43:41,165] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 12612
[2023-05-09 21:43:41,173] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 12613
[2023-05-09 21:43:41,180] [ERROR] [launch.py:434:sigkill_handler]
https://github.com/microsoft/DeepSpeed/issues/2168
This is a DeepSpeed bug. I was using version 0.9.1; uninstalling it and reinstalling version 0.6.3 fixed the problem.
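Before rerunning the script, it may help to confirm which DeepSpeed version the environment actually picks up. The following is only a minimal sketch: the helper names (`parse_version`, `installed_deepspeed_version`, `advice`) are made up for illustration, and the only version facts it relies on are the two reported in this thread (0.9.1 failed, 0.6.3 worked).

```python
from importlib import metadata

# Versions reported in this thread; not an official compatibility table.
KNOWN_GOOD = "0.6.3"  # reinstalling this version fixed the error here
KNOWN_BAD = "0.9.1"   # this version raised the AssertionError above


def parse_version(text):
    """Turn '0.9.1' or '0.9.1+abc' into a tuple of ints such as (0, 9, 1)."""
    core = text.split("+")[0]  # drop any local build suffix
    parts = []
    for piece in core.split("."):
        leading = ""
        for ch in piece:
            if not ch.isdigit():
                break
            leading += ch
        if not leading:
            break
        parts.append(int(leading))
    return tuple(parts)


def installed_deepspeed_version():
    """Return the installed deepspeed version string, or None if absent."""
    try:
        return metadata.version("deepspeed")
    except metadata.PackageNotFoundError:
        return None


def advice(version):
    """Compare a version string with the ones reported in this thread."""
    if version is None:
        return "deepspeed is not installed in this environment"
    if parse_version(version) == parse_version(KNOWN_GOOD):
        return f"deepspeed {version} matches the version reported to work"
    if parse_version(version) == parse_version(KNOWN_BAD):
        return f"deepspeed {version} is the version that failed in this thread"
    return f"deepspeed {version} was not tested in this thread"


if __name__ == "__main__":
    print(advice(installed_deepspeed_version()))
```

If the check reports a different version, the downgrade described above would be `pip uninstall deepspeed` followed by `pip install deepspeed==0.6.3`, run inside the same conda environment that launches `finetune_codegeex.sh`.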
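What settings are you using for full-parameter fine-tuning? I currently have two 32 GB V100s, and I still run out of GPU memory even with ZeRO stage 3. I would like to compare notes.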