CodeGeeX: finetune_codegeex.sh fails to run

bash ./finetune_codegeex.sh
-------------------- end of arguments ---------------------
setting number of micro-batches to constant 14

building GPT2BPETokenizer tokenizer ...
WARNING: TensorBoard writing requested but is not available (are you using PyTorch 1.1.0 or later?), no TensorBoard logs will be written.
(rank=2) initializing process group: world_size=4 backend=nccl init_method=tcp://127.0.0.1:29500
(rank=1) initializing process group: world_size=4 backend=nccl init_method=tcp://127.0.0.1:29500
> (rank=3) initializing process group: world_size=4 backend=nccl init_method=tcp://127.0.0.1:29500

Traceback (most recent call last):
  File "/dump/1/where.wy/CodeGeeX/codegeex/megatron/tools/pretrain_codegeex.py", line 203, in <module>
    pretrain(
  File "/dump/1/where.wy/CodeGeeX/codegeex/megatron/training.py", line 110, in pretrain
    initialize_megatron(
  File "/dump/1/where.wy/CodeGeeX/codegeex/megatron/initialize.py", line 93, in initialize_megatron
    finish_mpu_init()
  File "/dump/1/where.wy/CodeGeeX/codegeex/megatron/initialize.py", line 79, in finish_mpu_init
    _set_random_seed(args.seed)
  File "/dump/1/where.wy/CodeGeeX/codegeex/megatron/initialize.py", line 300, in _set_random_seed
    mpu.model_parallel_cuda_manual_seed(seed)
  File "/home/where.wy/anaconda3/envs/python310/lib/python3.10/site-packages/deepspeed/runtime/activation_checkpointing/checkpointing.py", line 225, in model_parallel_cuda_manual_seed
    if dist.get_rank() == 0:
  File "/home/where.wy/anaconda3/envs/python310/lib/python3.10/site-packages/deepspeed/comm/comm.py", line 499, in get_rank
    assert cdb is not None and cdb.is_initialized(
AssertionError: DeepSpeed backend not set, please initialize it using init_process_group()
[2023-05-09 21:43:41,156] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 12610
[2023-05-09 21:43:41,165] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 12611
[2023-05-09 21:43:41,165] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 12612
[2023-05-09 21:43:41,173] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 12613
[2023-05-09 21:43:41,180] [ERROR] [launch.py:434:sigkill_handler]

May 09 '23 13:05 wangyang135

https://github.com/microsoft/DeepSpeed/issues/2168 This is a DeepSpeed bug. I was using version 0.9.1; uninstalling it and reinstalling version 0.6.3 fixed the problem.

May 09 '23 14:05 wangyang135
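
For anyone who hits the same assertion, a quick way to see which DeepSpeed version is active in the current environment is sketched below. It is only an illustration: the helper names (`installed_version`, `matches_known_good`) are made up for this example, and the only facts taken from the thread are that 0.9.1 failed and 0.6.3 worked. Other versions are untested here.

```python
from importlib import metadata

# Version the thread reports as working; nothing else has been verified.
KNOWN_GOOD = "0.6.3"


def installed_version(package="deepspeed"):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def matches_known_good(version, known_good=KNOWN_GOOD):
    """True only when the version string equals the one reported to work."""
    return version is not None and version.strip() == known_good


if __name__ == "__main__":
    found = installed_version()
    if matches_known_good(found):
        print(f"deepspeed {found} matches the version reported to work")
    else:
        print(f"deepspeed version found: {found}")
        print("to follow the fix above, try:")
        print(f"  pip uninstall -y deepspeed && pip install deepspeed=={KNOWN_GOOD}")
```

Run it inside the same conda environment that launches finetune_codegeex.sh, since the traceback shows DeepSpeed being imported from that environment's site-packages.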

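Could you share your settings for full-parameter fine-tuning? I currently have two 32 GB V100s, and I still run out of GPU memory even with ZeRO stage 3. I'd like to compare notes.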

Jul 26 '23 07:07 toufunao
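
A rough back-of-the-envelope estimate may help frame the out-of-memory question above. This is a sketch under stated assumptions, not an answer from the original poster: it assumes roughly 13 billion parameters for CodeGeeX and the common mixed-precision Adam accounting of about 16 bytes per parameter (2 for fp16 weights, 2 for fp16 gradients, 12 for fp32 master weights and the two Adam moments). Activations, buffers, and fragmentation are ignored, so real usage would be higher.

```python
def model_state_gib(num_params, bytes_per_param=16):
    """Memory for weights, gradients and optimizer states, in GiB."""
    return num_params * bytes_per_param / 2**30


# Assumptions for illustration only (not taken from the thread):
NUM_PARAMS = 13e9   # approximate CodeGeeX parameter count
NUM_GPUS = 2        # the two V100s mentioned in the question
GPU_MEMORY_GIB = 32

total = model_state_gib(NUM_PARAMS)
per_gpu = total / NUM_GPUS  # ZeRO stage 3 partitions model states across GPUs

print(f"total model states: {total:.1f} GiB")
print(f"per GPU with {NUM_GPUS}-way partitioning: {per_gpu:.1f} GiB")
print(f"fits in {GPU_MEMORY_GIB} GiB per GPU: {per_gpu <= GPU_MEMORY_GIB}")
```

Under these assumptions the partitioned model states alone exceed what each card holds, which would be consistent with the failure described. Whether CPU offloading or more GPUs would be enough in practice is not something this thread establishes.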