
finetune_codegeex.sh fails to run

Open wangyang135 opened this issue 2 years ago • 2 comments

bash ./finetune_codegeex.sh
-------------------- end of arguments ---------------------
setting number of micro-batches to constant 14

building GPT2BPETokenizer tokenizer ...
WARNING: TensorBoard writing requested but is not available (are you using PyTorch 1.1.0 or later?), no TensorBoard logs will be written.
(rank=2) initializing process group: world_size=4 backend=nccl init_method=tcp://127.0.0.1:29500
(rank=1) initializing process group: world_size=4 backend=nccl init_method=tcp://127.0.0.1:29500
> (rank=3) initializing process group: world_size=4 backend=nccl init_method=tcp://127.0.0.1:29500

Traceback (most recent call last):
  File "/dump/1/where.wy/CodeGeeX/codegeex/megatron/tools/pretrain_codegeex.py", line 203, in <module>
    pretrain(
  File "/dump/1/where.wy/CodeGeeX/codegeex/megatron/training.py", line 110, in pretrain
    initialize_megatron(
  File "/dump/1/where.wy/CodeGeeX/codegeex/megatron/initialize.py", line 93, in initialize_megatron
    finish_mpu_init()
  File "/dump/1/where.wy/CodeGeeX/codegeex/megatron/initialize.py", line 79, in finish_mpu_init
    _set_random_seed(args.seed)
  File "/dump/1/where.wy/CodeGeeX/codegeex/megatron/initialize.py", line 300, in _set_random_seed
    mpu.model_parallel_cuda_manual_seed(seed)
  File "/home/where.wy/anaconda3/envs/python310/lib/python3.10/site-packages/deepspeed/runtime/activation_checkpointing/checkpointing.py", line 225, in model_parallel_cuda_manual_seed
    if dist.get_rank() == 0:
  File "/home/where.wy/anaconda3/envs/python310/lib/python3.10/site-packages/deepspeed/comm/comm.py", line 499, in get_rank
    assert cdb is not None and cdb.is_initialized(
AssertionError: DeepSpeed backend not set, please initialize it using init_process_group()
[2023-05-09 21:43:41,156] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 12610
[2023-05-09 21:43:41,165] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 12611
[2023-05-09 21:43:41,165] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 12612
[2023-05-09 21:43:41,173] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 12613
[2023-05-09 21:43:41,180] [ERROR] [launch.py:434:sigkill_handler]

wangyang135 commented on May 09 '23 13:05
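For reference, the AssertionError in the traceback above comes from deepspeed.comm.get_rank() being called before the DeepSpeed communication backend has been initialized. A minimal sketch that reproduces the same assertion in isolation (assuming a DeepSpeed 0.9.x install; no distributed launcher needed):

# Reproduces "DeepSpeed backend not set": get_rank() is called before any
# process group has been initialized, so the internal backend is still None.
python -c "import deepspeed.comm as dist; dist.get_rank()"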

https://github.com/microsoft/DeepSpeed/issues/2168 This is a DeepSpeed bug. I was using version 0.9.1; uninstalling it and reinstalling version 0.6.3 fixes the problem.

wangyang135 commented on May 09 '23 14:05
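Based on the comment above, a minimal command sketch of the suggested downgrade (the 0.6.3 pin comes from the reporter; verify that it is compatible with your CodeGeeX checkout and CUDA setup before relying on it):

# Remove the 0.9.x release that triggers the assertion, pin the older release,
# confirm the installed version, then retry the finetuning script.
pip uninstall -y deepspeed
pip install deepspeed==0.6.3
python -c "import deepspeed; print(deepspeed.__version__)"
bash ./finetune_codegeex.sh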

May I ask what configuration you used for full-parameter fine-tuning? I currently have two 32 GB V100s, and I still run out of GPU memory even with ZeRO stage 3. I would appreciate the chance to compare notes.

toufunao commented on Jul 26 '23 07:07