failed (exitcode: -11) local_rank: 5 (pid: 11514) of binary: /home/jovyan/data-ws-enr/zconda/envs/swift_ft/bin/python
Describe the bug
An error occurs when running multi-node LoRA fine-tuning:
failed (exitcode: -11) local_rank: 5 (pid: 11514) of binary: /home/jovyan/data-ws-enr/zconda/envs/swift_ft/bin/python
Traceback (most recent call last):
File "/home/jovyan/data-ws-enr/zconda/envs/swift_ft/bin/torchrun", line 8, in
Here is my script:
#!/bin/bash
# Default parameters
VISIBLE_DEVICES="0,1,2,3,4,5,6,7" NNODES=4 NODE_RANK=0 MASTER_ADDR="10.178.141.248" MASTER_PORT=29500 NPROC_PER_NODE=8 MODEL_TYPE="qwen2-7b" MODEL_PATH="/home/jovyan/kys-workspace-zzzc/models/Qwen2-7B" DATASET="/home/jovyan/dataws1/fine-wenshu/data/patent_gpt4o/train_data_1000.json" MAX_LENGTH=32768 NUM_TRAIN_EPOCHS=1 BATCH_SIZE=1 LEARNING_RATE=1e-4 EVAL_STEPS=100 LOGGING_STEPS=10 SEQUENCE_PARALLEL_SIZE=4 DEEPSPEED="default-zero3" DDP_BACKEND="nccl" OUTPUT_DIR="/home/jovyan/dataws1/fine-wenshu/model/qwen2_7b_patent_model" GRADIENT_CHECKPOINTING=true USE_FLASH_ATTN=true LAZY_TOKENIZE=true CHECK_MODEL_IS_LATEST=false SAVE_ON_EACH_NODE=false DISABLE_TQDM=true
# Parse command-line arguments
while [[ $# -gt 0 ]]; do
  key="$1"
  case $key in
    --visible_devices) VISIBLE_DEVICES="$2"; shift; shift ;;
    --nnodes) NNODES="$2"; shift; shift ;;
    --node_rank) NODE_RANK="$2"; shift; shift ;;
    --master_addr) MASTER_ADDR="$2"; shift; shift ;;
    --master_port) MASTER_PORT="$2"; shift; shift ;;
    --nproc_per_node) NPROC_PER_NODE="$2"; shift; shift ;;
    --sft_type) SFT_TYPE="$2"; shift; shift ;;
    --model_type) MODEL_TYPE="$2"; shift; shift ;;
    --model_path) MODEL_PATH="$2"; shift; shift ;;
    --dataset) DATASET="$2"; shift; shift ;;
    --max_length) MAX_LENGTH="$2"; shift; shift ;;
    --num_train_epochs) NUM_TRAIN_EPOCHS="$2"; shift; shift ;;
    --batch_size) BATCH_SIZE="$2"; shift; shift ;;
    --learning_rate) LEARNING_RATE="$2"; shift; shift ;;
    --eval_steps) EVAL_STEPS="$2"; shift; shift ;;
    --logging_steps) LOGGING_STEPS="$2"; shift; shift ;;
    --sequence_parallel_size) SEQUENCE_PARALLEL_SIZE="$2"; shift; shift ;;
    --deepspeed) DEEPSPEED="$2"; shift; shift ;;
    --ddp_backend) DDP_BACKEND="$2"; shift; shift ;;
    --output_dir) OUTPUT_DIR="$2"; shift; shift ;;
    --gradient_checkpointing) GRADIENT_CHECKPOINTING="$2"; shift; shift ;;
    --use_flash_attn) USE_FLASH_ATTN="$2"; shift; shift ;;
    --lazy_tokenize) LAZY_TOKENIZE="$2"; shift; shift ;;
    --check_model_is_latest) CHECK_MODEL_IS_LATEST="$2"; shift; shift ;;
    --save_on_each_node) SAVE_ON_EACH_NODE="$2"; shift; shift ;;
    --disable_tqdm) DISABLE_TQDM="$2"; shift; shift ;;
    *) echo "Unknown argument: $1"; exit 1 ;;
  esac
done
# Define the cleanup function
cleanup() {
  echo "Caught an abnormal exit, running cleanup..."
  kill -9 $(lsof -t -i :$MASTER_PORT)
  rm -rf $OUTPUT_DIR
}
# Set a trap to catch errors and exit signals
trap cleanup ERR EXIT
# Run the command
CUDA_VISIBLE_DEVICES=$VISIBLE_DEVICES \
NNODES=$NNODES \
NODE_RANK=$NODE_RANK \
MASTER_ADDR=$MASTER_ADDR \
MASTER_PORT=$MASTER_PORT \
NPROC_PER_NODE=$NPROC_PER_NODE \
swift sft \
  --model_type qwen2-7b-instruct \
  --model_id_or_path /home/jovyan/kys-workspace-zzzc/models/Qwen2-7B-Instruct \
  --model_revision master \
  --sft_type lora \
  --tuner_backend peft \
  --template_type AUTO \
  --dtype AUTO \
  --output_dir /home/jovyan/dataws1/fine-wenshu/model/qwen2_7b_patent_model \
  --dataset /home/jovyan/dataws1/fine-wenshu/data/patent_gpt4o/train_data_1000.json \
  --val_dataset /home/jovyan/dataws1/fine-wenshu/data/patent_gpt4o/dev_data_1000.json \
  --use_loss_scale true \
  --num_train_epochs 2 \
  --max_length $MAX_LENGTH \
  --truncation_strategy delete \
  --check_dataset_strategy warning \
  --lora_rank 5 \
  --lora_alpha 32 \
  --lora_dropout_p 0.05 \
  --lora_target_modules ALL \
  --gradient_checkpointing true \
  --batch_size 1 \
  --eval_batch_size 1 \
  --weight_decay 0.1 \
  --learning_rate 1e-4 \
  --gradient_accumulation_steps 16 \
  --max_grad_norm 0.5 \
  --warmup_ratio 0.03 \
  --eval_steps 100 \
  --save_steps 100 \
  --save_total_limit 2 \
  --logging_steps 10 \
  --use_flash_attn false \
  --self_cognition_sample 0 \
  --deepspeed default-zero3 \
  --sequence_parallel_size $SEQUENCE_PARALLEL_SIZE \
  --ddp_backend $DDP_BACKEND \
  --gradient_checkpointing $GRADIENT_CHECKPOINTING \
  --check_model_is_latest $CHECK_MODEL_IS_LATEST \
  --save_on_each_node $SAVE_ON_EACH_NODE \
  --disable_tqdm $DISABLE_TQDM
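For reference, assuming the script above is saved as run_sft.sh (a hypothetical name), the multi-node run is started by invoking it once on each machine, and only --node_rank differs per node, roughly like this:

# on the master node (rank 0)
bash run_sft.sh --node_rank 0 --nnodes 4 --master_addr 10.178.141.248 --master_port 29500
# on each of the other three nodes, the same command with --node_rank 1, 2, or 3
bash run_sft.sh --node_rank 1 --nnodes 4 --master_addr 10.178.141.248 --master_port 29500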
Could you help take a look at this?
Is it hanging, or is it reporting an error?
If it is an error, is there a swift-related stack trace?
There is no error message. With check_dataset_strategy set to warning, no data is printed, so it looks like training hasn't even started. Is it stuck?
W0726 18:36:31.897000 140477538637632 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 24503 closing signal SIGTERM
W0726 18:36:31.898000 140477538637632 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 24504 closing signal SIGTERM
W0726 18:36:31.899000 140477538637632 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 24505 closing signal SIGTERM
W0726 18:36:31.900000 140477538637632 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 24506 closing signal SIGTERM
W0726 18:36:31.901000 140477538637632 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 24507 closing signal SIGTERM
W0726 18:36:31.901000 140477538637632 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 24508 closing signal SIGTERM
W0726 18:36:31.903000 140477538637632 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 24509 closing signal SIGTERM
W0726 18:37:01.903000 140477538637632 torch/distributed/elastic/multiprocessing/api.py:875] Unable to shutdown process 24503 via Signals.SIGTERM, forcefully exiting via Signals.SIGKILL
W0726 18:37:04.172000 140477538637632 torch/distributed/elastic/multiprocessing/api.py:875] Unable to shutdown process 24504 via Signals.SIGTERM, forcefully exiting via Signals.SIGKILL
W0726 18:37:05.692000 140477538637632 torch/distributed/elastic/multiprocessing/api.py:875] Unable to shutdown process 24505 via Signals.SIGTERM, forcefully exiting via Signals.SIGKILL
W0726 18:37:07.426000 140477538637632 torch/distributed/elastic/multiprocessing/api.py:875] Unable to shutdown process 24506 via Signals.SIGTERM, forcefully exiting via Signals.SIGKILL
W0726 18:37:09.044000 140477538637632 torch/distributed/elastic/multiprocessing/api.py:875] Unable to shutdown process 24507 via Signals.SIGTERM, forcefully exiting via Signals.SIGKILL
W0726 18:37:10.729000 140477538637632 torch/distributed/elastic/multiprocessing/api.py:875] Unable to shutdown process 24508 via Signals.SIGTERM, forcefully exiting via Signals.SIGKILL
W0726 18:37:12.627000 140477538637632 torch/distributed/elastic/multiprocessing/api.py:875] Unable to shutdown process 24509 via Signals.SIGTERM, forcefully exiting via Signals.SIGKILL
E0726 18:37:14.260000 140477538637632 torch/distributed/elastic/multiprocessing/api.py:833] failed (exitcode: -11) local_rank: 7 (pid: 24510) of binary: /home/jovyan/data-ws-enr/zconda/envs/swift_ft/bin/python
Traceback (most recent call last):
File "/home/jovyan/data-ws-enr/zconda/envs/swift_ft/bin/torchrun", line 8, in
sys.exit(main())
File "/home/jovyan/data-ws-enr/zconda/envs/swift_ft/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/init.py", line 348, in wrapper
return f(*args, **kwargs)
File "/home/jovyan/data-ws-enr/zconda/envs/swift_ft/lib/python3.10/site-packages/torch/distributed/run.py", line 901, in main
run(args)
File "/home/jovyan/data-ws-enr/zconda/envs/swift_ft/lib/python3.10/site-packages/torch/distributed/run.py", line 892, in run
elastic_launch(
File "/home/jovyan/data-ws-enr/zconda/envs/swift_ft/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 133, in call
return launch_agent(self._config, self._entrypoint, list(args))
File "/home/jovyan/data-ws-enr/zconda/envs/swift_ft/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
/home/jovyan/dataws1/swift/swift/swift/cli/sft.py FAILED
Failures: <NO_OTHER_FAILURES>
Root Cause (first observed failure):
[0]:
  time : 2024-07-26_18:36:31
  host : zhangzhenzhong-fine-tuning-m-0.zhangzhenzhong-fine-tuning.prdsafe.svc.hbox2-zzzc2-prd.local
  rank : 7 (local_rank: 7)
  exitcode : -11 (pid: 24510)
  error_file: <N/A>
  traceback : Signal 11 (SIGSEGV) received by PID 24510
py-spy dump <pid> can show where it is stuck.
Take a look.
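For reference, a typical invocation looks something like this (just a sketch; it assumes py-spy is installed in the training environment, and the PID is whatever ps shows for the still-running worker):

pip install py-spy             # install into the same conda env if it is missing
ps -ef | grep 'swift sft'      # find the PID of the hung training worker
py-spy dump --pid <pid>        # print that worker's current Python stack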
Error: Failed to get process executable name. Check that the process is running. Reason: No such file or directory (os error 2) Reason: No such file or directory (os error 2)
Is this an environment configuration issue? The same problem seems to occur on a single machine as well.
NVIDIA-SMI 535.54.03 Driver Version: 535.54.03 CUDA Version: 12.2
nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Tue_Aug_15_22:02:13_PDT_2023
Cuda compilation tools, release 12.2, V12.2.140
Build cuda_12.2.r12.2/compiler.33191640_0
deepspeed:0.14.5+unknown
torch:2.4.0+cu121
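For reference, the installed torch build, its CUDA runtime, and the visible GPUs can be double-checked with a one-liner like this (just a sanity-check sketch, run inside the swift_ft environment):

python -c "import torch; print(torch.__version__, torch.version.cuda, torch.cuda.is_available(), torch.cuda.device_count())"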
Could it be an environment problem?
Then, when I debugged on a single machine, this problem appeared right at the start of training:
[INFO:swift] The logging file will be saved in: /home/jovyan/dataws1/fine-wenshu/model/qwen2_7b_patent_model/qwen2-7b-instruct/v7-20240727-152355/logging.jsonl
Train: 0%| | 0/4 [00:00<?, ?it/s]/home/jovyan/data-ws-enr/zconda/envs/swift_ft/lib/python3.10/site-packages/torch/utils/checkpoint.py:1399: FutureWarning: torch.cpu.amp.autocast(args...) is deprecated. Please use torch.amp.autocast('cpu', args...) instead.
with device_autocast_ctx, torch.cpu.amp.autocast(**cpu_autocast_kwargs), recompute_context: # type: ignore[attr-defined]
(the same FutureWarning is repeated by each of the remaining worker processes)
W0727 15:34:57.720000 139676076163712 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 3102 closing signal SIGTERM
W0727 15:34:57.720000 139676076163712 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 3103 closing signal SIGTERM
W0727 15:34:57.721000 139676076163712 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 3105 closing signal SIGTERM
W0727 15:34:57.722000 139676076163712 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 3107 closing signal SIGTERM
W0727 15:35:27.722000 139676076163712 torch/distributed/elastic/multiprocessing/api.py:875] Unable to shutdown process 3102 via Signals.SIGTERM, forcefully exiting via Signals.SIGKILL
W0727 15:35:27.968000 139676076163712 torch/distributed/elastic/multiprocessing/api.py:875] Unable to shutdown process 3103 via Signals.SIGTERM, forcefully exiting via Signals.SIGKILL
W0727 15:35:28.202000 139676076163712 torch/distributed/elastic/multiprocessing/api.py:875] Unable to shutdown process 3105 via Signals.SIGTERM, forcefully exiting via Signals.SIGKILL
W0727 15:35:28.518000 139676076163712 torch/distributed/elastic/multiprocessing/api.py:875] Unable to shutdown process 3107 via Signals.SIGTERM, forcefully exiting via Signals.SIGKILL
E0727 15:35:28.777000 139676076163712 torch/distributed/elastic/multiprocessing/api.py:833] failed (exitcode: -11) local_rank: 0 (pid: 3101) of binary: /home/jovyan/data-ws-enr/zconda/envs/swift_ft/bin/python
Traceback (most recent call last):
File "/home/jovyan/data-ws-enr/zconda/envs/swift_ft/bin/torchrun", line 8, in
sys.exit(main())
File "/home/jovyan/data-ws-enr/zconda/envs/swift_ft/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/init.py", line 348, in wrapper
return f(*args, **kwargs)
File "/home/jovyan/data-ws-enr/zconda/envs/swift_ft/lib/python3.10/site-packages/torch/distributed/run.py", line 901, in main
run(args)
File "/home/jovyan/data-ws-enr/zconda/envs/swift_ft/lib/python3.10/site-packages/torch/distributed/run.py", line 892, in run
elastic_launch(
File "/home/jovyan/data-ws-enr/zconda/envs/swift_ft/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 133, in call
return launch_agent(self._config, self._entrypoint, list(args))
File "/home/jovyan/data-ws-enr/zconda/envs/swift_ft/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
/home/jovyan/dataws1/swift/swift/swift/cli/sft.py FAILED
export NCCL_P2P_DISABLE=1
export NCCL_IB_DISABLE=1
export NCCL_DEBUG=info
export NCCL_SOCKET_IFNAME=eth0
Adding these fixed the single-machine run, but the multi-machine run seems to depend on max_length: it succeeds when max_length is set to 2048 and fails at the original length. Why is that? Shouldn't it raise an OOM error instead?
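For what it's worth, one rough way to tell a host-side OOM kill apart from a genuine segfault is to check the kernel log on the failing node right after the crash (a sketch; may need root, and the exact messages vary by kernel):

sudo dmesg -T | grep -i -E 'out of memory|oom-killer|segfault'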
Have you solved this? I'm also hitting this ChildFailedError when doing LoRA training with qwen-vl; it happens after training has run for a long time, and the final message is likewise "traceback : Signal 11 (SIGSEGV) received by PID xxx". To rule out GPU memory issues, I deliberately switched to the 2B model, enabled bf16, and compressed the frame rate, pixels, and token count, so I have confirmed memory is sufficient.
I ran into this problem on a single machine as well.
Have you solved this problem?