
failed (exitcode: -11) local_rank: 5 (pid: 11514) of binary: /home/jovyan/data-ws-enr/zconda/envs/swift_ft/bin/python

Open shyzzz521 opened this issue 1 year ago • 8 comments

Describe the bug
An error occurs during multi-node LoRA fine-tuning:

```
failed (exitcode: -11) local_rank: 5 (pid: 11514) of binary: /home/jovyan/data-ws-enr/zconda/envs/swift_ft/bin/python
Traceback (most recent call last):
  File "/home/jovyan/data-ws-enr/zconda/envs/swift_ft/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/home/jovyan/data-ws-enr/zconda/envs/swift_ft/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 348, in wrapper
    return f(*args, **kwargs)
  File "/home/jovyan/data-ws-enr/zconda/envs/swift_ft/lib/python3.10/site-packages/torch/distributed/run.py", line 901, in main
    run(args)
  File "/home/jovyan/data-ws-enr/zconda/envs/swift_ft/lib/python3.10/site-packages/torch/distributed/run.py", line 892, in run
    elastic_launch(
  File "/home/jovyan/data-ws-enr/zconda/envs/swift_ft/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 133, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/jovyan/data-ws-enr/zconda/envs/swift_ft/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
```

This is my script:

```bash
#!/bin/bash

# Default parameter settings
VISIBLE_DEVICES="0,1,2,3,4,5,6,7"
NNODES=4
NODE_RANK=0
MASTER_ADDR="10.178.141.248"
MASTER_PORT=29500
NPROC_PER_NODE=8
MODEL_TYPE="qwen2-7b"
MODEL_PATH="/home/jovyan/kys-workspace-zzzc/models/Qwen2-7B"
DATASET="/home/jovyan/dataws1/fine-wenshu/data/patent_gpt4o/train_data_1000.json"
MAX_LENGTH=32768
NUM_TRAIN_EPOCHS=1
BATCH_SIZE=1
LEARNING_RATE=1e-4
EVAL_STEPS=100
LOGGING_STEPS=10
SEQUENCE_PARALLEL_SIZE=4
DEEPSPEED="default-zero3"
DDP_BACKEND="nccl"
OUTPUT_DIR="/home/jovyan/dataws1/fine-wenshu/model/qwen2_7b_patent_model"
GRADIENT_CHECKPOINTING=true
USE_FLASH_ATTN=true
LAZY_TOKENIZE=true
CHECK_MODEL_IS_LATEST=false
SAVE_ON_EACH_NODE=false
DISABLE_TQDM=true

# Parse command-line arguments
while [[ $# -gt 0 ]]; do
  key="$1"
  case $key in
    --visible_devices) VISIBLE_DEVICES="$2"; shift; shift ;;
    --nnodes) NNODES="$2"; shift; shift ;;
    --node_rank) NODE_RANK="$2"; shift; shift ;;
    --master_addr) MASTER_ADDR="$2"; shift; shift ;;
    --master_port) MASTER_PORT="$2"; shift; shift ;;
    --nproc_per_node) NPROC_PER_NODE="$2"; shift; shift ;;
    --sft_type) SFT_TYPE="$2"; shift; shift ;;
    --model_type) MODEL_TYPE="$2"; shift; shift ;;
    --model_path) MODEL_PATH="$2"; shift; shift ;;
    --dataset) DATASET="$2"; shift; shift ;;
    --max_length) MAX_LENGTH="$2"; shift; shift ;;
    --num_train_epochs) NUM_TRAIN_EPOCHS="$2"; shift; shift ;;
    --batch_size) BATCH_SIZE="$2"; shift; shift ;;
    --learning_rate) LEARNING_RATE="$2"; shift; shift ;;
    --eval_steps) EVAL_STEPS="$2"; shift; shift ;;
    --logging_steps) LOGGING_STEPS="$2"; shift; shift ;;
    --sequence_parallel_size) SEQUENCE_PARALLEL_SIZE="$2"; shift; shift ;;
    --deepspeed) DEEPSPEED="$2"; shift; shift ;;
    --ddp_backend) DDP_BACKEND="$2"; shift; shift ;;
    --output_dir) OUTPUT_DIR="$2"; shift; shift ;;
    --gradient_checkpointing) GRADIENT_CHECKPOINTING="$2"; shift; shift ;;
    --use_flash_attn) USE_FLASH_ATTN="$2"; shift; shift ;;
    --lazy_tokenize) LAZY_TOKENIZE="$2"; shift; shift ;;
    --check_model_is_latest) CHECK_MODEL_IS_LATEST="$2"; shift; shift ;;
    --save_on_each_node) SAVE_ON_EACH_NODE="$2"; shift; shift ;;
    --disable_tqdm) DISABLE_TQDM="$2"; shift; shift ;;
    *) echo "Unknown argument $1"; exit 1 ;;
  esac
done

# Define the cleanup function
cleanup() {
  echo "Abnormal exit caught, running cleanup..."
  kill -9 $(lsof -t -i :$MASTER_PORT)
  rm -rf $OUTPUT_DIR
}

# Set a trap to catch errors and exit signals
trap cleanup ERR EXIT

# Run the command
CUDA_VISIBLE_DEVICES=$VISIBLE_DEVICES \
NNODES=$NNODES \
NODE_RANK=$NODE_RANK \
MASTER_ADDR=$MASTER_ADDR \
MASTER_PORT=$MASTER_PORT \
NPROC_PER_NODE=$NPROC_PER_NODE \
swift sft \
    --model_type qwen2-7b-instruct \
    --model_id_or_path /home/jovyan/kys-workspace-zzzc/models/Qwen2-7B-Instruct \
    --model_revision master \
    --sft_type lora \
    --tuner_backend peft \
    --template_type AUTO \
    --dtype AUTO \
    --output_dir /home/jovyan/dataws1/fine-wenshu/model/qwen2_7b_patent_model \
    --dataset /home/jovyan/dataws1/fine-wenshu/data/patent_gpt4o/train_data_1000.json \
    --val_dataset /home/jovyan/dataws1/fine-wenshu/data/patent_gpt4o/dev_data_1000.json \
    --use_loss_scale true \
    --num_train_epochs 2 \
    --max_length $MAX_LENGTH \
    --truncation_strategy delete \
    --check_dataset_strategy warning \
    --lora_rank 5 \
    --lora_alpha 32 \
    --lora_dropout_p 0.05 \
    --lora_target_modules ALL \
    --gradient_checkpointing true \
    --batch_size 1 \
    --eval_batch_size 1 \
    --weight_decay 0.1 \
    --learning_rate 1e-4 \
    --gradient_accumulation_steps 16 \
    --max_grad_norm 0.5 \
    --warmup_ratio 0.03 \
    --eval_steps 100 \
    --save_steps 100 \
    --save_total_limit 2 \
    --logging_steps 10 \
    --use_flash_attn false \
    --self_cognition_sample 0 \
    --deepspeed default-zero3 \
    --sequence_parallel_size $SEQUENCE_PARALLEL_SIZE \
    --ddp_backend $DDP_BACKEND \
    --gradient_checkpointing $GRADIENT_CHECKPOINTING \
    --check_model_is_latest $CHECK_MODEL_IS_LATEST \
    --save_on_each_node $SAVE_ON_EACH_NODE \
    --disable_tqdm $DISABLE_TQDM
```

Could you please take a look and help resolve this?

shyzzz521 avatar Jul 26 '24 08:07 shyzzz521

Is it hanging, or is it throwing an error?

If it's an error, is there a swift-related error stack trace?

Jintao-Huang avatar Jul 26 '24 10:07 Jintao-Huang

There is no error message. With check_dataset_strategy set to warning, no data is printed, so it looks like training never starts. Is it stuck?

```
W0726 18:36:31.897000 140477538637632 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 24503 closing signal SIGTERM
W0726 18:36:31.898000 140477538637632 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 24504 closing signal SIGTERM
W0726 18:36:31.899000 140477538637632 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 24505 closing signal SIGTERM
W0726 18:36:31.900000 140477538637632 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 24506 closing signal SIGTERM
W0726 18:36:31.901000 140477538637632 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 24507 closing signal SIGTERM
W0726 18:36:31.901000 140477538637632 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 24508 closing signal SIGTERM
W0726 18:36:31.903000 140477538637632 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 24509 closing signal SIGTERM
W0726 18:37:01.903000 140477538637632 torch/distributed/elastic/multiprocessing/api.py:875] Unable to shutdown process 24503 via Signals.SIGTERM, forcefully exiting via Signals.SIGKILL
W0726 18:37:04.172000 140477538637632 torch/distributed/elastic/multiprocessing/api.py:875] Unable to shutdown process 24504 via Signals.SIGTERM, forcefully exiting via Signals.SIGKILL
W0726 18:37:05.692000 140477538637632 torch/distributed/elastic/multiprocessing/api.py:875] Unable to shutdown process 24505 via Signals.SIGTERM, forcefully exiting via Signals.SIGKILL
W0726 18:37:07.426000 140477538637632 torch/distributed/elastic/multiprocessing/api.py:875] Unable to shutdown process 24506 via Signals.SIGTERM, forcefully exiting via Signals.SIGKILL
W0726 18:37:09.044000 140477538637632 torch/distributed/elastic/multiprocessing/api.py:875] Unable to shutdown process 24507 via Signals.SIGTERM, forcefully exiting via Signals.SIGKILL
W0726 18:37:10.729000 140477538637632 torch/distributed/elastic/multiprocessing/api.py:875] Unable to shutdown process 24508 via Signals.SIGTERM, forcefully exiting via Signals.SIGKILL
W0726 18:37:12.627000 140477538637632 torch/distributed/elastic/multiprocessing/api.py:875] Unable to shutdown process 24509 via Signals.SIGTERM, forcefully exiting via Signals.SIGKILL
E0726 18:37:14.260000 140477538637632 torch/distributed/elastic/multiprocessing/api.py:833] failed (exitcode: -11) local_rank: 7 (pid: 24510) of binary: /home/jovyan/data-ws-enr/zconda/envs/swift_ft/bin/python
Traceback (most recent call last):
  File "/home/jovyan/data-ws-enr/zconda/envs/swift_ft/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/home/jovyan/data-ws-enr/zconda/envs/swift_ft/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 348, in wrapper
    return f(*args, **kwargs)
  File "/home/jovyan/data-ws-enr/zconda/envs/swift_ft/lib/python3.10/site-packages/torch/distributed/run.py", line 901, in main
    run(args)
  File "/home/jovyan/data-ws-enr/zconda/envs/swift_ft/lib/python3.10/site-packages/torch/distributed/run.py", line 892, in run
    elastic_launch(
  File "/home/jovyan/data-ws-enr/zconda/envs/swift_ft/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 133, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/jovyan/data-ws-enr/zconda/envs/swift_ft/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

/home/jovyan/dataws1/swift/swift/swift/cli/sft.py FAILED

Failures:
  <NO_OTHER_FAILURES>

Root Cause (first observed failure):
[0]:
  time      : 2024-07-26_18:36:31
  host      : zhangzhenzhong-fine-tuning-m-0.zhangzhenzhong-fine-tuning.prdsafe.svc.hbox2-zzzc2-prd.local
  rank      : 7 (local_rank: 7)
  exitcode  : -11 (pid: 24510)
  error_file: <N/A>
  traceback : Signal 11 (SIGSEGV) received by PID 24510
```

shyzzz521 avatar Jul 26 '24 11:07 shyzzz521
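Exit code -11 means the worker process died from SIGSEGV, so the Python logs themselves usually show nothing further. A minimal sketch of one generic way to capture more context before relaunching, assuming Linux worker nodes where core dumps are allowed; `run_sft.sh` is a hypothetical name for the script posted above:

```bash
# Let crashing workers leave a core file behind (its name and location depend
# on the system's core_pattern setting).
ulimit -c unlimited

# Print a Python-level stack trace when a fatal signal such as SIGSEGV hits.
export PYTHONFAULTHANDLER=1

# Extra NCCL / torch.distributed logging helps separate communication failures
# from crashes inside a compiled extension.
export NCCL_DEBUG=INFO
export TORCH_DISTRIBUTED_DEBUG=DETAIL

bash run_sft.sh --node_rank 0   # hypothetical wrapper around the posted script

# After a crash, the core file can be opened to see which library segfaulted:
#   gdb /home/jovyan/data-ws-enr/zconda/envs/swift_ft/bin/python <core-file>
```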

You can use `py-spy dump <pid>` to see where it is stuck. Take a look.

Jintao-Huang avatar Jul 26 '24 11:07 Jintao-Huang
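A minimal sketch of that check, assuming `py-spy` is installed in the same conda environment and the workers are still alive when it runs (inside containers it may additionally need root or the SYS_PTRACE capability); the `pgrep` pattern simply matches the `swift/cli/sft.py` entry point shown in the log above:

```bash
pip install py-spy

# Dump the current Python stack of every still-running sft worker to see
# where each rank is stuck.
for pid in $(pgrep -f 'swift/cli/sft.py'); do
    echo "=== PID $pid ==="
    py-spy dump --pid "$pid"
done
```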

```
Error: Failed to get process executable name. Check that the process is running.
Reason: No such file or directory (os error 2)
Reason: No such file or directory (os error 2)
```

shyzzz521 avatar Jul 26 '24 11:07 shyzzz521

> You can use `py-spy dump <pid>` to see where it is stuck. Take a look.

Is this an environment configuration problem? The same issue seems to occur on a single machine as well.

shyzzz521 avatar Jul 27 '24 14:07 shyzzz521

NVIDIA-SMI 535.54.03 Driver Version: 535.54.03 CUDA Version: 12.2

```
nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Tue_Aug_15_22:02:13_PDT_2023
Cuda compilation tools, release 12.2, V12.2.140
Build cuda_12.2.r12.2/compiler.33191640_0
```

deepspeed:0.14.5+unknown

torch:2.4.0+cu121

Could it be an environment problem?

shyzzz521 avatar Jul 27 '24 15:07 shyzzz521

Then, when I debugged on a single machine, this problem appeared right at the start of training:

```
[INFO:swift] The logging file will be saved in: /home/jovyan/dataws1/fine-wenshu/model/qwen2_7b_patent_model/qwen2-7b-instruct/v7-20240727-152355/logging.jsonl
Train: 0%| | 0/4 [00:00<?, ?it/s]
/home/jovyan/data-ws-enr/zconda/envs/swift_ft/lib/python3.10/site-packages/torch/utils/checkpoint.py:1399: FutureWarning: torch.cpu.amp.autocast(args...) is deprecated. Please use torch.amp.autocast('cpu', args...) instead.
  with device_autocast_ctx, torch.cpu.amp.autocast(**cpu_autocast_kwargs), recompute_context:  # type: ignore[attr-defined]
/home/jovyan/data-ws-enr/zconda/envs/swift_ft/lib/python3.10/site-packages/torch/utils/checkpoint.py:1399: FutureWarning: torch.cpu.amp.autocast(args...) is deprecated. Please use torch.amp.autocast('cpu', args...) instead.
  with device_autocast_ctx, torch.cpu.amp.autocast(**cpu_autocast_kwargs), recompute_context:  # type: ignore[attr-defined]
/home/jovyan/data-ws-enr/zconda/envs/swift_ft/lib/python3.10/site-packages/torch/utils/checkpoint.py:1399: FutureWarning: torch.cpu.amp.autocast(args...) is deprecated. Please use torch.amp.autocast('cpu', args...) instead.
  with device_autocast_ctx, torch.cpu.amp.autocast(**cpu_autocast_kwargs), recompute_context:  # type: ignore[attr-defined]
/home/jovyan/data-ws-enr/zconda/envs/swift_ft/lib/python3.10/site-packages/torch/utils/checkpoint.py:1399: FutureWarning: torch.cpu.amp.autocast(args...) is deprecated. Please use torch.amp.autocast('cpu', args...) instead.
  with device_autocast_ctx, torch.cpu.amp.autocast(**cpu_autocast_kwargs), recompute_context:  # type: ignore[attr-defined]
/home/jovyan/data-ws-enr/zconda/envs/swift_ft/lib/python3.10/site-packages/torch/utils/checkpoint.py:1399: FutureWarning: torch.cpu.amp.autocast(args...) is deprecated. Please use torch.amp.autocast('cpu', args...) instead.
  with device_autocast_ctx, torch.cpu.amp.autocast(**cpu_autocast_kwargs), recompute_context:  # type: ignore[attr-defined]
/home/jovyan/data-ws-enr/zconda/envs/swift_ft/lib/python3.10/site-packages/torch/utils/checkpoint.py:1399: FutureWarning: torch.cpu.amp.autocast(args...) is deprecated. Please use torch.amp.autocast('cpu', args...) instead.
  with device_autocast_ctx, torch.cpu.amp.autocast(**cpu_autocast_kwargs), recompute_context:  # type: ignore[attr-defined]
/home/jovyan/data-ws-enr/zconda/envs/swift_ft/lib/python3.10/site-packages/torch/utils/checkpoint.py:1399: FutureWarning: torch.cpu.amp.autocast(args...) is deprecated. Please use torch.amp.autocast('cpu', args...) instead.
  with device_autocast_ctx, torch.cpu.amp.autocast(**cpu_autocast_kwargs), recompute_context:  # type: ignore[attr-defined]
/home/jovyan/data-ws-enr/zconda/envs/swift_ft/lib/python3.10/site-packages/torch/utils/checkpoint.py:1399: FutureWarning: torch.cpu.amp.autocast(args...) is deprecated. Please use torch.amp.autocast('cpu', args...) instead.
  with device_autocast_ctx, torch.cpu.amp.autocast(**cpu_autocast_kwargs), recompute_context:  # type: ignore[attr-defined]
W0727 15:34:57.720000 139676076163712 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 3102 closing signal SIGTERM
W0727 15:34:57.720000 139676076163712 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 3103 closing signal SIGTERM
W0727 15:34:57.721000 139676076163712 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 3105 closing signal SIGTERM
W0727 15:34:57.722000 139676076163712 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 3107 closing signal SIGTERM
W0727 15:35:27.722000 139676076163712 torch/distributed/elastic/multiprocessing/api.py:875] Unable to shutdown process 3102 via Signals.SIGTERM, forcefully exiting via Signals.SIGKILL
W0727 15:35:27.968000 139676076163712 torch/distributed/elastic/multiprocessing/api.py:875] Unable to shutdown process 3103 via Signals.SIGTERM, forcefully exiting via Signals.SIGKILL
W0727 15:35:28.202000 139676076163712 torch/distributed/elastic/multiprocessing/api.py:875] Unable to shutdown process 3105 via Signals.SIGTERM, forcefully exiting via Signals.SIGKILL
W0727 15:35:28.518000 139676076163712 torch/distributed/elastic/multiprocessing/api.py:875] Unable to shutdown process 3107 via Signals.SIGTERM, forcefully exiting via Signals.SIGKILL
E0727 15:35:28.777000 139676076163712 torch/distributed/elastic/multiprocessing/api.py:833] failed (exitcode: -11) local_rank: 0 (pid: 3101) of binary: /home/jovyan/data-ws-enr/zconda/envs/swift_ft/bin/python
Traceback (most recent call last):
  File "/home/jovyan/data-ws-enr/zconda/envs/swift_ft/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/home/jovyan/data-ws-enr/zconda/envs/swift_ft/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 348, in wrapper
    return f(*args, **kwargs)
  File "/home/jovyan/data-ws-enr/zconda/envs/swift_ft/lib/python3.10/site-packages/torch/distributed/run.py", line 901, in main
    run(args)
  File "/home/jovyan/data-ws-enr/zconda/envs/swift_ft/lib/python3.10/site-packages/torch/distributed/run.py", line 892, in run
    elastic_launch(
  File "/home/jovyan/data-ws-enr/zconda/envs/swift_ft/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 133, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/jovyan/data-ws-enr/zconda/envs/swift_ft/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
```

/home/jovyan/dataws1/swift/swift/swift/cli/sft.py FAILED

shyzzz521 avatar Jul 27 '24 15:07 shyzzz521

```
export NCCL_P2P_DISABLE=1
export NCCL_IB_DISABLE=1
export NCCL_DEBUG=info
export NCCL_SOCKET_IFNAME=eth0
```

Adding these fixed the single-machine run, but the multi-machine run seems related to max_length: it succeeds when set to 2048 and fails at the original length. Why is that? Shouldn't it report an OOM error instead?

shyzzz521 avatar Jul 28 '24 05:07 shyzzz521
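A minimal sketch of where those workarounds could live so that every worker inherits them, assuming they are exported before `swift sft` is launched; `run_sft.sh` is again a hypothetical name for the script posted earlier, and disabling P2P/IB is a workaround rather than a root-cause fix:

```bash
#!/bin/bash
# NCCL settings that made the single-node run pass in this thread.
export NCCL_P2P_DISABLE=1       # disable GPU peer-to-peer transport
export NCCL_IB_DISABLE=1        # disable InfiniBand; fall back to TCP sockets
export NCCL_DEBUG=info          # verbose NCCL logging for further diagnosis
export NCCL_SOCKET_IFNAME=eth0  # pin NCCL traffic to the eth0 interface

# Launch training as before; exported variables propagate to the torchrun workers.
bash run_sft.sh --nnodes 1 --node_rank 0
```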

Did you ever solve this problem? I'm also doing LoRA training with qwen-vl and hit this ChildFailedError after training runs for a long time; the final message is likewise `traceback : Signal 11 (SIGSEGV) received by PID xxx`. To rule out GPU memory as the cause, I deliberately switched to the 2B model, enabled bf16, and compressed the frame rate, pixel count, and token count, so I've confirmed GPU memory is sufficient.

TimeLessLing avatar Dec 30 '24 06:12 TimeLessLing

> Did you ever solve this problem? I'm also doing LoRA training with qwen-vl and hit this ChildFailedError after training runs for a long time; the final message is likewise `traceback : Signal 11 (SIGSEGV) received by PID xxx`. To rule out GPU memory as the cause, I deliberately switched to the 2B model, enabled bf16, and compressed the frame rate, pixel count, and token count, so I've confirmed GPU memory is sufficient.

I'm running into this problem on a single machine too.

wyclike avatar Feb 04 '25 15:02 wyclike

> Did you ever solve this problem? I'm also doing LoRA training with qwen-vl and hit this ChildFailedError after training runs for a long time; the final message is likewise `traceback : Signal 11 (SIGSEGV) received by PID xxx`. To rule out GPU memory as the cause, I deliberately switched to the 2B model, enabled bf16, and compressed the frame rate, pixel count, and token count, so I've confirmed GPU memory is sufficient.

Did you manage to solve this problem?

kono-dada avatar Mar 12 '25 15:03 kono-dada

> Did you ever solve this problem? I'm also doing LoRA training with qwen-vl and hit this ChildFailedError after training runs for a long time; the final message is likewise `traceback : Signal 11 (SIGSEGV) received by PID xxx`. To rule out GPU memory as the cause, I deliberately switched to the 2B model, enabled bf16, and compressed the frame rate, pixel count, and token count, so I've confirmed GPU memory is sufficient.

> I'm running into this problem on a single machine too.

Did you manage to solve this problem?

kono-dada avatar Mar 12 '25 15:03 kono-dada