ms-swift

Error when using Streaming + Packing + resume_from_checkpoint

Open hertz-pj opened this issue 9 months ago • 3 comments

Describe the bug: Combining Streaming + Packing + resume_from_checkpoint raises an error. By the look of it, the failure happens while skipping the batches that were already trained. Error log:

[rank0]:   File "/usr/local/lib/python3.12/dist-packages/swift/cli/sft.py", line 7, in <module>
[rank0]:     sft_main()
[rank0]:   File "/usr/local/lib/python3.12/dist-packages/swift/llm/train/sft.py", line 281, in sft_main
[rank0]:     return SwiftSft(args).main()
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/usr/local/lib/python3.12/dist-packages/swift/llm/base.py", line 47, in main
[rank0]:     result = self.run()
[rank0]:              ^^^^^^^^^^
[rank0]:   File "/usr/local/lib/python3.12/dist-packages/swift/llm/train/sft.py", line 147, in run
[rank0]:     return self.train(trainer)
[rank0]:            ^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/usr/local/lib/python3.12/dist-packages/swift/llm/train/sft.py", line 207, in train
[rank0]:     trainer.train(trainer.args.resume_from_checkpoint)
[rank0]:   File "/usr/local/lib/python3.12/dist-packages/swift/trainers/mixin.py", line 321, in train
[rank0]:     res = super().train(*args, **kwargs)
[rank0]:           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/usr/local/lib/python3.12/dist-packages/transformers/trainer.py", line 2241, in train
[rank0]:     return inner_training_loop(
[rank0]:            ^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/usr/local/lib/python3.12/dist-packages/transformers/trainer.py", line 2482, in _inner_training_loop
[rank0]:     epoch_dataloader = skip_first_batches(epoch_dataloader, steps_trained_in_current_epoch)
[rank0]:                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/usr/local/lib/python3.12/dist-packages/accelerate/data_loader.py", line 1338, in skip_first_batches
[rank0]:     dataset = dataloader.dataset
[rank0]:               ^^^^^^^^^^^^^^^^^^
[rank0]: AttributeError: 'DataLoaderDispatcher' object has no attribute 'dataset'
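
The last frames show `skip_first_batches` reading `dataloader.dataset` on an object that does not provide that attribute. The sketch below is not the code of accelerate or ms-swift; `StreamingLoaderWrapper` and `skip_batches` are hypothetical names, used only to illustrate the failure mode and one iteration-only way of skipping batches, assuming the loader is a plain iterable.

```python
from itertools import islice


class StreamingLoaderWrapper:
    """Hypothetical stand-in for a dataloader wrapper that is iterable
    but does not expose a `.dataset` attribute."""

    def __init__(self, batches):
        self._batches = batches

    def __iter__(self):
        return iter(self._batches)


def skip_batches(loader, num_batches):
    """Skip the first `num_batches` batches by consuming them.

    Only iteration is used, so no `.dataset` attribute is required.
    The skipped batches are still produced, which costs time.
    """
    return islice(iter(loader), num_batches, None)


loader = StreamingLoaderWrapper([[0, 1], [2, 3], [4, 5], [6, 7]])

# Reading `.dataset` fails with the same kind of AttributeError as in the log.
try:
    loader.dataset
except AttributeError as exc:
    error_message = str(exc)

# Skipping by iteration still works on the wrapper.
remaining = list(skip_batches(loader, 2))
```

Skipping by iteration avoids any dependence on the wrapper's attributes, but for a streaming dataset it still has to read every skipped batch, so resuming at a late step can be slow.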

Launch script:

swift sft \
    --custom_register_path train/custom_model.py \
    --model $model_path \
    --model_type $model_type \
    --dataset  $train_data_path  \
    --val_dataset  $val_data_path  \
    --dataset_num_proc 1 \
    --train_type full \
    --torch_dtype bfloat16 \
    --num_train_epochs $epoch \
    --per_device_train_batch_size $batch_size \
    --per_device_eval_batch_size $batch_size \
    --learning_rate 1e-4 \
    --gradient_accumulation_steps 8 \
    --eval_steps 10000000 \
    --save_steps 20000 \
    --logging_steps 100 \
    --max_steps 10000000 \
    --max_length $max_length \
    --output_dir $output_dir \
    --warmup_ratio 0 \
    --packing true \
    --attn_impl flash_attn \
    --streaming true \
    --resume_from_checkpoint $checkpoint-80000 \
    --dataloader_num_workers 1 2>&1 | tee $output_dir/train.log

Your hardware and system info: torch==2.5.1, ms-swift==3.4.0

hertz-pj • May 05 '25 09:05

Is there a planned timeline for fixing this bug, or is there a workaround in the meantime?

hertz-pj • May 06 '25 06:05

For now, you can try --resume_only_model true

Jintao-Huang • May 06 '25 07:05

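> For now, you can try --resume_only_model true

Thanks for the reply, but that probably won't work: I need the optimizer state and the dataset progress to continue training.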

hertz-pj • May 06 '25 07:05

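> > For now, you can try --resume_only_model true
>
> Thanks for the reply, but that probably won't work: I need the optimizer state and the dataset progress to continue training.

If you don't mind training on some of the data again, you can add --ignore_data_skip true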

Byshev333 • Jun 15 '25 13:06
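
The trade-off behind `--ignore_data_skip true` can be illustrated with a small simulation. This is only a sketch under stated assumptions: `stream_batches` and `resume` are hypothetical helpers, not ms-swift or transformers functions, and the stream is assumed to yield batches in the same order on every run.

```python
from itertools import islice


def stream_batches(num_batches):
    """Hypothetical streaming source: yields batch ids in a fixed order."""
    for batch_id in range(num_batches):
        yield batch_id


def resume(num_batches, steps_done, steps_to_run, skip_seen_data):
    """Return the batch ids consumed after resuming at `steps_done` steps."""
    stream = stream_batches(num_batches)
    if skip_seen_data:
        # Fast-forward past the batches that were already trained on.
        stream = islice(stream, steps_done, None)
    # Without skipping, the stream simply restarts from its first batch.
    return list(islice(stream, steps_to_run))


with_skip = resume(num_batches=10, steps_done=4, steps_to_run=3, skip_seen_data=True)
without_skip = resume(num_batches=10, steps_done=4, steps_to_run=3, skip_seen_data=False)
```

In this simulation, resuming with skipping continues at batch 4, while resuming without skipping starts again at batch 0. The second case mirrors the "repeat some data" cost mentioned above: the optimizer state and step counter are restored, but the earliest batches are seen a second time.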

This issue has been inactive for over 3 months and will be automatically closed in 7 days. If this issue is still relevant, please reply to this message.

github-actions[bot] • Sep 16 '25 00:09

This issue has been automatically closed due to inactivity. If needed, it can be reopened.

github-actions[bot] • Oct 10 '25 00:10