Error when using Streaming + Packing + resume_from_checkpoint
Describe the bug
An error occurs when using Streaming + Packing + resume_from_checkpoint; it appears to happen while skipping the already-trained batches. Error log:
[rank0]: File "/usr/local/lib/python3.12/dist-packages/swift/cli/sft.py", line 7, in <module>
[rank0]: sft_main()
[rank0]: File "/usr/local/lib/python3.12/dist-packages/swift/llm/train/sft.py", line 281, in sft_main
[rank0]: return SwiftSft(args).main()
[rank0]: ^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/usr/local/lib/python3.12/dist-packages/swift/llm/base.py", line 47, in main
[rank0]: result = self.run()
[rank0]: ^^^^^^^^^^
[rank0]: File "/usr/local/lib/python3.12/dist-packages/swift/llm/train/sft.py", line 147, in run
[rank0]: return self.train(trainer)
[rank0]: ^^^^^^^^^^^^^^^^^^^
[rank0]: File "/usr/local/lib/python3.12/dist-packages/swift/llm/train/sft.py", line 207, in train
[rank0]: trainer.train(trainer.args.resume_from_checkpoint)
[rank0]: File "/usr/local/lib/python3.12/dist-packages/swift/trainers/mixin.py", line 321, in train
[rank0]: res = super().train(*args, **kwargs)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/usr/local/lib/python3.12/dist-packages/transformers/trainer.py", line 2241, in train
[rank0]: return inner_training_loop(
[rank0]: ^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/usr/local/lib/python3.12/dist-packages/transformers/trainer.py", line 2482, in _inner_training_loop
[rank0]: epoch_dataloader = skip_first_batches(epoch_dataloader, steps_trained_in_current_epoch)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/usr/local/lib/python3.12/dist-packages/accelerate/data_loader.py", line 1338, in skip_first_batches
[rank0]: dataset = dataloader.dataset
[rank0]: ^^^^^^^^^^^^^^^^^^
[rank0]: AttributeError: 'DataLoaderDispatcher' object has no attribute 'dataset'
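The failure mode can be reproduced in isolation: accelerate's skip_first_batches reads dataloader.dataset, but the DataLoaderDispatcher wrapper used for streaming (IterableDataset) inputs does not expose that attribute. A minimal, self-contained sketch (the class and function names here are hypothetical stand-ins, not the real accelerate code):

```python
class DispatcherLikeLoader:
    """Stand-in for DataLoaderDispatcher: iterates batches but has no .dataset."""

    def __init__(self, batches):
        self._batches = batches

    def __iter__(self):
        return iter(self._batches)


def skip_first_batches_sketch(dataloader, num_batches):
    # mirrors the failing line in accelerate/data_loader.py
    dataset = dataloader.dataset  # AttributeError for dispatcher-style loaders
    raise NotImplementedError("rest of skip_first_batches elided")


loader = DispatcherLikeLoader([[0], [1], [2]])
try:
    skip_first_batches_sketch(loader, 1)
except AttributeError as err:
    print(err)  # 'DispatcherLikeLoader' object has no attribute 'dataset'
```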
Launch script:
swift sft \
--custom_register_path train/custom_model.py \
--model $model_path \
--model_type $model_type \
--dataset $train_data_path \
--val_dataset $val_data_path \
--dataset_num_proc 1 \
--train_type full \
--torch_dtype bfloat16 \
--num_train_epochs $epoch \
--per_device_train_batch_size $batch_size \
--per_device_eval_batch_size $batch_size \
--learning_rate 1e-4 \
--gradient_accumulation_steps 8 \
--eval_steps 10000000 \
--save_steps 20000 \
--logging_steps 100 \
--max_steps 10000000 \
--max_length $max_length \
--output_dir $output_dir \
--warmup_ratio 0 \
--packing true \
--attn_impl flash_attn \
--streaming true \
--resume_from_checkpoint $checkpoint-80000 \
--dataloader_num_workers 1 2>&1 | tee $output_dir/train.log
Your hardware and system info
torch==2.5.1 ms-swift==3.4.0
Is there a plan for when this bug will be fixed, or is there some trick to work around it?
You can try --resume_only_model true for now.
Thanks for the reply, but that probably won't work here; resuming training needs the optimizer state and the dataset position.
If you don't mind rerunning some of the data, you can add --ignore_data_skip true.
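The reason --ignore_data_skip true sidesteps the crash is that the Trainer only calls skip_first_batches when that flag is off, so the failing dataloader.dataset access is never reached and the stream simply restarts, replaying some already-seen batches. A hypothetical sketch of that resume branch (not the actual transformers code):

```python
def prepare_epoch_dataloader(dataloader, steps_trained_in_current_epoch, ignore_data_skip):
    """Hypothetical sketch of the resume branch in transformers' Trainer."""
    if steps_trained_in_current_epoch > 0 and not ignore_data_skip:
        # this is the path that crashes for streaming loaders:
        # skip_first_batches() dereferences dataloader.dataset
        dataset = dataloader.dataset
        raise NotImplementedError("rest of skip_first_batches elided")
    # with ignore_data_skip=True we fall through: optimizer/scheduler state
    # is still restored from the checkpoint, but data restarts from the top
    return dataloader


class StreamingLoader:
    """Stand-in streaming dataloader with no .dataset attribute."""

    def __iter__(self):
        return iter([[0], [1], [2]])


dl = StreamingLoader()
assert prepare_epoch_dataloader(dl, 80000, ignore_data_skip=True) is dl
```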
This issue has been inactive for over 3 months and will be automatically closed in 7 days. If this issue is still relevant, please reply to this message.
This issue has been automatically closed due to inactivity. If needed, it can be reopened.