ms-swift

How are steps calculated?

Open toufunao opened this issue 9 months ago • 3 comments

I trained with the script below. The dataset has about 33,000 samples, with per_device_train_batch_size=16, gradient_accumulation_steps=32, num_train_epochs=3, and 4 GPUs.

nproc_per_node=4

NPROC_PER_NODE=$nproc_per_node \
CUDA_VISIBLE_DEVICES=0,1,2,3 \
swift pt \
    --model Qwen/Qwen2.5-7B \
    --train_type full \
    --dataset $CUSTOM_DATASET \
    --torch_dtype bfloat16 \
    --num_train_epochs 3 \
    --per_device_train_batch_size 16 \
    --per_device_eval_batch_size 1 \
    --learning_rate 1e-5 \
    --gradient_accumulation_steps $(expr 128 / $nproc_per_node) \
    --packing true \
    --eval_steps 10 \
    --save_steps 50 \
    --save_total_limit 2 \
    --logging_steps 5 \
    --deepspeed zero3 \
    --max_length 8192 \
    --warmup_ratio 0.05 \
    --save_only_model true \
    --output_dir XXXXX

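By a straightforward calculation the total should be 33000*3/16/32/4 ≈ 48 steps, but the progress bar actually shows 193 steps. How does ms-swift compute the number of steps automatically?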
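The manual estimate above can be reproduced with a short calculation. This is only a minimal sketch: `estimate_steps` is a hypothetical helper, not an ms-swift function, and it assumes the simplified formula samples × epochs / (per-device batch × accumulation steps × data-parallel ranks). The real trainer rounds per epoch and the dataset size is approximate, so small differences are expected.

```python
def estimate_steps(num_samples, epochs, per_device_batch, grad_accum, data_parallel_ranks):
    """Rough optimizer-step estimate (hypothetical helper, not ms-swift code)."""
    # Samples consumed by one optimizer step across all data-parallel ranks
    samples_per_step = per_device_batch * grad_accum * data_parallel_ranks
    return num_samples * epochs // samples_per_step


# Values taken from the post: ~33000 samples, 3 epochs, batch 16, accumulation 32, 4 GPUs
expected = estimate_steps(33000, 3, 16, 32, 4)
print(expected)  # 48

# Samples per optimizer step implied by the observed 193 steps
implied = 33000 * 3 / 193
print(round(implied))  # 513, close to 16 * 32 = 512
```

Read this way, the observed 193 steps correspond to roughly 512 samples per step, which is 16 × 32 without the additional factor of 4 for the GPUs.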

toufunao avatar Apr 22 '25 08:04 toufunao

You have packing enabled.

Jintao-Huang avatar Apr 23 '25 06:04 Jintao-Huang

Alternatively, check whether NPROC_PER_NODE is set correctly.

Jintao-Huang avatar Apr 23 '25 06:04 Jintao-Huang

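Re "You have packing enabled": thanks for the correction. I just rechecked the launch script: it did not use packing; it used sequence_parallel for training. I also verified that NPROC_PER_NODE is set correctly, and world_size in the log is 4. However, the step count still differs from the manually computed value.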

nproc_per_node=4
NPROC_PER_NODE=$nproc_per_node \
CUDA_VISIBLE_DEVICES=0,1,2,3 \
swift pt \
    --model Qwen/Qwen2.5-7B \
    --train_type full \
    --dataset $CUSTOM_DATASET \
    --torch_dtype bfloat16 \
    --num_train_epochs 3 \
    --sequence_parallel 4 \
    --per_device_train_batch_size 16 \
    --per_device_eval_batch_size 1 \
    --learning_rate 1e-5 \
    --gradient_accumulation_steps $(expr 128 / $nproc_per_node) \
    --eval_steps 10 \
    --save_steps 50 \
    --save_total_limit 2 \
    --logging_steps 5 \
    --deepspeed zero3 \
    --max_length 8192 \
    --warmup_ratio 0.05 \
    --save_only_model true \
    --output_dir XXXXX
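One possible reading of the discrepancy, offered as an assumption that the thread never confirms: if all 4 GPUs form a single sequence-parallel group, they work on the same samples, so the effective data-parallel size would be world_size / sequence_parallel = 1 rather than 4. The sketch below uses a hypothetical helper (`data_parallel_size`, not an ms-swift API) and the approximate dataset size from the post.

```python
def data_parallel_size(world_size, sequence_parallel):
    """Effective data-parallel size under the stated assumption (hypothetical helper)."""
    # Assumption: ranks inside one sequence-parallel group share the same samples,
    # so only world_size / sequence_parallel groups consume distinct data.
    if world_size % sequence_parallel != 0:
        raise ValueError("world_size must be divisible by sequence_parallel")
    return world_size // sequence_parallel


dp = data_parallel_size(world_size=4, sequence_parallel=4)

# Simplified step estimate: samples * epochs / (batch * accumulation * dp)
steps = 33000 * 3 // (16 * 32 * dp)
print(dp, steps)  # 1 193
```

Under that assumption the estimate lands on the 193 steps shown in the progress bar, but the exact count still depends on how the trainer rounds each epoch.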

toufunao avatar Apr 23 '25 07:04 toufunao

This issue has been inactive for over 3 months and will be automatically closed in 7 days. If this issue is still relevant, please reply to this message.

github-actions[bot] avatar Jul 23 '25 00:07 github-actions[bot]

This issue has been automatically closed due to inactivity. If needed, it can be reopened.

github-actions[bot] avatar Aug 03 '25 00:08 github-actions[bot]