GSPO + LoRA training is 4x slower than GSPO + full fine-tuning
System Info
Full fine-tuning:
Training Progress: 51%|█████ | 204/400 [2:26:32<2:20:47, 43.10s/it]

LoRA:
Training Progress: 1%| | 4/400 [10:52<17:55:33, 162.96s/it]
The training script is identical; the only additions for the LoRA run are:
actor_rollout_ref.model.lora_rank=16 \
actor_rollout_ref.model.lora_alpha=32 \
Information
- [ ] The official example scripts
- [ ] My own modified scripts
Tasks
- [ ] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
- [ ] My own task or dataset (give details below)
Reproduction
Training script:

set -xeuo pipefail

export NCCL_IBEXT_DISABLE=1
export NCCL_NVLS_ENABLE=1
export NCCL_IB_HCA=mlx5
export UCX_NET_DEVICES=mlx5_0:1,mlx5_1:1,mlx5_2:1,mlx5_3:1,mlx5_4:1,mlx5_5:1,mlx5_6:1,mlx5_7:1
export GPUS_PER_NODE=8
export VLLM_ATTENTION_BACKEND=FLASH_ATTN
export RAY_LOGGING_LEVEL=DEBUG
export HYDRA_FULL_ERROR=1
python3 -m verl.trainer.main_ppo \
    algorithm.adv_estimator=grpo \
    actor_rollout_ref.actor.policy_loss.loss_mode=gspo \
    data.train_files=/app/train/input_data/train.parquet \
    data.val_files=/app/train/input_data/test.parquet \
    data.filter_overlong_prompts=true \
    data.train_batch_size=512 \
    data.max_prompt_length=2048 \
    data.max_response_length=1024 \
    actor_rollout_ref.rollout.n=5 \
    algorithm.use_kl_in_reward=false \
    algorithm.kl_ctrl.kl_coef=0.0 \
    actor_rollout_ref.actor.kl_loss_coef=0.0 \
    actor_rollout_ref.actor.clip_ratio_low=0.0003 \
    actor_rollout_ref.actor.clip_ratio_high=0.0004 \
    actor_rollout_ref.model.use_remove_padding=true \
    actor_rollout_ref.actor.use_dynamic_bsz=true \
    actor_rollout_ref.ref.log_prob_use_dynamic_bsz=true \
    actor_rollout_ref.rollout.log_prob_use_dynamic_bsz=true \
    actor_rollout_ref.actor.ppo_max_token_len_per_gpu=20480 \
    actor_rollout_ref.ref.log_prob_max_token_len_per_gpu=30720 \
    actor_rollout_ref.rollout.log_prob_max_token_len_per_gpu=30720 \
    actor_rollout_ref.rollout.name=vllm \
    actor_rollout_ref.rollout.mode=sync \
    actor_rollout_ref.model.lora_rank=16 \
    actor_rollout_ref.model.lora_alpha=32 \
    actor_rollout_ref.model.path=/models/Qwen/Qwen2.5-7B-Instruct \
    actor_rollout_ref.model.enable_gradient_checkpointing=true \
    actor_rollout_ref.actor.optim.lr=1e-6 \
    actor_rollout_ref.actor.optim.lr_warmup_steps_ratio=0.05 \
    actor_rollout_ref.actor.optim.weight_decay=0.1 \
    actor_rollout_ref.actor.ppo_mini_batch_size=128 \
    actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=8 \
    actor_rollout_ref.actor.fsdp_config.param_offload=true \
    actor_rollout_ref.actor.fsdp_config.optimizer_offload=true \
    actor_rollout_ref.actor.grad_clip=1.0 \
    actor_rollout_ref.actor.loss_agg_mode=seq-mean-token-mean \
    actor_rollout_ref.rollout.gpu_memory_utilization=0.8 \
    actor_rollout_ref.rollout.tensor_model_parallel_size=2 \
    actor_rollout_ref.rollout.max_num_batched_tokens=10240 \
    actor_rollout_ref.ref.fsdp_config.param_offload=true \
    actor_rollout_ref.actor.entropy_checkpointing=true \
    reward_model.reward_manager=dapo \
    +reward_model.reward_kwargs.overlong_buffer_cfg.enable=false \
    +reward_model.reward_kwargs.overlong_buffer_cfg.len=4096 \
    +reward_model.reward_kwargs.overlong_buffer_cfg.penalty_factor=1.0 \
    +reward_model.reward_kwargs.overlong_buffer_cfg.log=false \
    +reward_model.reward_kwargs.max_resp_len=8192 \
    custom_reward_function.path=/app/train/reward_func/reward.py \
    custom_reward_function.name=my_reward_fn \
    trainer.rollout_data_dir=/app/train/train_record/rlhf/v5_gspo_lora/dump/rollout \
    trainer.validation_data_dir=/app/train/train_record/rlhf/v5_gspo_lora/dump/val \
    trainer.logger='["console","tensorboard"]' \
    trainer.project_name="RL-GSPO" \
    trainer.experiment_name="gspo" \
    trainer.n_gpus_per_node=8 \
    trainer.nnodes=1 \
    trainer.save_freq=20 \
    trainer.test_freq=5 \
    trainer.total_epochs=100
Expected behavior
I expect the two runs to have comparable speed.
With v0.6.0, generation is very slow, but v0.5.0 behaves normally (tested with GRPO).
This is not a bug; generation with LoRA is expected to be slow. A better approach for LoRA RL is to merge the LoRA weights back into the base model and perform normal full-weight generation.
Thank you for your reply. Why is v0.5.0 faster for me, then? Also, is there a configuration option to merge the LoRA weights back into the base model? Merging is quite time-consuming, isn't it?
> This is not a bug; generation with LoRA is expected to be slow. A better approach for LoRA RL is to merge the LoRA weights back into the base model and perform normal full-weight generation.
This is a great idea. Are there any plans to implement this feature?
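For context, "merging the LoRA weights back into the base model" can be done with Hugging Face PEFT. The sketch below only illustrates the idea; it is not an existing verl configuration option, and the adapter and output paths are hypothetical:

```python
# Minimal sketch, assuming the trained adapter was saved as a standard PEFT
# checkpoint. Folding the low-rank update (scaled by lora_alpha / lora_rank)
# into the dense weights lets vLLM generate with plain full weights, avoiding
# the per-token LoRA overhead during rollout.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

BASE_PATH = "/models/Qwen/Qwen2.5-7B-Instruct"      # base model from the script above
ADAPTER_PATH = "/path/to/lora_adapter_checkpoint"   # hypothetical adapter checkpoint
MERGED_PATH = "/path/to/merged_model"               # hypothetical output directory

base = AutoModelForCausalLM.from_pretrained(BASE_PATH, torch_dtype=torch.bfloat16)
model = PeftModel.from_pretrained(base, ADAPTER_PATH)

# merge_and_unload() adds the B @ A update into each wrapped linear layer and
# strips the PEFT wrappers, returning a plain transformers model.
merged = model.merge_and_unload()

merged.save_pretrained(MERGED_PATH)
AutoTokenizer.from_pretrained(BASE_PATH).save_pretrained(MERGED_PATH)
```

The merge itself is a single low-rank matrix addition per adapted layer, so it is typically far cheaper than the generation slowdown observed above, though doing it every rollout step would add weight-sync overhead.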