GSPO + LoRA training is 4x slower than GSPO + full fine-tuning
System Info
Full fine-tuning:
Training Progress: 51%|█████ | 204/400 [2:26:32<2:20:47, 43.10s/it]

LoRA:
Training Progress: 1%| | 4/400 [10:52<17:55:33, 162.96s/it]
The training script is identical; the only additions for the LoRA run are:
actor_rollout_ref.model.lora_rank=16 \
actor_rollout_ref.model.lora_alpha=32 \
Information
- [ ] The official example scripts
- [ ] My own modified scripts
Tasks
- [ ] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
- [ ] My own task or dataset (give details below)
Reproduction
Training script:

set -xeuo pipefail

export NCCL_IBEXT_DISABLE=1
export NCCL_NVLS_ENABLE=1
export NCCL_IB_HCA=mlx5
export UCX_NET_DEVICES=mlx5_0:1,mlx5_1:1,mlx5_2:1,mlx5_3:1,mlx5_4:1,mlx5_5:1,mlx5_6:1,mlx5_7:1
export GPUS_PER_NODE=8
export VLLM_ATTENTION_BACKEND=FLASH_ATTN
export RAY_LOGGING_LEVEL=DEBUG
export HYDRA_FULL_ERROR=1
python3 -m verl.trainer.main_ppo \
    algorithm.adv_estimator=grpo \
    actor_rollout_ref.actor.policy_loss.loss_mode=gspo \
    data.train_files=/app/train/input_data/train.parquet \
    data.val_files=/app/train/input_data/test.parquet \
    data.filter_overlong_prompts=true \
    data.train_batch_size=512 \
    data.max_prompt_length=2048 \
    data.max_response_length=1024 \
    actor_rollout_ref.rollout.n=5 \
    algorithm.use_kl_in_reward=false \
    algorithm.kl_ctrl.kl_coef=0.0 \
    actor_rollout_ref.actor.kl_loss_coef=0.0 \
    actor_rollout_ref.actor.clip_ratio_low=0.0003 \
    actor_rollout_ref.actor.clip_ratio_high=0.0004 \
    actor_rollout_ref.model.use_remove_padding=true \
    actor_rollout_ref.actor.use_dynamic_bsz=true \
    actor_rollout_ref.ref.log_prob_use_dynamic_bsz=true \
    actor_rollout_ref.rollout.log_prob_use_dynamic_bsz=true \
    actor_rollout_ref.actor.ppo_max_token_len_per_gpu=20480 \
    actor_rollout_ref.ref.log_prob_max_token_len_per_gpu=30720 \
    actor_rollout_ref.rollout.log_prob_max_token_len_per_gpu=30720 \
    actor_rollout_ref.rollout.name=vllm \
    actor_rollout_ref.rollout.mode=sync \
    actor_rollout_ref.model.lora_rank=16 \
    actor_rollout_ref.model.lora_alpha=32 \
    actor_rollout_ref.model.path=/models/Qwen/Qwen2.5-7B-Instruct \
    actor_rollout_ref.model.enable_gradient_checkpointing=true \
    actor_rollout_ref.actor.optim.lr=1e-6 \
    actor_rollout_ref.actor.optim.lr_warmup_steps_ratio=0.05 \
    actor_rollout_ref.actor.optim.weight_decay=0.1 \
    actor_rollout_ref.actor.ppo_mini_batch_size=128 \
    actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=8 \
    actor_rollout_ref.actor.fsdp_config.param_offload=true \
    actor_rollout_ref.actor.fsdp_config.optimizer_offload=true \
    actor_rollout_ref.actor.grad_clip=1.0 \
    actor_rollout_ref.actor.loss_agg_mode=seq-mean-token-mean \
    actor_rollout_ref.rollout.gpu_memory_utilization=0.8 \
    actor_rollout_ref.rollout.tensor_model_parallel_size=2 \
    actor_rollout_ref.rollout.max_num_batched_tokens=10240 \
    actor_rollout_ref.ref.fsdp_config.param_offload=true \
    actor_rollout_ref.actor.entropy_checkpointing=true \
    reward_model.reward_manager=dapo \
    +reward_model.reward_kwargs.overlong_buffer_cfg.enable=false \
    +reward_model.reward_kwargs.overlong_buffer_cfg.len=4096 \
    +reward_model.reward_kwargs.overlong_buffer_cfg.penalty_factor=1.0 \
    +reward_model.reward_kwargs.overlong_buffer_cfg.log=false \
    +reward_model.reward_kwargs.max_resp_len=8192 \
    custom_reward_function.path=/app/train/reward_func/reward.py \
    custom_reward_function.name=my_reward_fn \
    trainer.rollout_data_dir=/app/train/train_record/rlhf/v5_gspo_lora/dump/rollout \
    trainer.validation_data_dir=/app/train/train_record/rlhf/v5_gspo_lora/dump/val \
    trainer.logger='["console","tensorboard"]' \
    trainer.project_name="RL-GSPO" \
    trainer.experiment_name="gspo" \
    trainer.n_gpus_per_node=8 \
    trainer.nnodes=1 \
    trainer.save_freq=20 \
    trainer.test_freq=5 \
    trainer.total_epochs=100
Expected behavior
I expect the two runs to have comparable speed.
With v0.6.0, generation is very slow, but v0.5.0 behaves normally (tested with GRPO).
This is not a bug; generation with LoRA is expected to be slow. A better approach for LoRA RL is to merge the LoRA weights back into the base model and perform normal full-weight generation.
Thank you for your reply. Why is v0.5.0 faster for me, then? Also, is there a configuration option to merge the LoRA weights back into the base model? Merging is quite time-consuming, isn't it?
> This is not a bug; generation with LoRA is expected to be slow. A better approach for LoRA RL is to merge the LoRA weights back into the base model and perform normal full-weight generation.
This is a great idea. Are there any plans to implement this feature?
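For context, "merging the LoRA weights back into the base model" can be done with Hugging Face PEFT. The sketch below only illustrates the idea; it is not an existing verl configuration option, and the adapter and output paths are hypothetical:

```python
# Minimal sketch, assuming the trained adapter was saved as a standard PEFT
# checkpoint. Folding the low-rank update (scaled by lora_alpha / lora_rank)
# into the dense weights lets vLLM generate with plain full weights, avoiding
# the per-token LoRA overhead during rollout.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

BASE_PATH = "/models/Qwen/Qwen2.5-7B-Instruct"      # base model from the script above
ADAPTER_PATH = "/path/to/lora_adapter_checkpoint"   # hypothetical adapter checkpoint
MERGED_PATH = "/path/to/merged_model"               # hypothetical output directory

base = AutoModelForCausalLM.from_pretrained(BASE_PATH, torch_dtype=torch.bfloat16)
model = PeftModel.from_pretrained(base, ADAPTER_PATH)

# merge_and_unload() adds the B @ A update into each wrapped linear layer and
# strips the PEFT wrappers, returning a plain transformers model.
merged = model.merge_and_unload()

merged.save_pretrained(MERGED_PATH)
AutoTokenizer.from_pretrained(BASE_PATH).save_pretrained(MERGED_PATH)
```

The merge itself is a single low-rank matrix addition per adapted layer, so it is typically far cheaper than the generation slowdown observed above, though doing it every rollout step would add weight-sync overhead.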