Qi Penghui
It's about what the PPO loss is optimizing, not about any feature or reward shaping. So it's actually an implementation bug relative to the mathematical derivation.
Is this config correct? I don't think it can fit on 32 GPUs.
I'm not sure whether recent changes broke the behavior you described. I observed this error: `world_size ({world_size}) is not divisible by expert_tensor_model_pipeline_parallel size (128)`. Adding `actor_rollout_ref.actor.megatron.expert_tensor_parallel_size=1` fixes it.
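For reference, a minimal sketch of the divisibility check behind that error. This is a hypothetical reconstruction, assuming the "expert_tensor_model_pipeline_parallel size" is the product of expert tensor parallel, expert model parallel, and pipeline parallel sizes (as in Megatron-style group initialization); the helper name and the example parallel sizes are illustrative, not taken from the actual code.

```python
# Hypothetical sketch of a Megatron-style parallel-group validation.
# Assumption: the reported size is etp * ep * pp.
def check_expert_parallel(world_size: int, etp: int, ep: int, pp: int) -> int:
    """Raise if world_size is not divisible by the expert parallel group size."""
    group = etp * ep * pp
    if world_size % group != 0:
        raise ValueError(
            f"world_size ({world_size}) is not divisible by "
            f"expert_tensor_model_pipeline_parallel size ({group})"
        )
    return group

# With 32 GPUs: etp=8, ep=8, pp=2 gives a group size of 128, which
# triggers the error; overriding etp=1 gives 16, which divides 32.
check_expert_parallel(32, 1, 8, 2)  # passes (group size 16)
```

This would explain why forcing `expert_tensor_parallel_size=1` resolves the error: it shrinks the group-size product below the world size.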