ZZK

Results: 7 comments by ZZK

> [@HillDing](https://github.com/HillDing) Which versions of verl and Megatron are you using? We recently fixed the logic to use mbridge to save checkpoints; could you try the new `megatron_checkpoint_manager`? Actually, after...

> The script can also run on 4 H100 nodes, provided each node has sufficient CPU memory (>1.5 TB). We look forward to your feedback.

Sorry, I tried training on...

> The `actor_rollout_ref.rollout.gpu_memory_utilization` is too high in your script. Please set it to a lower value and test again. Maybe 0.7?

Thanks, it's running! But when I save the ckpt, I...
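As a minimal sketch of the suggestion quoted above, the snippet below renders the lower setting as a dotted `key=value` override. It assumes the training script accepts Hydra-style command-line overrides. The helper `to_overrides` is hypothetical and not part of verl; only the key name and the value 0.7 come from the comment.

```python
def to_overrides(settings):
    """Render a flat dict as Hydra-style `key=value` overrides.

    Hypothetical helper for illustration; not part of verl.
    """
    return [f"{key}={value}" for key, value in settings.items()]


# Key and value are the ones quoted in the comment; 0.7 is the suggested
# starting point, not a tuned number.
overrides = to_overrides(
    {"actor_rollout_ref.rollout.gpu_memory_utilization": 0.7}
)

# Print the argument to append to the launch command.
print(" ".join(overrides))
```

The printed string can be appended to the existing launch command in place of the higher value.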

> I see no recompute options in the scripts; maybe you can try enabling full_recompute. See the deepseek script for how to enable full_recompute.
>
> Also remember to adopt...

Device: 4 × H100 (80 GB), CPU memory: 1.7 TB. Using the official script to run the 235B MoE. For CUDA OOM: we can reduce the batch size to 2 or 1, and set --balance_batch false...

> > > The `actor_rollout_ref.rollout.gpu_memory_utilization` is too high in your script. Please set to a lower value and test again. Maybe 0.7?
> >
> > Thanks, it's...