
Does verl support breakpoint retraining?

Open · DavideHe opened this issue 1 year ago · 6 comments

After checking the project, I could not find any code that saves the optimizer state, only the model parameters.

DavideHe · Jan 27 '25

Hi @DavideHe, we have a checkpoint manager that saves the optimizer state and enables breakpoint retraining internally. We will open a PR soon.
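
For context, breakpoint retraining requires persisting the optimizer state alongside the model weights. A minimal PyTorch sketch of that pattern (illustrative names only, not verl's actual checkpoint-manager API):

```python
import torch

def save_checkpoint(path, model, optimizer, step):
    # Persist model weights AND optimizer state so training can
    # resume exactly where it stopped (breakpoint retraining).
    torch.save({
        "step": step,
        "model": model.state_dict(),
        "optimizer": optimizer.state_dict(),
    }, path)

def load_checkpoint(path, model, optimizer):
    # Restore both states; return the step to resume from.
    ckpt = torch.load(path, map_location="cpu")
    model.load_state_dict(ckpt["model"])
    optimizer.load_state_dict(ckpt["optimizer"])
    return ckpt["step"]
```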

PeterSH6 · Jan 27 '25

> Hi @DavideHe, we have a checkpoint manager that saves the optimizer state and enables breakpoint retraining internally. We will open a PR soon.

Wow, looking forward to the update!!

DavideHe · Jan 28 '25

Hi @PeterSH6, is there a timeline for when this will be added? It would be super helpful for those working on SLURM with limited wall time!

jaehunjung1 · Feb 03 '25

> Hi @DavideHe, we have a checkpoint manager that saves the optimizer state and enables breakpoint retraining internally. We will open a PR soon.

Also, does the verl framework run every model (policy, ref, reward, value, vllm_policy) on all GPUs?

DavideHe · Feb 06 '25

@jaehunjung1 @DavideHe, breakpoint retraining was added in #222. You can try this feature now.
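
To illustrate how such a feature is typically used, here is a hypothetical resume loop built on the save/load helpers sketched above (all names here are placeholders; see #222 for verl's actual interface):

```python
import os

# Hypothetical names: ckpt_path, total_steps, save_interval, and
# train_one_step are illustrative placeholders, not verl's API.
start_step = 0
if os.path.exists(ckpt_path):
    # Resume from the last saved breakpoint instead of step 0.
    start_step = load_checkpoint(ckpt_path, model, optimizer)

for step in range(start_step, total_steps):
    train_one_step(model, optimizer)
    if (step + 1) % save_interval == 0:
        save_checkpoint(ckpt_path, model, optimizer, step + 1)
```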

PeterSH6 · Feb 09 '25

> Also, does the verl framework run every model (policy, ref, reward, value, vllm_policy) on all GPUs?

@DavideHe, for the main_ppo.py entry point, yes.

The split placement example, by contrast, puts the actor/ref on one set of GPUs and the rm and critic on the other set. You can customize your placement by following that example.

That said, we recommend colocated placement for most cases.
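
A minimal Ray sketch of the two placement strategies, assuming 8 GPUs (this uses Ray placement groups directly for illustration; verl's own resource-pool abstractions may differ):

```python
import ray
from ray.util.placement_group import placement_group

ray.init()

# Colocated placement (recommended): every role (policy, ref, reward,
# value, rollout) shares the same pool of GPU bundles.
colocated_pg = placement_group([{"GPU": 1}] * 8, strategy="PACK")
ray.get(colocated_pg.ready())  # block until the GPUs are reserved

# Alternative, split placement: actor/ref on one set of GPUs and
# rm/critic on another (the approach of the split placement example).
# actor_ref_pg = placement_group([{"GPU": 1}] * 4, strategy="PACK")
# rm_critic_pg = placement_group([{"GPU": 1}] * 4, strategy="PACK")
```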

PeterSH6 · Feb 09 '25

Checkpointing is available in the v0.2 release.

eric-haibin-lin · Feb 23 '25