
Does verl support breakpoint retraining?

Open · DavideHe opened this issue 1 year ago · 6 comments

After checking the project, I could not find any code that saves the optimizer state, only the model parameters.

DavideHe · Jan 27 '25

Hi @DavideHe, we have a checkpoint manager that saves the optimizer state and enables breakpoint retraining internally. We will open a PR soon.
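
For context, breakpoint retraining requires persisting the optimizer state alongside the model weights. A minimal PyTorch sketch of that pattern (illustrative names only, not verl's actual checkpoint-manager API):

```python
import torch

def save_checkpoint(path, model, optimizer, step):
    # Persist model weights AND optimizer state so training can
    # resume exactly where it stopped (breakpoint retraining).
    torch.save({
        "step": step,
        "model": model.state_dict(),
        "optimizer": optimizer.state_dict(),
    }, path)

def load_checkpoint(path, model, optimizer):
    # Restore both states; return the step to resume from.
    ckpt = torch.load(path, map_location="cpu")
    model.load_state_dict(ckpt["model"])
    optimizer.load_state_dict(ckpt["optimizer"])
    return ckpt["step"]
```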

PeterSH6 · Jan 27 '25

> Hi @DavideHe, we have a checkpoint manager that saves the optimizer state and enables breakpoint retraining internally. We will open a PR soon.

Wow, looking forward to the update!!

DavideHe · Jan 28 '25

Hi @PeterSH6, is there a timeline for when this will be added? It would be super helpful for those working on SLURM with limited wall time!

jaehunjung1 · Feb 03 '25

> Hi @DavideHe, we have a checkpoint manager that saves the optimizer state and enables breakpoint retraining internally. We will open a PR soon.

Also, does the verl framework run every model (policy, ref, reward, value, vllm_policy) on all GPUs?

DavideHe · Feb 06 '25

@jaehunjung1 @DavideHe, breakpoint retraining was added in #222. You can try this feature now.
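
To illustrate how such a feature is typically used, here is a hypothetical resume loop built on the save/load helpers sketched above (all names here are placeholders; see #222 for verl's actual interface):

```python
import os

# Hypothetical names: ckpt_path, total_steps, save_interval, and
# train_one_step are illustrative placeholders, not verl's API.
start_step = 0
if os.path.exists(ckpt_path):
    # Resume from the last saved breakpoint instead of step 0.
    start_step = load_checkpoint(ckpt_path, model, optimizer)

for step in range(start_step, total_steps):
    train_one_step(model, optimizer)
    if (step + 1) % save_interval == 0:
        save_checkpoint(ckpt_path, model, optimizer, step + 1)
```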

PeterSH6 · Feb 09 '25

> Also, does the verl framework run every model (policy, ref, reward, value, vllm_policy) on all GPUs?

@DavideHe, for the main_ppo.py entry point, yes.

The split placement example, by contrast, puts the actor/ref on one set of GPUs and the rm and critic on the other set. You can customize your placement by following that example.

That said, we recommend colocated placement for most cases.
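
A minimal Ray sketch of the two placement strategies, assuming 8 GPUs (this uses Ray placement groups directly for illustration; verl's own resource-pool abstractions may differ):

```python
import ray
from ray.util.placement_group import placement_group

ray.init()

# Colocated placement (recommended): every role (policy, ref, reward,
# value, rollout) shares the same pool of GPU bundles.
colocated_pg = placement_group([{"GPU": 1}] * 8, strategy="PACK")
ray.get(colocated_pg.ready())  # block until the GPUs are reserved

# Alternative, split placement: actor/ref on one set of GPUs and
# rm/critic on another (the approach of the split placement example).
# actor_ref_pg = placement_group([{"GPU": 1}] * 4, strategy="PACK")
# rm_critic_pg = placement_group([{"GPU": 1}] * 4, strategy="PACK")
```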

PeterSH6 · Feb 09 '25

Checkpointing is available in the v0.2 release.

eric-haibin-lin · Feb 23 '25