add fsdp sft checkpoint manager

Open MaxwellJryao opened this issue 8 months ago • 2 comments

Checklist Before Starting

  • [x] Search for similar PR(s).

What does this PR do?

Add a checkpoint manager that supports saving and loading checkpoints for the FSDP SFT trainer.

High-Level Design

Demonstrate the high-level design if this PR is complex.

Specific Changes

Use fsdp_checkpoint_manager as the checkpoint manager.
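
As a rough illustration of the save/load/resume behavior a checkpoint manager provides to a trainer, here is a minimal sketch. It is not the fsdp_checkpoint_manager implementation from this PR: the class name, method names, and the global_step_N directory layout are all hypothetical, and it uses only the Python standard library rather than FSDP state dicts.

```python
import json
import os
import tempfile


class ToyCheckpointManager:
    """Hypothetical stand-in for illustration; not verl's FSDP checkpoint manager."""

    PREFIX = "global_step_"  # assumed directory naming, chosen for this sketch

    def __init__(self, root):
        self.root = root
        os.makedirs(root, exist_ok=True)

    def _path(self, step):
        return os.path.join(self.root, f"{self.PREFIX}{step}", "state.json")

    def save(self, step, state):
        path = self._path(step)
        os.makedirs(os.path.dirname(path), exist_ok=True)
        # Write to a temporary file and rename it, so an interrupted save
        # never leaves a partially written checkpoint at the final path.
        tmp = path + ".tmp"
        with open(tmp, "w") as f:
            json.dump(state, f)
        os.replace(tmp, path)

    def latest_step(self):
        # Return the highest step with a complete checkpoint, or None.
        steps = []
        for name in os.listdir(self.root):
            suffix = name[len(self.PREFIX):]
            if name.startswith(self.PREFIX) and suffix.isdigit():
                if os.path.exists(self._path(int(suffix))):
                    steps.append(int(suffix))
        return max(steps) if steps else None

    def load(self, step):
        with open(self._path(step)) as f:
            return json.load(f)


# Usage: save two checkpoints, then resume from the most recent one.
manager = ToyCheckpointManager(tempfile.mkdtemp())
manager.save(10, {"global_step": 10, "lr": 0.1})
manager.save(20, {"global_step": 20, "lr": 0.05})
resume_step = manager.latest_step()
resumed_state = manager.load(resume_step)
```

A real FSDP manager would additionally handle sharded model and optimizer state across ranks; the sketch only shows the bookkeeping around step directories and resuming from the latest one.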

API

Demonstrate how the API changes if any.

Usage Example

bash examples/sft/gsm8k/run_qwen_05.sh

Test

For changes that cannot be tested by CI (e.g., algorithm implementation, new model support), validate by experiment(s) and show results such as training curve plots, evaluation results, etc.

Additional Info.

  • Issue Number: Fixes issue # or discussion # if any.
  • Training: [Note which backend this PR will affect: FSDP, Megatron, both, or none]
  • Inference: [Note which backend this PR will affect: vLLM, SGLang, both, or none]

Checklist Before Submitting

  • [x] Read the Contribute Guide.
  • [x] Apply pre-commit checks.
  • [ ] Add [BREAKING] to the PR title if it breaks any API.
  • [ ] Update the documentation about your changes in the docs.
  • [ ] Add CI test(s) if necessary.

MaxwellJryao avatar May 12 '25 13:05 MaxwellJryao

CLA assistant check
All committers have signed the CLA.

CLAassistant avatar May 12 '25 13:05 CLAassistant

@MaxwellJryao Could you rebase and look into the bug please? I can retrigger CI when you're finished~

ETOgaosion avatar May 24 '25 05:05 ETOgaosion

Hi @MaxwellJryao

I'm currently working on improving checkpoint resume as well (see PR #2292), and the maintainer suggested checking if we could combine parts of our implementations. Would you be okay with me cherry-picking some implementations? Or are you still actively maintaining that PR?

Happy to discuss!

Pursuer-Hsf avatar Jul 02 '25 13:07 Pursuer-Hsf