add fsdp sft checkpoint manager
Checklist Before Starting
- [x] Search for similar PR(s).
What does this PR do?
Add a checkpoint manager to support saving and loading checkpoints in the FSDP SFT trainer.
High-Level Design
Demonstrate the high-level design if this PR is complex.
Specific Changes
Use fsdp_checkpoint_manager as the checkpoint manager.
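For reviewers who want a picture of the save/resume flow, here is a minimal, framework-free sketch. It is not the code in this PR and not the fsdp_checkpoint_manager API. The class name `SimpleCheckpointManager`, the `global_step_<N>` directory layout, the `state.json` file, and the `max_to_keep` parameter are all assumptions made for illustration, and a plain JSON dict stands in for the model, optimizer, and scheduler state.

```python
import json
import re
import shutil
import tempfile
from pathlib import Path


class SimpleCheckpointManager:
    """Hypothetical stand-in for a checkpoint manager (illustration only)."""

    def __init__(self, root, max_to_keep=2):
        if max_to_keep < 1:
            raise ValueError("max_to_keep must be >= 1")
        self.root = Path(root)
        self.max_to_keep = max_to_keep

    def _steps(self):
        # Return the sorted step numbers of all complete checkpoints.
        steps = []
        if self.root.is_dir():
            for path in self.root.iterdir():
                match = re.fullmatch(r"global_step_(\d+)", path.name)
                if match and (path / "state.json").is_file():
                    steps.append(int(match.group(1)))
        return sorted(steps)

    def save(self, step, state):
        # Write to a temporary file first, then rename, so a crash
        # mid-write never leaves a half-written state.json behind.
        ckpt_dir = self.root / f"global_step_{step}"
        ckpt_dir.mkdir(parents=True, exist_ok=True)
        tmp_file = ckpt_dir / "state.json.tmp"
        tmp_file.write_text(json.dumps(state))
        tmp_file.replace(ckpt_dir / "state.json")
        # Prune everything except the newest max_to_keep checkpoints.
        for old_step in self._steps()[: -self.max_to_keep]:
            shutil.rmtree(self.root / f"global_step_{old_step}")

    def load_latest(self):
        # Return (step, state) for the newest checkpoint, or (0, None)
        # when there is nothing to resume from.
        steps = self._steps()
        if not steps:
            return 0, None
        step = steps[-1]
        state_file = self.root / f"global_step_{step}" / "state.json"
        return step, json.loads(state_file.read_text())


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp_dir:
        manager = SimpleCheckpointManager(tmp_dir, max_to_keep=2)
        start_step, _ = manager.load_latest()  # fresh run: nothing to resume
        for step in range(start_step + 1, 4):
            manager.save(step, {"step": step})
        print(manager.load_latest())
```

In a real FSDP run the state would be sharded across ranks and saved with the framework's own utilities; the sketch only shows the step-indexed layout, pruning, and resume-from-latest logic.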
API
Demonstrate how the API changes if any.
Usage Example
bash examples/sft/gsm8k/run_qwen_05.sh
Test
For changes that cannot be tested by CI (e.g., algorithm implementation, new model support), validate by experiment(s) and show results like training curve plots, evaluation results, etc.
Additional Info.
- Issue Number: Fixes issue # or discussion # if any.
- Training: [Note which backend this PR will affect: FSDP, Megatron, both, or none]
- Inference: [Note which backend this PR will affect: vLLM, SGLang, both, or none]
Checklist Before Submitting
- [x] Read the Contribute Guide.
- [x] Apply pre-commit checks.
- [ ] Add [BREAKING] to the PR title if it breaks any API.
- [ ] Update the documentation about your changes in the docs.
- [ ] Add CI test(s) if necessary.
@MaxwellJryao Could you rebase and look into the bug, please? I can retrigger CI once you've finished~
Hi @MaxwellJryao
I'm currently working on improving checkpoint resume as well (see PR #2292), and the maintainer suggested checking if we could combine parts of our implementations. Would you be okay with me cherry-picking some implementations? Or are you still actively maintaining that PR?
Happy to discuss!