
Cannot resume FSDP optimizer state

Open: qywu opened this issue on Apr 27 '23

The following line does not save the optimizer state correctly when training with FSDP:

https://github.com/huggingface/transformers/blob/88399476c3892435395618ed37993176dbb0de73/src/transformers/trainer.py#L2383

It should instead use FSDP's full_optim_state_dict, which gathers the optimizer state that is sharded across processes into a single, full state dict:

FSDP.full_optim_state_dict(self.model, self.optimizer)
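To illustrate why saving only the local optimizer.state_dict() breaks resuming, here is a toy, pure-Python model. It does not use PyTorch. The helpers shard and gather_full_state are hypothetical stand-ins, and the values are made up. The sketch assumes a flat parameter that is split evenly across ranks, with each rank's optimizer holding state only for its own slice.

```python
from typing import List

# Toy model (hypothetical helpers, not the PyTorch API). FSDP flattens and
# shards parameters, so each rank's optimizer holds state, e.g. Adam's
# exp_avg, only for its own slice of the flat parameter.

def shard(flat: List[float], rank: int, world_size: int) -> List[float]:
    """Return the contiguous slice of `flat` owned by `rank` (assumes an even split)."""
    size = len(flat) // world_size
    return flat[rank * size:(rank + 1) * size]


def gather_full_state(local_states: List[List[float]]) -> List[float]:
    """Concatenate the per-rank shards in rank order, mimicking the
    consolidation that FSDP.full_optim_state_dict performs."""
    full: List[float] = []
    for local in local_states:
        full.extend(local)
    return full


world_size = 2
exp_avg = [0.1, 0.2, 0.3, 0.4]  # unsharded optimizer state (made-up values)
local_states = [shard(exp_avg, r, world_size) for r in range(world_size)]

# Buggy save: rank 0 writes only its own optimizer.state_dict() shard,
# and on resume every rank loads that same file.
buggy_checkpoint = local_states[0]
resumed_buggy = [buggy_checkpoint for _ in range(world_size)]

# Fixed save: consolidate all shards first. On resume, each rank
# re-shards the full state and keeps only its own slice.
full_checkpoint = gather_full_state(local_states)
resumed_fixed = [shard(full_checkpoint, r, world_size) for r in range(world_size)]

print("buggy:", resumed_buggy)  # buggy: [[0.1, 0.2], [0.1, 0.2]]
print("fixed:", resumed_fixed)  # fixed: [[0.1, 0.2], [0.3, 0.4]]
```

In the buggy case, rank 1 resumes with rank 0's state. In PyTorch itself, full_optim_state_dict is a collective, so every rank must call it, even though with the default rank0_only=True only rank 0 receives the consolidated dict. On load, the full dict has to be re-sharded again, for example with FSDP.shard_full_optim_state_dict or FSDP.scatter_full_optim_state_dict.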

qywu, Apr 27 '23 20:04

cc @pacman100

sgugger, Apr 27 '23 20:04

Hello @qywu, indeed, that seems to be the case. Since you already have the fix, it would be great if you could raise a PR with it. Thank you!

pacman100, May 02 '23 05:05