transformers
Cannot resume FSDP optimizer state
This line does not save the optimizer state correctly when using FSDP:
https://github.com/huggingface/transformers/blob/88399476c3892435395618ed37993176dbb0de73/src/transformers/trainer.py#L2383
It should use FSDP's full_optim_state_dict to collect the sharded optimizer state from the different processes:
FSDP.full_optim_state_dict(self.model, self.optimizer)
cc @pacman100
Hello @qywu, indeed, that seems to be the case. Since you already have the fix, it would be great if you could raise a PR with it. Thank you!