transformers
Cannot resume FSDP optimizer state
This line does not save the optimizer state correctly when using FSDP:
https://github.com/huggingface/transformers/blob/88399476c3892435395618ed37993176dbb0de73/src/transformers/trainer.py#L2383
It should use FSDP's full_optim_state_dict to collect the sharded optimizer state from the different processes:
FSDP.full_optim_state_dict(self.model, self.optimizer)
cc @pacman100
Hello @qywu, indeed, that seems to be the case. Since you already have the fix, it would be great if you could raise a PR with it. Thank you!