Accelerate 0.31.0 gradient accumulation bug.
System Info
I updated accelerate from 0.30.0 to 0.31.0, and all of my training runs with `gradient_accumulation_steps > 1` started to collapse. Please double-check that everything is OK.
Reproduction
Run any training with `mixed_precision='fp16'` and `gradient_accumulation_steps > 1`.
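For context, the invariant gradient accumulation is supposed to preserve: summing per-microbatch gradients, each scaled by `1 / gradient_accumulation_steps`, should reproduce the full-batch gradient, so training curves should match at equal effective batch size. This is a minimal plain-Python sketch of that check (hypothetical toy data, no Accelerate code), not the library's implementation:

```python
def grad_mse(w, xs, ys):
    # d/dw of mean((w*x - y)^2) over one (micro)batch
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)

w = 0.5
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]

# Full-batch gradient (gradient_accumulation_steps = 1).
full = grad_mse(w, xs, ys)

# Accumulated gradient over 2 microbatches, each scaled by 1/steps.
steps = 2
accum = 0.0
for i in range(steps):
    mb_x = xs[i * 2:(i + 1) * 2]
    mb_y = ys[i * 2:(i + 1) * 2]
    accum += grad_mse(w, mb_x, mb_y) / steps

print(abs(full - accum) < 1e-12)  # True: the two settings agree
```

If a release breaks this scaling (or the fp16 loss scaling around it), runs with `gradient_accumulation_steps > 1` see effectively different gradients than the `= 1` baseline, which would match the reported behavior.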
Expected behavior
Training should be stable with both `gradient_accumulation_steps = 1` and `gradient_accumulation_steps > 1`; instead, all of my runs with `gradient_accumulation_steps > 1` started to collapse.
Could you please provide more details? What does "collapse" mean?
Also, could you share your `accelerate env` output and, if possible, the code to reproduce the failing training?