Accelerate 0.31.0 gradient accumulation bug.

Open. nikitabalabin opened this issue 1 year ago • 1 comment

System Info

I updated from accelerate 0.30.0 to 0.31.0, and all my trainings with gradient_accumulation_steps > 1 started to collapse. Please double-check that everything is OK.

Reproduction

Train with mixed_precision='fp16' and gradient_accumulation_steps > 1.

Expected behavior

The training should be stable with both gradient_accumulation_steps = 1 and gradient_accumulation_steps > 1.
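
For readers unfamiliar with why the two settings are expected to behave the same, here is a minimal sketch of the underlying arithmetic. It is plain Python and does not use Accelerate or fp16. The toy one-parameter model, the data, and the helper names `grad` and `accumulated_grad` are all hypothetical and not taken from the report. It only illustrates that averaging gradients over equal micro-batches, each scaled by 1/steps, should match the full-batch gradient. It says nothing about what changed in 0.31.0.

```python
# Toy model: prediction = w * x, loss = mean((w * x - y) ** 2).
# Hypothetical helpers for illustration only; this is not Accelerate's code.

def grad(w, xs, ys):
    """Gradient of the mean squared error with respect to w."""
    n = len(xs)
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / n

def accumulated_grad(w, xs, ys, steps):
    """Accumulate gradients over `steps` equal micro-batches.

    Each micro-batch gradient is scaled by 1/steps before it is added,
    which is the usual convention for gradient accumulation.
    """
    size = len(xs) // steps
    total = 0.0
    for i in range(steps):
        lo, hi = i * size, (i + 1) * size
        total += grad(w, xs[lo:hi], ys[lo:hi]) / steps
    return total

xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [2.0, 4.1, 5.9, 8.2, 9.8, 12.1, 14.0, 15.9]
w = 0.5

full = grad(w, xs, ys)                       # gradient_accumulation_steps = 1
acc = accumulated_grad(w, xs, ys, steps=4)   # gradient_accumulation_steps = 4

# The two gradients should agree up to floating-point rounding.
print(abs(full - acc) < 1e-9)
```

If the 1/steps scaling were dropped, the accumulated gradient would be `steps` times larger than the full-batch one. That is one generic way accumulation can destabilize training, though nothing in this thread confirms it as the cause here.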

nikitabalabin, Jun 17 '24 21:06

"all my trainings with gradient_accumulation_steps > 1 started to collapse."

Could you please provide more details? What does "collapse" mean?

Moreover, could you share the output of accelerate env and, if possible, the code to reproduce the failing training?

BenjaminBossan, Jun 18 '24 12:06

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

github-actions[bot], Jul 18 '24 15:07