Missing optimizer step
If `max_steps` or the length of the data is not divisible by `gradient_accumulation_steps`, some gradients are lost, because the update only happens at `if (step + 1) % gradient_accumulation_steps == 0:`.
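As a rough illustration of how many updates this drops, here is a small counting sketch (the batch count and accumulation value are made up for the example, not taken from the script):

```python
# Illustrative numbers only: 10 batches with accumulation over 3 steps.
num_batches = 10
gradient_accumulation_steps = 3

# Updates fire only when (step + 1) is a multiple of gradient_accumulation_steps.
updates = sum(1 for step in range(num_batches)
              if (step + 1) % gradient_accumulation_steps == 0)

print(updates)  # 3 updates, consuming 9 batches
print(num_batches - updates * gradient_accumulation_steps)  # 1 batch whose gradient is never applied
```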
Hi @FlorisFok, do you have suggestions as to how this should be fixed?
Hi @timoschick, by adding an `or` condition to the gradient accumulation `if` statement. This extra condition would also fire when the loop reaches the final batch. First define:
```python
last_batch = len(train_dataloader) - 1
```
Then modify the update condition to:
```python
if (step + 1) % gradient_accumulation_steps == 0 or last_batch == b_nr:
```
Here `b_nr` (the batch number) is the index yielded by the `enumerate` call over the dataloader. In theory the `step` variable already in the script could be used for this, but it currently behaves exactly like `global_step`; I think that is also a mistake, though that depends on how the two are meant to be defined.
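For what it's worth, here is a minimal, self-contained sketch of the loop with that extra condition. The toy model, optimizer and dataloader are stand-ins for the objects in the real script, and the single index `b_nr` plays the role of both `step` and the batch number:

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

# Toy stand-ins for the real model, optimizer and data.
model = nn.Linear(4, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
dataset = TensorDataset(torch.randn(10, 4), torch.randn(10, 1))
train_dataloader = DataLoader(dataset, batch_size=1)
gradient_accumulation_steps = 3  # 10 batches is not divisible by 3

last_batch = len(train_dataloader) - 1

for b_nr, (x, y) in enumerate(train_dataloader):
    loss = nn.functional.mse_loss(model(x), y) / gradient_accumulation_steps
    loss.backward()

    # Update on every full accumulation window, and also on the final batch,
    # so the trailing partial window (the 10th batch here) is not dropped.
    if (b_nr + 1) % gradient_accumulation_steps == 0 or b_nr == last_batch:
        optimizer.step()
        optimizer.zero_grad()
```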