
[BUG] Gradient Accumulation Steps Initialization Bug in Pipeline Parallel Mode

Open fwyc0573 opened this issue 1 year ago • 0 comments

Describe the bug

I reviewed the initialization of self.gradient_accumulation_steps in the DeepSpeedConfig module for the case where only train_batch and micro_batch are set (DeepSpeed version: 0.13.1):

grad_acc = train_batch // micro_batch   # total micro-batches per global step
grad_acc //= self.world_size            # divided across all ranks, not only data-parallel ranks
self.gradient_accumulation_steps = grad_acc

However, in PP+DP (pipeline parallel + data parallel) mode, not every rank is assigned its own batch of training data, since only the data-parallel dimension splits the global batch. Shouldn't self.world_size in the formula above be replaced with dp_degree? Correspondingly, the consistency check for train_batch should be:

 train_batch = grad_acc * micro_batch * dp_degree
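
For illustration, a minimal sketch of the computation I have in mind (the names pp_degree and dp_degree are mine, not DeepSpeed's actual attributes; I assume world_size = pp_degree * dp_degree with no tensor parallelism):

    # hypothetical sketch: derive grad_acc from the data-parallel degree only
    dp_degree = world_size // pp_degree      # assumes world_size = pp_degree * dp_degree
    grad_acc = train_batch // micro_batch    # total micro-batches per global step
    grad_acc //= dp_degree                   # split only across data-parallel replicas
    assert train_batch == grad_acc * micro_batch * dp_degree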

The current initialization produces an unexpected value of grad_acc in my PP+DP training runs. I may be misunderstanding something; please correct me if so. Thank you.
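
As a concrete example with a hypothetical configuration (8 GPUs, pipeline parallel degree 4, data parallel degree 2), the current formula yields a grad_acc that no longer reproduces the configured train_batch, while dividing by dp_degree does:

    # hypothetical configuration: 8 GPUs, PP=4, DP=2
    world_size, pp_degree, dp_degree = 8, 4, 2
    train_batch, micro_batch = 32, 2

    grad_acc_current = train_batch // micro_batch // world_size   # = 2
    grad_acc_proposed = train_batch // micro_batch // dp_degree   # = 8

    print(grad_acc_current * micro_batch * dp_degree)    # 8, does not match train_batch
    print(grad_acc_proposed * micro_batch * dp_degree)   # 32, matches train_batch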

fwyc0573 · Apr 15 '24 07:04