Addressing ipg Buffer Data Race Condition in ZeRO Stage 2

Open xxr3376 opened this issue 2 years ago • 0 comments

This pull request fixes a potential gradient data corruption issue introduced by https://github.com/microsoft/DeepSpeed/pull/2500. That PR changed the synchronization pattern so that communication can overlap with backward computation in ZeRO stage 2. However, it also inadvertently introduced a potential data race, because the ipg buffer is reused without any synchronization beforehand.

To fix this, the PR adds the missing synchronization back. A finish event is recorded on the reduction stream for each ipg buffer. Before a buffer is reused, we check that event to make sure the previous reader has completed, which prevents data corruption.
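The pattern described above (one finish event per buffer, waited on before the buffer is written again) can be illustrated with a small CPU-only analogy. This is a sketch, not the DeepSpeed code: it uses Python threads and `threading.Event` in place of CUDA streams and events, and the names `run_buckets`, `NUM_BUFFERS`, and `BUCKET_SIZE` are hypothetical. On a GPU the corresponding step would be an event recorded on the reduction stream.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor

NUM_BUFFERS = 2  # hypothetical: two alternating ipg-style buffers
BUCKET_SIZE = 4  # hypothetical bucket length


def run_buckets(num_buckets, wait_before_reuse=True):
    """Simulate a writer ("backward") and an asynchronous reader ("reduction")."""
    buffers = [[0.0] * BUCKET_SIZE for _ in range(NUM_BUFFERS)]
    # One "finish" event per buffer; set means no reader is outstanding.
    finished = [threading.Event() for _ in range(NUM_BUFFERS)]
    for event in finished:
        event.set()
    results = [None] * num_buckets

    def reduce_bucket(bucket_id, slot):
        time.sleep(0.01)  # the reduction lags behind the writer
        results[bucket_id] = sum(buffers[slot])  # the reader of the buffer
        finished[slot].set()  # record that this buffer is free again

    # A single worker thread stands in for the reduction stream.
    with ThreadPoolExecutor(max_workers=1) as reduction_stream:
        for bucket_id in range(num_buckets):
            slot = bucket_id % NUM_BUFFERS
            if wait_before_reuse:
                finished[slot].wait()  # block until the previous reader is done
            finished[slot].clear()
            # The writer fills the buffer with this bucket's "gradients".
            buffers[slot][:] = [float(bucket_id + 1)] * BUCKET_SIZE
            reduction_stream.submit(reduce_bucket, bucket_id, slot)
    return results


if __name__ == "__main__":
    print(run_buckets(6))
```

With `wait_before_reuse=True`, every bucket is reduced from the values that were written for it. Passing `wait_before_reuse=False` removes the wait, so the writer can overwrite a buffer that the reader has not consumed yet, which is the kind of race the PR describes.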

The fix was tested on a model pretraining job spread across more than 60 nodes. It resolved the issue, as shown by the corrected convergence curve. As expected, FLOPS declined slightly because of the added synchronization.

xxr3376 • Jun 09 '23 11:06