Fix ipg Buffer Data Race in ZeRO Stage 2
This pull request fixes a potential gradient data corruption issue introduced by https://github.com/microsoft/DeepSpeed/pull/2500. That PR changed the synchronization pattern so that communication can overlap with backward computation in ZeRO stage 2. In doing so, it dropped the synchronization that guarded reuse of the ipg buffer, so a buffer can be overwritten while its previous reduction is still reading it, which is a data race.
To fix this, this PR adds the missing synchronization back. A finish event is recorded on the reduction stream for each ipg buffer. Before a buffer is reused, we check that event to make sure the previous reader has completed, which prevents the data corruption.
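The pattern can be illustrated with a minimal CPU-only sketch. This is an analogy, not the code in this PR: it assumes two buffers used alternately, uses `threading.Event` in place of a stream event, and uses a worker thread in place of the reduction stream. The names `run`, `NUM_BUFFERS`, `finished`, and `reducer` are hypothetical.

```python
import queue
import threading

NUM_BUFFERS = 2  # assumption for this sketch: two buffers used alternately


def run(num_steps):
    """Fill buffers in turn and hand each one to an asynchronous reader."""
    buffers = [[0] * 4 for _ in range(NUM_BUFFERS)]
    # One "finish" event per buffer; set means the last reader is done.
    finished = [threading.Event() for _ in range(NUM_BUFFERS)]
    for event in finished:
        event.set()  # every buffer starts out free
    work = queue.Queue()
    results = []

    def reducer():
        # Stand-in for the reduction stream: reads buffers asynchronously.
        while True:
            idx = work.get()
            if idx is None:
                break
            results.append(sum(buffers[idx]))  # the "reduction" reads the buffer
            finished[idx].set()  # record the finish event for this buffer

    worker = threading.Thread(target=reducer)
    worker.start()
    for step in range(num_steps):
        idx = step % NUM_BUFFERS
        finished[idx].wait()   # wait until the previous reader has completed
        finished[idx].clear()  # the buffer is in use again
        buffers[idx][:] = [step] * 4  # overwriting is safe at this point
        work.put(idx)
    work.put(None)  # tell the reader to stop
    worker.join()
    return results


print(run(6))  # [0, 4, 8, 12, 16, 20]
```

Without the `finished[idx].wait()` call, the producer could overwrite a buffer before the reader has consumed it, and some of the sums would reflect a later step's data. The wait is also where the overlap is lost, which matches the small performance cost expected from the added synchronization.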
This fix has been tested on a pretraining run spread across more than 60 nodes. It resolved the issue, as shown by the corrected convergence curve. As expected, FLOPS dropped slightly because of the added synchronization.