Julien Launay

3 comments by Julien Launay

Hmm, I could be wrong, but this `all_reduce` only acts across **model** parallelism. My understanding of Appendix A.1 is that this should be done across **data** parallelism instead (so actually...
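To illustrate the distinction, here is a minimal pure-Python sketch of how a rank can belong to both a model-parallel group and a data-parallel group. The 2x2 layout (2-way model parallelism, 2-way data parallelism) and the group-indexing convention are assumptions for illustration, not Megatron's exact implementation:

```python
# Hypothetical 4-rank grid: 2-way model parallelism x 2-way data parallelism.
WORLD_SIZE = 4
MP_SIZE = 2  # model-parallel group size (assumed layout)

def model_parallel_group(rank):
    # Ranks that hold different shards of the SAME replica:
    # consecutive ranks form one model-parallel group.
    start = (rank // MP_SIZE) * MP_SIZE
    return list(range(start, start + MP_SIZE))

def data_parallel_group(rank):
    # Ranks that hold the SAME shard across different replicas:
    # an all_reduce over this group is what syncs replicated gradients.
    return list(range(rank % MP_SIZE, WORLD_SIZE, MP_SIZE))

print(model_parallel_group(0))  # [0, 1]
print(data_parallel_group(0))   # [0, 2]
```

An `all_reduce` over the first group communicates between shards of one replica; only an `all_reduce` over the second group actually synchronizes gradients across data-parallel replicas.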

https://github.com/bigscience-workshop/Megatron-DeepSpeed/blob/7b9988146881f6eee33f69c28a92ae03e2678e42/megatron/model/distributed.py#L188-L218 This is where the data-parallel `all_reduce` occurs in the "simplified" DDP implemented by Megatron (`DDP_impl` in `local` mode). If the PyTorch DDP is used instead, then we would...
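As a minimal, pure-Python illustration of what that data-parallel `all_reduce` does (not the actual `torch.distributed` call — this just simulates the sum-then-average over replica gradients):

```python
def all_reduce_mean(rank_grads):
    """Simulate all_reduce(SUM) followed by division by the
    data-parallel world size: every rank ends up with the mean
    of all ranks' gradients, element-wise."""
    world_size = len(rank_grads)
    summed = [sum(vals) for vals in zip(*rank_grads)]
    averaged = [v / world_size for v in summed]
    # After the collective, every rank holds the same averaged gradient.
    return [list(averaged) for _ in rank_grads]

# Two data-parallel ranks with different local gradients:
print(all_reduce_mean([[1.0, 2.0], [3.0, 4.0]]))
# -> [[2.0, 3.0], [2.0, 3.0]]
```

The point of the linked lines is that this averaging only happens for `DDP_impl=local`; with PyTorch's own DDP the bucketed gradient sync takes over, so the behavior differs.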

I don't think this is an acceptable solution, as ZeRO-DP comes with some pretty nice savings (plus we wouldn't want to maintain two different data parallelism schemes just for gradient...