[QUESTION] For DDP, why map a parameter's main_grad to the grad buffer instead of grad?
@deepakn94 Hi, I'm diving deep into Megatron-LM's implementation. For the DDP wrapper, the current implementation maps each parameter's main_grad to a view of the grad buffer.
https://github.com/NVIDIA/Megatron-LM/blob/main/megatron/core/distributed/grad_buffer.py#L272-L280
Then, in the backward hook, grad is added to main_grad manually.
https://github.com/NVIDIA/Megatron-LM/blob/main/megatron/core/distributed/distributed_data_parallel.py#L151-L154
My question is: why not map each parameter's grad to the grad buffer directly and let torch accumulate gradients into it automatically?
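To make the pattern concrete, here is a rough sketch of what I understand the current implementation does. This is not Megatron's actual code: it assumes a single dense FP32 buffer, and it uses Tensor.register_post_accumulate_grad_hook (PyTorch >= 2.1) in place of Megatron's hook on the grad accumulation node.

```python
import torch

model = torch.nn.Linear(4, 4)

# One contiguous FP32 buffer that all gradients live in
# (what the bucketed all-reduce operates on).
numel = sum(p.numel() for p in model.parameters())
grad_buffer = torch.zeros(numel, dtype=torch.float32)

# Map each parameter's main_grad to a view of the shared buffer.
offset = 0
for p in model.parameters():
    p.main_grad = grad_buffer[offset:offset + p.numel()].view_as(p)
    offset += p.numel()

# Backward hook: fold the freshly produced .grad into main_grad,
# then drop .grad so the temporary gradient is freed.
def hook(param):
    param.main_grad.add_(param.grad)
    param.grad = None

for p in model.parameters():
    p.register_post_accumulate_grad_hook(hook)

# After backward(), gradients are in grad_buffer / p.main_grad, not p.grad.
model(torch.randn(2, 4)).sum().backward()
```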
PyTorch’s autograd functionality assumes that a model parameter and its corresponding gradient have the same data type. However, while low-precision data types like FP8 are sufficient for evaluating a neural network’s forward and backward passes, the optimization step typically requires full FP32 precision to avoid significant learning degradation. In addition, Tensor Cores on Hopper GPUs have the option to accumulate matrix products directly into FP32, resulting in better numerical accuracy and avoiding the need for a separate casting kernel. Thus, Transformer Engine provides an option to directly generate FP32 gradients for weight tensors. The FP32 gradients are not output to the parameter’s grad tensor, but rather to a main_grad tensor that must be initialized before the backward pass. (https://docs.nvidia.com/deeplearning/transformer-engine/user-guide/examples/advanced_optimizations.html)
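A quick way to see the constraint (my own illustrative snippet, not from the docs, assuming a recent PyTorch that enforces the dtype check on .grad assignment):

```python
import torch

# Low-precision parameter, e.g. a BF16 weight.
p = torch.nn.Parameter(torch.randn(4, dtype=torch.bfloat16))

# Autograd insists that .grad matches the parameter's dtype,
# so an FP32 gradient cannot live in p.grad.
try:
    p.grad = torch.zeros(4, dtype=torch.float32)
except RuntimeError as e:
    print(e)  # dtype mismatch between gradient and tensor

# main_grad is just a plain attribute, so it can be the FP32 accumulation
# buffer that the optimizer (and FP32-accumulating wgrad kernels) write to.
p.main_grad = torch.zeros(4, dtype=torch.float32)
```

So accumulating into main_grad sidesteps the dtype restriction on .grad while still giving the optimizer a full-precision gradient.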
@hxdtest Got it, thanks for your kind reply!