[ENHANCEMENT] Is Megatron planning to use Flux technology? Integrating communication and GEMM into one operator to improve the overlap rate
https://arxiv.org/abs/2406.06858v1
https://github.com/bytedance/flux
We are looking into an approach that fuses the GEMM and its dependent communication into a single kernel. Support for such an optimization will take some time to ensure reliability.
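For anyone unfamiliar with the motivation: in a tensor-parallel layer the communication depends on the GEMM output, so when the two run as separate steps the communication time is fully exposed. A minimal sketch of that unfused baseline in plain PyTorch (illustrative only, not Megatron's actual code):

```python
import torch
import torch.distributed as dist

# Illustrative sketch of the unfused baseline that kernel fusion targets
# (not Megatron's actual code). In a row-parallel linear layer with
# sequence parallelism, each rank computes a partial output and the
# partial outputs are then reduce-scattered. Run as two separate steps,
# the collective cannot start until the entire GEMM has finished, so the
# communication time is fully exposed.

def row_parallel_linear_unfused(x: torch.Tensor, w: torch.Tensor) -> torch.Tensor:
    # x: [seq, hidden_in / tp_size] local shard; w: [hidden_in / tp_size, hidden_out]
    partial = x @ w  # full GEMM over the local shard
    out = torch.empty(
        (x.shape[0] // dist.get_world_size(), w.shape[1]),
        device=x.device,
        dtype=x.dtype,
    )
    # Blocking collective: starts only after the whole GEMM completes.
    dist.reduce_scatter_tensor(out, partial)
    return out
```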
Marking as stale. No activity in 60 days.
@erhoo82 Hi, is there any progress on this? Flux is quite hard to use for training, and we are really looking forward to support in Megatron...
Are you working on this? I'm also interested.
Marking as stale. No activity in 60 days.
Sharing updates.
Currently, Megatron-LM supports overlapping tensor-parallel communication with computation using the split GEMM and communication kernels from Transformer Engine. Transformer Engine plans to migrate the overlap implementation from a custom in-package build (userbuffers) to a cublasMp backend that uses NVSHMEM. The new backend will still use split kernels for GEMM and communication.
We are still discussing a single-kernel implementation, but no detailed plan has been established yet.
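For readers following along, here is a rough sketch of what split-kernel overlap looks like conceptually: chunk the GEMM so that the collective for chunk i runs while chunk i+1 is still being computed. This uses plain torch.distributed calls for illustration; TE's userbuffers and the planned cublasMp backend do this far more efficiently with dedicated kernels.

```python
import torch
import torch.distributed as dist

# Conceptual sketch of split-kernel overlap (illustrative only, not the
# Transformer Engine implementation). The GEMM is split into chunks so
# that the reduce-scatter of chunk i, launched with async_op=True, runs
# on NCCL's internal stream while chunk i+1's GEMM runs on the compute
# stream.

def row_parallel_linear_overlapped(
    x: torch.Tensor, w: torch.Tensor, num_chunks: int = 4
) -> torch.Tensor:
    # Assumes x.shape[0] is divisible by num_chunks * world_size.
    ws = dist.get_world_size()
    outs, handles = [], []
    for xc in x.chunk(num_chunks, dim=0):
        partial = xc @ w  # GEMM for this chunk on the compute stream
        out = torch.empty(
            (xc.shape[0] // ws, w.shape[1]), device=x.device, dtype=x.dtype
        )
        # async_op=True returns a handle immediately, so the next chunk's
        # GEMM is launched while this reduce-scatter is still in flight.
        handles.append(dist.reduce_scatter_tensor(out, partial, async_op=True))
        outs.append(out)
    for h in handles:
        h.wait()
    return torch.cat(outs, dim=0)
```

In Megatron-LM, this overlap is enabled via the `--tp-comm-overlap` flag (together with `--sequence-parallel`), if I remember the argument names correctly.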