Xiaowei Ren

Results: 8 issues by Xiaowei Ren

With the latest implementation of Latency Hiding Scheduling, we observe that most weight-gradient all-reduce latency is still exposed [(see slides 6 and 7 here)](https://docs.google.com/presentation/d/1s2B4DPuhOVQbJ4SAZA7XWBKL5ST-Dfcn/edit#slide=id.g1895a52e93e_0_0). Here is a brief...
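
A minimal sketch of the overlap idea, assuming an initialized `torch.distributed` process group; it only illustrates hiding weight-gradient all-reduce latency behind remaining compute, and is not the XLA latency-hiding scheduler itself (helper names are hypothetical):

```python
import torch.distributed as dist

def start_grad_allreduce(params):
    """Kick off one non-blocking all-reduce per weight gradient so the
    communication kernels can run concurrently with compute that is
    still queued on the GPU."""
    return [dist.all_reduce(p.grad, async_op=True)
            for p in params if p.grad is not None]

def finish_grad_allreduce(handles):
    """Block only right before the reduced gradients are consumed,
    e.g. just before optimizer.step()."""
    for h in handles:
        h.wait()
```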

By default, layout assignment tries to assign a layout to transposes that makes them a bitcast. This layout is then propagated inside the HloComputation, which means that if it does not...
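
For intuition, here is a NumPy analogy (illustrative only, not XLA code): a transpose whose output layout is just a stride permutation moves no data, which is what layout assignment aims for when it makes a transpose a bitcast.

```python
import numpy as np

x = np.zeros((2, 3), dtype=np.float32)
t = x.T                      # a view: same buffer, strides permuted ("bitcast")
assert t.base is x           # no data was copied

c = np.ascontiguousarray(t)  # forcing a row-major layout materializes a copy
assert c.flags['C_CONTIGUOUS'] and c.base is None
```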

# What does this PR do?
Remove unnecessary attention masks. [Related MCore MR is here.](https://gitlab-master.nvidia.com/ADLR/megatron-lm/-/merge_requests/1259)

Labels: core, NLP

# Description
This is a CP implementation variant with KV all-gather. Currently, it supports:
- sliding window attention + causal + FlashAttention
- full window attention + causal +...
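
A minimal sketch of the KV all-gather flavor of CP, assuming an initialized CP process group; names are illustrative, and the real implementation fuses this with FlashAttention and the causal/sliding-window masking listed above:

```python
import torch
import torch.distributed as dist

def cp_attn_allgather_kv(q, k, v, cp_group):
    """Each rank holds its Q chunk plus a local K/V chunk; K/V are
    all-gathered so every rank attends over the full sequence.
    Shapes assumed: q [sq, h, d], k/v [skv_local, h, d]."""
    world = dist.get_world_size(cp_group)
    k_parts = [torch.empty_like(k) for _ in range(world)]
    v_parts = [torch.empty_like(v) for _ in range(world)]
    dist.all_gather(k_parts, k, group=cp_group)
    dist.all_gather(v_parts, v, group=cp_group)
    k_full, v_full = torch.cat(k_parts, dim=0), torch.cat(v_parts, dim=0)
    # plain local attention over the gathered sequence (masking elided)
    scores = torch.einsum("qhd,khd->hqk", q, k_full) / q.shape[-1] ** 0.5
    return torch.einsum("hqk,khd->qhd", scores.softmax(dim=-1), v_full)
```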

# Description
This PR adds a hierarchical implementation of context parallelism to attention. It uses A2A communication in low-level CP groups (e.g., over NVLink) and P2P communication in high-level CP...
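
A sketch of how a flat CP group could be split into the two levels, assuming ranks are laid out node-major; the helper and its group shapes are illustrative, not TE's exact construction:

```python
import torch.distributed as dist

def build_hierarchical_cp_groups(cp_size, a2a_size):
    """Return (a2a_group, p2p_group) for the calling rank: a2a_size ranks
    share fast links (e.g., NVLink), and cp_size // a2a_size low-level
    groups talk to each other over the slower inter-node fabric."""
    rank = dist.get_rank()
    a2a_group = p2p_group = None
    # low-level groups: consecutive ranks on one node use all-to-all
    for i in range(cp_size // a2a_size):
        ranks = list(range(i * a2a_size, (i + 1) * a2a_size))
        g = dist.new_group(ranks)  # must be called on every rank
        if rank in ranks:
            a2a_group = g
    # high-level groups: one rank per node, connected with P2P
    for j in range(a2a_size):
        ranks = list(range(j, cp_size, a2a_size))
        g = dist.new_group(ranks)
        if rank in ranks:
            p2p_group = g
    return a2a_group, p2p_group
```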

# Description
In TE-PyTorch, added a pybind of `sm_arch` to cache the device compute capability, removing the CPU overhead of repeated calls to `torch.cuda.get_device_properties()`. Also fixed the `batch_p2p_comm` check by making...
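
The caching idea in Python terms (TE does it through a C++ pybind, but an `lru_cache` shows the same effect; assumes a CUDA-enabled PyTorch build):

```python
import functools
import torch

@functools.lru_cache(maxsize=None)
def sm_arch(device_index: int = 0) -> int:
    """Query the compute capability once per device and reuse it,
    avoiding repeated CPU-side get_device_properties() calls."""
    props = torch.cuda.get_device_properties(device_index)
    return props.major * 10 + props.minor  # e.g., 90 on Hopper (sm_90)
```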

# Description
- softmax_lse correction currently runs in FP64; we can lower it to FP32.
- use `log1p` to be consistent with [PR1401](https://github.com/NVIDIA/TransformerEngine/pull/1401).

## Type of change
- [ ]...
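
For reference, a sketch of the log-sum-exp correction in question; `log1p` keeps the merge numerically stable even in FP32, since the argument of `exp` is always ≤ 0 (function names are illustrative):

```python
import torch

def merge_lse(lse_a, lse_b):
    """log(exp(a) + exp(b)) computed stably: the exp argument is <= 0."""
    hi = torch.maximum(lse_a, lse_b)
    lo = torch.minimum(lse_a, lse_b)
    return hi + torch.log1p(torch.exp(lo - hi))

def merge_attn(out_a, lse_a, out_b, lse_b):
    """Combine two partial attention outputs with their LSE statistics
    (assumes lse_* broadcast against out_*)."""
    lse = merge_lse(lse_a, lse_b)
    return out_a * torch.exp(lse_a - lse) + out_b * torch.exp(lse_b - lse), lse
```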

> [!IMPORTANT]
> The `Update branch` button must only be pressed on very rare occasions.
> An outdated branch never blocks the merge of a PR.
> Please reach...

Run CICD