Xiaowei Ren

Results: 8 issues by Xiaowei Ren

With the latest implementation of Latency Hiding Scheduling, we observe that most weight-gradient all-reduce latency is still exposed [(see slides 6 and 7 here)](https://docs.google.com/presentation/d/1s2B4DPuhOVQbJ4SAZA7XWBKL5ST-Dfcn/edit#slide=id.g1895a52e93e_0_0). Here is a brief...
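
A minimal sketch of the overlap idea, assuming an initialized `torch.distributed` process group; it only illustrates hiding weight-gradient all-reduce latency behind remaining compute, and is not the XLA latency-hiding scheduler itself (helper names are hypothetical):

```python
import torch.distributed as dist

def start_grad_allreduce(params):
    """Kick off one non-blocking all-reduce per weight gradient so the
    communication kernels can run concurrently with compute that is
    still queued on the GPU."""
    return [dist.all_reduce(p.grad, async_op=True)
            for p in params if p.grad is not None]

def finish_grad_allreduce(handles):
    """Block only right before the reduced gradients are consumed,
    e.g. just before optimizer.step()."""
    for h in handles:
        h.wait()
```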

By default, layout assignment tries to assign a layout to transposes that makes them a bitcast. This layout is then propagated inside the HloComputation, which means that if it does not...
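
For intuition, here is a NumPy analogy (illustrative only, not XLA code): a transpose whose output layout is just a stride permutation moves no data, which is what layout assignment aims for when it makes a transpose a bitcast.

```python
import numpy as np

x = np.zeros((2, 3), dtype=np.float32)
t = x.T                      # a view: same buffer, strides permuted ("bitcast")
assert t.base is x           # no data was copied

c = np.ascontiguousarray(t)  # forcing a row-major layout materializes a copy
assert c.flags['C_CONTIGUOUS'] and c.base is None
```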

# What does this PR do?
Remove unnecessary attention masks. [Related MCore MR is here.](https://gitlab-master.nvidia.com/ADLR/megatron-lm/-/merge_requests/1259)

Labels: core, NLP

# Description
This is a CP implementation variant with KV all-gather. Currently, it supports:
- sliding window attention + causal + FlashAttention
- full window attention + causal +...
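
A minimal sketch of the KV all-gather flavor of CP, assuming an initialized CP process group; names are illustrative, and the real implementation fuses this with FlashAttention and the causal/sliding-window masking listed above:

```python
import torch
import torch.distributed as dist

def cp_attn_allgather_kv(q, k, v, cp_group):
    """Each rank holds its Q chunk plus a local K/V chunk; K/V are
    all-gathered so every rank attends over the full sequence.
    Shapes assumed: q [sq, h, d], k/v [skv_local, h, d]."""
    world = dist.get_world_size(cp_group)
    k_parts = [torch.empty_like(k) for _ in range(world)]
    v_parts = [torch.empty_like(v) for _ in range(world)]
    dist.all_gather(k_parts, k, group=cp_group)
    dist.all_gather(v_parts, v, group=cp_group)
    k_full, v_full = torch.cat(k_parts, dim=0), torch.cat(v_parts, dim=0)
    # plain local attention over the gathered sequence (masking elided)
    scores = torch.einsum("qhd,khd->hqk", q, k_full) / q.shape[-1] ** 0.5
    return torch.einsum("hqk,khd->qhd", scores.softmax(dim=-1), v_full)
```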

# Description
This PR adds a hierarchical implementation of context parallelism to attention. It uses A2A communication in low-level CP groups (e.g., over NVLink) and P2P communication in high-level CP...
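
A sketch of how a flat CP group could be split into the two levels, assuming ranks are laid out node-major; the helper and its group shapes are illustrative, not TE's exact construction:

```python
import torch.distributed as dist

def build_hierarchical_cp_groups(cp_size, a2a_size):
    """Return (a2a_group, p2p_group) for the calling rank: a2a_size ranks
    share fast links (e.g., NVLink), and cp_size // a2a_size low-level
    groups talk to each other over the slower inter-node fabric."""
    rank = dist.get_rank()
    a2a_group = p2p_group = None
    # low-level groups: consecutive ranks on one node use all-to-all
    for i in range(cp_size // a2a_size):
        ranks = list(range(i * a2a_size, (i + 1) * a2a_size))
        g = dist.new_group(ranks)  # must be called on every rank
        if rank in ranks:
            a2a_group = g
    # high-level groups: one rank per node, connected with P2P
    for j in range(a2a_size):
        ranks = list(range(j, cp_size, a2a_size))
        g = dist.new_group(ranks)
        if rank in ranks:
            p2p_group = g
    return a2a_group, p2p_group
```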

# Description
In TE-PyTorch, added a pybind of `sm_arch` to cache the device compute capability, removing the CPU overhead of repeated calls to `torch.cuda.get_device_properties()`. Also fixed the `batch_p2p_comm` check by making...
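
The caching idea in Python terms (TE does it through a C++ pybind, but an `lru_cache` shows the same effect; assumes a CUDA-enabled PyTorch build):

```python
import functools
import torch

@functools.lru_cache(maxsize=None)
def sm_arch(device_index: int = 0) -> int:
    """Query the compute capability once per device and reuse it,
    avoiding repeated CPU-side get_device_properties() calls."""
    props = torch.cuda.get_device_properties(device_index)
    return props.major * 10 + props.minor  # e.g., 90 on Hopper (sm_90)
```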

# Description
- softmax_lse correction currently runs in FP64; we can lower it to FP32.
- use `log1p` to be consistent with [PR1401](https://github.com/NVIDIA/TransformerEngine/pull/1401).

## Type of change
- [ ]...
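
For reference, a sketch of the log-sum-exp correction in question; `log1p` keeps the merge numerically stable even in FP32, since the argument of `exp` is always ≤ 0 (function names are illustrative):

```python
import torch

def merge_lse(lse_a, lse_b):
    """log(exp(a) + exp(b)) computed stably: the exp argument is <= 0."""
    hi = torch.maximum(lse_a, lse_b)
    lo = torch.minimum(lse_a, lse_b)
    return hi + torch.log1p(torch.exp(lo - hi))

def merge_attn(out_a, lse_a, out_b, lse_b):
    """Combine two partial attention outputs with their LSE statistics
    (assumes lse_* broadcast against out_*)."""
    lse = merge_lse(lse_a, lse_b)
    return out_a * torch.exp(lse_a - lse) + out_b * torch.exp(lse_b - lse), lse
```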

> [!IMPORTANT]
> The `Update branch` button must only be pressed on very rare occasions.
> An outdated branch never blocks the merge of a PR.
> Please reach...

Run CICD