Antoni-Joan Solergibert
I'm not able to run Zephyr 7B Gemma with 4 80GB A100s. I get the following error: ``` RuntimeError: The size of tensor a (0) must match the size of...
Hi, After going through both Megatron-LM & NeMo, I've found that NeMo configs set by default the [`MegatronDistributedFusedAdam`](https://github.com/NVIDIA/NeMo/blob/874a1eab03fa49e6a10e00ce9518cba699d7eb37/nemo/core/optim/distributed_adam.py#L95) optimizer from the NeMo framework. But Megatron also contains a [`DistributedOptimizer`](https://github.com/NVIDIA/Megatron-LM/blob/fd3c77115c912e67b831c590bdc4f5e08e42f166/megatron/core/optimizer/distrib_optimizer.py#L65). The...
Recently PyTorch integrated the cuDNN attention backend into `torch.nn.functional.scaled_dot_product_attention`. I've tried `torch.nn.functional.scaled_dot_product_attention` on two different machines with H100s, and on both it dispatches `SDPBackend.FLASH_ATTENTION`. Manually switching to `SDPBackend.CUDNN_ATTENTION`...
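For reference, a minimal sketch of how the backend can be forced with the `torch.nn.attention.sdpa_kernel` context manager (available in recent PyTorch releases); the tensor shapes here are placeholders chosen only for the check:

```python
import torch
from torch.nn.attention import SDPBackend, sdpa_kernel

# Placeholder shapes: (batch, heads, seq_len, head_dim) on a CUDA device.
q = torch.randn(1, 8, 1024, 128, device="cuda", dtype=torch.bfloat16)
k = torch.randn(1, 8, 1024, 128, device="cuda", dtype=torch.bfloat16)
v = torch.randn(1, 8, 1024, 128, device="cuda", dtype=torch.bfloat16)

# Default dispatch: PyTorch picks the backend (often FLASH_ATTENTION on H100).
out_default = torch.nn.functional.scaled_dot_product_attention(q, k, v)

# Restrict dispatch to the cuDNN backend; if cuDNN cannot handle these
# inputs, the call errors out instead of silently falling back.
with sdpa_kernel(SDPBackend.CUDNN_ATTENTION):
    out_cudnn = torch.nn.functional.scaled_dot_product_attention(q, k, v)
```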
Hello Megatron team, Visualising a trace with `--overlap-grad-reduce` & `--overlap-param-gather` set, I observed that we aren't overlapping anything when those flags are enabled. I'm running an 8B model with TP = 1,...
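A minimal sketch of how such a trace could be captured around a few training iterations with `torch.profiler`, assuming a `train_step` callable that stands in for the actual Megatron training loop:

```python
import torch
from torch.profiler import profile, ProfilerActivity, schedule

def capture_trace(train_step, num_steps=8):
    # Skip 2 steps, warm up for 2, then record 4 active steps; the resulting
    # Chrome trace can be inspected for communication/compute overlap.
    with profile(
        activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
        schedule=schedule(wait=2, warmup=2, active=4),
        on_trace_ready=torch.profiler.tensorboard_trace_handler("./traces"),
    ) as prof:
        for _ in range(num_steps):
            train_step()   # placeholder for one Megatron training iteration
            prof.step()
```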