Victor Zhu
Thanks @ptrendx for the information! Yes, in this case we're running with PyTorch FSDP FULL_SHARD on a HF llama model, with the `nn.Linear` layers directly replaced with `te.Linear` (TE v1.2)...
Yes, here's a script reproducing the issue comparing the output of `nn.Linear` BF16 to `te.Linear` FP8 for a single gpu. Please let me know if you see anything wrong w/...
Oh awesome, thanks for the catch and sanity check! I'll look closer in my implementation then, something else must be going wrong.
I actually just re-ran the script with your bias fix in my environment, along with updating the input `x` generation from `rand()` to `randn()`, and I see a greater...
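One rough way to see why the input distribution (`rand()` vs `randn()`) can change the size of the FP8 error is a toy numpy simulation of E4M3 rounding. The 448 max value and 3 mantissa bits below are real E4M3 properties, but the rounding scheme itself is a simplification for illustration only: it ignores subnormals, NaN encoding, and the per-tensor amax scaling that TE actually applies, so it is not a stand-in for `te.Linear`.

```python
import numpy as np

def quantize_e4m3(x, max_val=448.0):
    """Crude simulation of FP8 E4M3 rounding: clamp to the representable
    range, then round the mantissa to 3 bits. Ignores subnormals and NaNs."""
    x = np.clip(x, -max_val, max_val)
    out = np.zeros_like(x)
    nz = x != 0
    e = np.floor(np.log2(np.abs(x[nz])))  # power-of-two exponent per element
    step = 2.0 ** (e - 3)                 # spacing with a 3-bit mantissa
    out[nz] = np.round(x[nz] / step) * step
    return out

def rel_err(x):
    q = quantize_e4m3(x)
    return np.abs(q - x) / np.maximum(np.abs(x), 1e-12)

rng = np.random.default_rng(0)
uniform = rng.random(10_000) + 1e-3    # rand()-like inputs in (0, 1]
normal = rng.standard_normal(10_000)   # randn()-like inputs

print("uniform mean rel err:", rel_err(uniform).mean())
print("normal  mean rel err:", rel_err(normal).mean())
```

With 3 mantissa bits the per-element relative error is bounded by 2^-4 = 6.25% regardless of distribution, but `randn()` produces a wider dynamic range (and sign changes), which is closer to real activations and tends to surface quantization effects that the narrow, all-positive `rand()` inputs can hide.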
Thanks for the responses! I re-ran with recomputation disabled, and also reduced the `num_layers` from 32 -> 16 (due to memory constraints) and still observe a loss difference (though the...
I see, sounds good, will try it! I also ran the bf16/fp8 no-recompute jobs for a bit longer and observe the following: ``` # FP8 iteration 1000/ 20000 |...
I think it depends on your config and hardware. For context, I was using 4 nodes each with 8xH100 for my experiments (you can check my logs above for arguments)....
I think the issue slipped into the TE v1.8 release, as I had the same installation issue; it was resolved by cherry-picking https://github.com/NVIDIA/TransformerEngine/pull/949.