Victor Zhu
Thanks @ptrendx for the information! Yes, in this case we're running with PyTorch FSDP FULL_SHARD on a HF llama model, with the `nn.Linear` layers directly replaced with `te.Linear` (TE v1.2)...
Yes, here's a script reproducing the issue comparing the output of `nn.Linear` BF16 to `te.Linear` FP8 for a single gpu. Please let me know if you see anything wrong w/...
Oh awesome, thanks for the catch and sanity check! I'll look closer in my implementation then, something else must be going wrong.
I actually just re-ran the script with your bias fix in my environment, along with updating the input `x` generation from `rand()` to `randn()`, and I see a greater...
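One rough way to see why the input distribution (`rand()` vs `randn()`) can change the size of the FP8 error is a toy numpy simulation of E4M3 rounding. The 448 max value and 3 mantissa bits below are real E4M3 properties, but the rounding scheme itself is a simplification for illustration only: it ignores subnormals, NaN encoding, and the per-tensor amax scaling that TE actually applies, so it is not a stand-in for `te.Linear`.

```python
import numpy as np

def quantize_e4m3(x, max_val=448.0):
    """Crude simulation of FP8 E4M3 rounding: clamp to the representable
    range, then round the mantissa to 3 bits. Ignores subnormals and NaNs."""
    x = np.clip(x, -max_val, max_val)
    out = np.zeros_like(x)
    nz = x != 0
    e = np.floor(np.log2(np.abs(x[nz])))  # power-of-two exponent per element
    step = 2.0 ** (e - 3)                 # spacing with a 3-bit mantissa
    out[nz] = np.round(x[nz] / step) * step
    return out

def rel_err(x):
    q = quantize_e4m3(x)
    return np.abs(q - x) / np.maximum(np.abs(x), 1e-12)

rng = np.random.default_rng(0)
uniform = rng.random(10_000) + 1e-3    # rand()-like inputs in (0, 1]
normal = rng.standard_normal(10_000)   # randn()-like inputs

print("uniform mean rel err:", rel_err(uniform).mean())
print("normal  mean rel err:", rel_err(normal).mean())
```

With 3 mantissa bits the per-element relative error is bounded by 2^-4 = 6.25% regardless of distribution, but `randn()` produces a wider dynamic range (and sign changes), which is closer to real activations and tends to surface quantization effects that the narrow, all-positive `rand()` inputs can hide.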
Thanks for the responses! I re-ran with recomputation disabled, and also reduced the `num_layers` from 32 -> 16 (due to memory constraints) and still observe a loss difference (though the...
I see, sounds good, will try it! I also ran the bf16/fp8 no-recompute jobs for a bit longer and observe the following: ``` # FP8 iteration 1000/ 20000 |...
I think it depends on your config and hardware. For context, I was using 4 nodes each with 8xH100 for my experiments (you can check my logs above for arguments)....
I think the issue slipped into the TE v1.8 release, as I had the same installation issue; it was resolved by cherry-picking https://github.com/NVIDIA/TransformerEngine/pull/949.