Raghu Ganti
created by @yuanchi2807
The basic utility has been added; turning it into an actual node needs more work.
@klwuibm Suggest that you fill in the rest of the issue template? :)
Thanks @klwuibm !
@RuedigerMoeller Why use such an old version of Jackson?
We have tried this script on AMD GPUs, and it works for LoRA and full fine-tuning. We have not tried bitsandbytes.
Yes, this needs to be updated. The MFU computations for `fp8` are too good to be true :) CC: @lchu-ibm
A barebones example of FSDPv2 is available at https://github.com/pytorch/torchtitan.
@lchu6 Would it make sense to start with MoE only, since MoE is the main thing that complicates the parallelism strategy? We can add Mamba2 later.
@AdnanHoque has been working on benchmarking the kernels that he developed with Less, so it may be worth checking how the comms and compute look for the choices we make...