Optimizer datatype
Hi,
I have some questions related to the paper:
- Which FP8 format (E4M3 / E5M2) do you use for the first Adam moment? Do you use delayed scaling or just-in-time scaling?
- What about the weight gradient - do you use E4M3 with delayed scaling?
Thanks for your attention to our work!
- The datatype of the first moment is FP8-E4M3, and that of the second moment is FP16. Both are scaling tensors with scaling factors, which are computed just in time.
- The weight gradient is an FP8-E4M3 scaling tensor with a just-in-time scaling factor.
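For readers unfamiliar with the term, here is a minimal sketch of just-in-time per-tensor scaling as described above, assuming the standard FP8-E4M3 maximum of 448; the helper names are illustrative and not MS-AMP's actual API:

```python
import torch

E4M3_MAX = 448.0  # largest finite magnitude representable in FP8-E4M3

def just_in_time_scale(t: torch.Tensor) -> torch.Tensor:
    # Derive the scaling factor from the tensor's current absolute maximum,
    # so it is recomputed every step (just-in-time) rather than reused from
    # earlier steps (delayed scaling).
    amax = t.abs().max().clamp(min=1e-12)  # guard against division by zero
    return E4M3_MAX / amax

def quantize_e4m3(t: torch.Tensor):
    # Map the tensor into the FP8-E4M3 representable range; the actual cast to
    # FP8 (or a uint8 bit pattern, see below) would happen after this step.
    scale = just_in_time_scale(t)
    q = (t * scale).clamp(-E4M3_MAX, E4M3_MAX)
    return q, scale

def dequantize(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    # Recover an approximation of the original values.
    return q / scale
```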
Thank you for your answer. So, what is the reason you define the first moment as a uint8 datatype?
https://github.com/Azure/MS-AMP/blob/0a2cd721fa68ee725e3b2fb132df02ddb8069d62/msamp/init.py#L81C9-L81C23
There is no native FP8 datatype in PyTorch yet, so we use uint8 to store FP8-E4M3 values.
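A minimal sketch of that storage idea, assuming a newer PyTorch build that ships `torch.float8_e4m3fn` (on older versions the E4M3 bit pattern has to be produced by hand before storing it in uint8):

```python
import torch

x = torch.randn(4)

# The FP8-E4M3 bit pattern is one byte wide, so a uint8 tensor of the same
# shape can hold it. Newer PyTorch builds expose torch.float8_e4m3fn, which
# makes the round trip easy to demonstrate:
fp8 = x.to(torch.float8_e4m3fn)    # cast to FP8-E4M3
bits = fp8.view(torch.uint8)       # reinterpret the same bytes as uint8 storage
back = bits.view(torch.float8_e4m3fn).to(torch.float32)  # decode again

print(bits)  # raw E4M3 bit patterns
print(back)  # decoded values (lossy relative to x)
```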
Closing the issue since there has been no activity for a long time.