Optimizer datatype
Hi,
I have some questions related to the paper:
- Which FP8 format (E4M3 / E5M2) do you use for the first Adam moment? Do you use delayed scaling or just-in-time scaling?
- What about the weight gradient - do you use E4M3 with delayed scaling?
Thanks for your attention to our work!
- The datatype of the first moment is FP8-E4M3, and that of the second moment is FP16. Both are scaling tensors with scaling factors, which are computed just in time.
- The weight gradient is an FP8-E4M3 scaling tensor with a just-in-time scaling factor.
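For readers unfamiliar with the term, here is a minimal sketch of just-in-time per-tensor scaling as described above, assuming the standard FP8-E4M3 maximum of 448; the helper names are illustrative and not MS-AMP's actual API:

```python
import torch

E4M3_MAX = 448.0  # largest finite magnitude representable in FP8-E4M3

def just_in_time_scale(t: torch.Tensor) -> torch.Tensor:
    # Derive the scaling factor from the tensor's current absolute maximum,
    # so it is recomputed every step (just-in-time) rather than reused from
    # earlier steps (delayed scaling).
    amax = t.abs().max().clamp(min=1e-12)  # guard against division by zero
    return E4M3_MAX / amax

def quantize_e4m3(t: torch.Tensor):
    # Map the tensor into the FP8-E4M3 representable range; the actual cast to
    # FP8 (or a uint8 bit pattern, see below) would happen after this step.
    scale = just_in_time_scale(t)
    q = (t * scale).clamp(-E4M3_MAX, E4M3_MAX)
    return q, scale

def dequantize(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    # Recover an approximation of the original values.
    return q / scale
```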
Thank you for your answer. So, what is the reason you define the first moment as a uint8 datatype?
https://github.com/Azure/MS-AMP/blob/0a2cd721fa68ee725e3b2fb132df02ddb8069d62/msamp/init.py#L81C9-L81C23
There is no native FP8 datatype in PyTorch yet, so we use uint8 to store FP8-E4M3 values.
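A minimal sketch of that storage idea, assuming a newer PyTorch build that ships `torch.float8_e4m3fn` (on older versions the E4M3 bit pattern has to be produced by hand before storing it in uint8):

```python
import torch

x = torch.randn(4)

# The FP8-E4M3 bit pattern is one byte wide, so a uint8 tensor of the same
# shape can hold it. Newer PyTorch builds expose torch.float8_e4m3fn, which
# makes the round trip easy to demonstrate:
fp8 = x.to(torch.float8_e4m3fn)    # cast to FP8-E4M3
bits = fp8.view(torch.uint8)       # reinterpret the same bytes as uint8 storage
back = bits.view(torch.float8_e4m3fn).to(torch.float32)  # decode again

print(bits)  # raw E4M3 bit patterns
print(back)  # decoded values (lossy relative to x)
```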
Closing the issue since there has been no activity for a long time.