Optimizer datatype

Open brianchmiel opened this issue 1 year ago • 3 comments

Hi,

I have some questions related to the paper:

  1. Which FP8 format (E4M3 / E5M2) do you use for the first Adam moment? Do you use delayed scaling or just-in-time scaling?
  2. What about the weight gradient: do you use E4M3 with delayed scaling?

brianchmiel avatar Mar 07 '24 11:03 brianchmiel

Thanks for your attention to our work!

  1. The datatype of the first moment is fp8-e4m3, and that of the second moment is fp16. Both are scaling tensors whose scaling factors are computed just in time.
  2. The weight gradient is an fp8-e4m3 scaling tensor with a just-in-time scaling factor.
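
To make the distinction between the two scaling modes concrete, here is a minimal pure-Python sketch. It is not MS-AMP's code: the names `jit_scale` and `DelayedScale`, the history length, and the fallback scale of 1.0 are all hypothetical, and 448 is assumed to be the largest finite E4M3 value. Just-in-time scaling derives the factor from the tensor being cast, while delayed scaling reuses the absolute maxima recorded in earlier steps.

```python
from collections import deque

FP8_E4M3_MAX = 448.0  # assumed largest finite E4M3 value


def amax(values):
    """Largest absolute value in the tensor (here: a plain list)."""
    return max(abs(v) for v in values)


def jit_scale(values):
    """Just-in-time: the factor comes from the tensor being cast right now."""
    a = amax(values)
    return FP8_E4M3_MAX / a if a > 0 else 1.0


class DelayedScale:
    """Delayed: the factor comes from amax values seen in earlier steps."""

    def __init__(self, history_len=4):  # hypothetical history length
        self.history = deque(maxlen=history_len)

    def scale_for(self, values):
        prev = max(self.history) if self.history else 0.0
        self.history.append(amax(values))  # recorded for later steps
        return FP8_E4M3_MAX / prev if prev > 0 else 1.0


grads = [0.5, -2.0, 1.0]
scale = jit_scale(grads)             # 448 / 2.0
scaled = [g * scale for g in grads]  # largest magnitude now sits at 448
```

With delayed scaling the first call has no history and falls back to 1.0 in this sketch, and a later tensor with a larger range than the recorded one could overflow the format; just-in-time scaling avoids that at the cost of an extra pass over the tensor before the cast.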

wkcn avatar Mar 08 '24 01:03 wkcn

Thank you for your answer. So, why do you define the first moment with the uint8 datatype?

https://github.com/Azure/MS-AMP/blob/0a2cd721fa68ee725e3b2fb132df02ddb8069d62/msamp/init.py#L81C9-L81C23

brianchmiel avatar Mar 10 '24 13:03 brianchmiel

There is no native FP8 datatype in PyTorch yet, so we use uint8 to store FP8-E4M3 values.
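
As an illustration of what storing FP8 in uint8 means, the sketch below treats each byte as an E4M3 bit pattern (1 sign bit, 4 exponent bits, 3 mantissa bits). It assumes the common E4M3 variant with exponent bias 7, no infinities, and a single NaN pattern per sign; the helper names `e4m3_decode` and `e4m3_encode` are hypothetical and this is not MS-AMP's implementation. The encoder simply picks the nearest representable value by searching all codes, which is slow but easy to verify, and it does not implement round-to-nearest-even on ties.

```python
import math


def e4m3_decode(code: int) -> float:
    """Interpret one byte (0..255) as an E4M3 value, assuming bias 7."""
    sign = -1.0 if code & 0x80 else 1.0
    exp = (code >> 3) & 0xF
    man = code & 0x7
    if exp == 0xF and man == 0x7:
        return math.nan  # the only NaN pattern in this variant
    if exp == 0:
        return sign * (man / 8.0) * 2.0 ** -6  # subnormal values
    return sign * (1.0 + man / 8.0) * 2.0 ** (exp - 7)


# Every byte value that decodes to a finite number.
_FINITE_CODES = [c for c in range(256) if not math.isnan(e4m3_decode(c))]


def e4m3_encode(x: float) -> int:
    """Return the byte whose decoded value is nearest to x (saturating)."""
    return min(_FINITE_CODES, key=lambda c: abs(e4m3_decode(c) - x))


# The storage container only ever sees unsigned 8-bit integers.
storage = bytearray(e4m3_encode(v) for v in [1.0, -0.5, 448.0, 1000.0])
restored = [e4m3_decode(b) for b in storage]
```

The container holds plain bytes, so the FP8 meaning lives entirely in the encode and decode routines; that is why a tensor carrying FP8 data can be declared with a uint8 dtype.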

wkcn avatar Mar 10 '24 15:03 wkcn

Closing the issue since there has been no activity for a long time.

tocean avatar Aug 02 '24 10:08 tocean