FP8 not converging during Supervised Fine-Tuning (though BF16 is)
Hey guys,
First of all, I'm super happy with TE, and the speed-up from fp8 is amazing.
Using https://github.com/mosaicml/llm-foundry I was able to pre-train a 1B-parameter model. However, when performing SFT, the loss does not converge. I tried everything, but the train loss kept going up. To my surprise, when I switched from fp8 to BF16, the loss decreased in a stable fashion.
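For context, the fp8 path I'm exercising ultimately goes through TE's autocast; below is a minimal sketch of the toggle I'm effectively comparing, not my actual foundry setup (the real run is driven by foundry's YAML config, and the `DelayedScaling` settings here are placeholders, not necessarily what foundry uses internally):

```python
# Minimal sketch (assumptions: plain te.Linear layer, placeholder recipe settings),
# just to illustrate the fp8 vs. bf16 toggle being compared.
import torch
import transformer_engine.pytorch as te
from transformer_engine.common.recipe import DelayedScaling, Format

use_fp8 = True  # flipping this to False gives the bf16 run that converges for me

fp8_recipe = DelayedScaling(
    fp8_format=Format.HYBRID,   # e4m3 forward, e5m2 backward
    amax_history_len=16,
    amax_compute_algo="max",
)

layer = te.Linear(1024, 1024, params_dtype=torch.bfloat16).cuda()
x = torch.randn(32, 1024, device="cuda", dtype=torch.bfloat16)

# FP8 is only active inside the autocast context; with enabled=False this
# runs the same layer in bf16.
with te.fp8_autocast(enabled=use_fp8, fp8_recipe=fp8_recipe):
    y = layer(x)

loss = y.float().pow(2).mean()
loss.backward()
```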
This could very well be an issue with foundry, but this also seems like an appropriate place to ask. Would anyone know why this is happening?
(Grey is fp8, purple is bf16)
Hi, could you give more details about this experiment? Is it using some public model architecture and dataset (so that we could try reproducing it) or a proprietary one?