Precision gap between NeMo model and Hugging Face model

Describe the bug

We are using NeMo to train our large vision-language model. After converting the model from NeMo format to Hugging Face format, we found that, given the same inputs and weights, the two models produce different outputs.

We verified that the outputs already differ after layer normalization, even with the same weights and input. NeMo uses Transformer Engine and the code below for this computation:

[Screenshot: Feishu 20240508-190258]

Hugging Face, on the other hand, uses PyTorch.

I also found small precision gaps in the rotary positional embedding, attention, and FFN.
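For readers who want to see how a gap of this kind can arise, here is a minimal sketch in pure Python. It is not the Transformer Engine kernel or the PyTorch kernel; it only assumes that two mathematically equivalent layer norms round their intermediate results at different points. The helper names (`f32`, `layer_norm_f32_steps`, `layer_norm_f64_accum`) and the hidden size of 4096 are made up for illustration.

```python
import math
import random
import struct


def f32(x: float) -> float:
    """Round a Python float (float64) to the nearest float32 value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def layer_norm_f32_steps(xs, eps=1e-5):
    """Layer norm (no scale/bias) with every intermediate rounded to float32."""
    n = len(xs)
    total = 0.0
    for x in xs:
        total = f32(total + x)
    mean = f32(total / n)
    sq = 0.0
    for x in xs:
        d = f32(x - mean)
        sq = f32(sq + f32(d * d))
    var = f32(sq / n)
    inv_std = f32(1.0 / f32(math.sqrt(f32(var + eps))))
    return [f32(f32(x - mean) * inv_std) for x in xs]


def layer_norm_f64_accum(xs, eps=1e-5):
    """Same formula, but float64 intermediates and one final rounding to float32."""
    n = len(xs)
    mean = math.fsum(xs) / n
    var = math.fsum((x - mean) ** 2 for x in xs) / n
    inv_std = 1.0 / math.sqrt(var + eps)
    return [f32((x - mean) * inv_std) for x in xs]


random.seed(0)
# Hypothetical hidden state: 4096 float32 values drawn from N(0, 1).
hidden = [f32(random.gauss(0.0, 1.0)) for _ in range(4096)]
a = layer_norm_f32_steps(hidden)
b = layer_norm_f64_accum(hidden)
max_abs_diff = max(abs(p - q) for p, q in zip(a, b))
print(f"max abs diff: {max_abs_diff:.3e}")
```

Both functions use the same input and the same formula, yet the outputs are not bit-identical. Lower-precision formats such as bf16 or fp16 would be expected to widen the difference.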

Expected behavior

Is the precision gap caused by the different operator implementations? How can I fix it?

Thank you!

May 08 '24 11:05 ChencongZJU

Hi, we are aware that some TE implementations won't generate identical results to those of HF (which uses native PyTorch). We use our fused version of operations to speed up training. It seems you are using the Llama model as a foundation model. NeMo thoroughly tests Llama models to ensure that even though the results are not bit-wise matching, the overall performance (benchmarks) is on par.

If you have more concerns about the behavior, please provide us with more details. What model are you converting, what command are you using, and how large is the gap? We can check whether the gap is reasonable.
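One way to answer "how large is the gap" is to report a few summary numbers for the two outputs. The sketch below is only an illustration: `gap_report` is a hypothetical helper, not part of NeMo or Hugging Face, and the sample values are invented. It works on flat lists of floats; with PyTorch tensors, something like `tensor.float().flatten().tolist()` should produce suitable input.

```python
import math


def gap_report(reference, candidate):
    """Summarize the elementwise gap between two equally sized lists of floats."""
    if len(reference) != len(candidate):
        raise ValueError("outputs must have the same number of elements")
    abs_diffs = [abs(r - c) for r, c in zip(reference, candidate)]
    max_abs = max(abs_diffs)
    mean_abs = sum(abs_diffs) / len(abs_diffs)
    # Scale by the largest reference magnitude to avoid dividing by values near 0.
    scale = max(abs(r) for r in reference) or 1.0
    dot = sum(r * c for r, c in zip(reference, candidate))
    norm_r = math.sqrt(sum(r * r for r in reference))
    norm_c = math.sqrt(sum(c * c for c in candidate))
    cosine = dot / (norm_r * norm_c) if norm_r and norm_c else float("nan")
    return {
        "max_abs": max_abs,
        "mean_abs": mean_abs,
        "max_rel": max_abs / scale,
        "cosine": cosine,
    }


# Invented sample outputs, standing in for the same layer in both models.
hf_out = [0.1250, -1.5000, 2.0000, 0.7500]
nemo_out = [0.1250, -1.5000, 2.0078, 0.7500]
report = gap_report(hf_out, nemo_out)
print(report)
```

Posting these numbers per layer (layer norm, rotary embedding, attention, FFN), together with the dtype used, makes it easier to judge whether the gap is within what rounding differences would explain.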

May 08 '24 15:05 yaoyu-33

Thanks for your patient reply. We have also verified that the precision gap doesn't affect performance.

May 16 '24 06:05 ChencongZJU