Precision problem between NeMo model and Hugging Face model

Open ChencongZJU opened this issue 1 year ago • 1 comments

Describe the bug

We are using NeMo to train our large vision-language model. After converting the model from NeMo format to Hugging Face format, we found that the same inputs and weights produce different outputs.

We verified that the outputs already differ after layer normalization, even with the same weights and input. NeMo uses Transformer Engine and the code below for this computation:

[Image: Feishu screenshot 20240508-190258]

And Hugging Face uses PyTorch:

[Image]

I also found small precision gaps in the rotary positional embedding, attention, and FFN.

Expected behavior

Is the precision gap caused by the different operator implementations? How can I fix it?

Thank you!

ChencongZJU avatar May 08 '24 11:05 ChencongZJU

Hi, we are aware that some TE implementations won't generate identical results to those of HF (which uses native PyTorch). We use our fused version of operations to speed up training. It seems you are using the Llama model as a foundation model. NeMo thoroughly tests Llama models to ensure that even though the results are not bit-wise matching, the overall performance (benchmarks) is on par.
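One common reason fused kernels are not bit-wise identical to a reference implementation is that floating-point addition is not associative, so a different reduction order can change the last bits of a sum. The sketch below illustrates this with a plain-Python layer normalization that emulates float32 rounding. It is not the Transformer Engine or PyTorch kernel: the helper names (`f32`, `sum_sequential`, `sum_chunked`, `layer_norm`), the chunk size, and the hidden size of 4096 are all assumptions made for illustration.

```python
import math
import random
import struct


def f32(x: float) -> float:
    """Round a Python float (float64) to the nearest float32 value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def sum_sequential(values):
    """Left-to-right float32 accumulation, like a simple loop."""
    total = 0.0
    for v in values:
        total = f32(total + v)
    return total


def sum_chunked(values, chunk=32):
    """Float32 accumulation of per-chunk partial sums, like a parallel reduction."""
    partials = [
        sum_sequential(values[i:i + chunk])
        for i in range(0, len(values), chunk)
    ]
    return sum_sequential(partials)


def layer_norm(x, reduce_sum, eps=1e-5):
    """Layer normalization without affine parameters, using the given reduction."""
    n = len(x)
    mean = f32(reduce_sum(x) / n)
    centered = [f32(v - mean) for v in x]
    var = f32(reduce_sum([f32(c * c) for c in centered]) / n)
    inv_std = f32(1.0 / math.sqrt(var + eps))
    return [f32(c * inv_std) for c in centered]


# Hypothetical input: one hidden-state vector of size 4096.
random.seed(0)
hidden = [f32(random.gauss(0.0, 1.0)) for _ in range(4096)]

out_a = layer_norm(hidden, sum_sequential)
out_b = layer_norm(hidden, sum_chunked)

max_abs_diff = max(abs(a - b) for a, b in zip(out_a, out_b))
print(f"max abs diff between reduction orders: {max_abs_diff:.3e}")
```

Both versions compute the same mathematical function on the same input. Any difference printed comes only from the order in which the partial sums are rounded, which is why such gaps are typically tiny.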

If you have more concerns about the behavior, please provide us with more details. What model are you converting, what command are you using, and how large is the gap? We can check whether the gap is reasonable.
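To answer "how large is the gap" concretely, it helps to report the maximum absolute and relative differences between the two outputs rather than only noting that they differ. A minimal sketch follows, assuming both outputs have been flattened to lists of Python floats. The function name `gap_report`, the sample values, and the tolerances are hypothetical and should be chosen to match the dtype in use.

```python
def gap_report(reference, candidate, atol=1e-5, rtol=1e-3):
    """Summarize the element-wise gap between two flat lists of floats."""
    if len(reference) != len(candidate):
        raise ValueError("reference and candidate must have the same length")

    abs_diffs = [abs(r - c) for r, c in zip(reference, candidate)]
    # Guard against division by zero when a reference value is exactly 0.
    rel_diffs = [d / max(abs(r), 1e-12) for d, r in zip(abs_diffs, reference)]
    # Same style of check as a mixed absolute/relative tolerance test.
    within = all(d <= atol + rtol * abs(r) for d, r in zip(abs_diffs, reference))

    return {
        "max_abs": max(abs_diffs),
        "max_rel": max(rel_diffs),
        "within_tolerance": within,
    }


# Hypothetical values standing in for one layer's output from each framework.
hf_out = [0.1234, -1.5, 2.25]
nemo_out = [0.1235, -1.5002, 2.2497]

report = gap_report(hf_out, nemo_out, atol=1e-3, rtol=1e-3)
print(report)
```

Running the same report after each layer (normalization, rotary embedding, attention, FFN) shows where the gap first appears and whether it grows with depth.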

yaoyu-33 avatar May 08 '24 15:05 yaoyu-33

Thanks for your patient reply. We also tested and confirmed that the precision gap doesn't affect performance.

ChencongZJU avatar May 16 '24 06:05 ChencongZJU