NaN on ARM with FP16, but correct results on x86_64 Linux
The model consists of operations like ['ExpandDims', 'Split', 'Convolution1D', 'Permute', 'UnaryOp', 'BinaryOp', 'MemoryData', 'MatMul', 'Clip', 'Reduction', 'Reshape', 'Gemm', 'Concat', 'Slice', 'Softmax', 'LayerNorm', 'GELU', 'InnerProduct'].
On x86_64 Linux, ncnn works perfectly with fp16. However, on ARM the final results are NaN. When I disable fp16 storage with net.opt.use_fp16_storage = false, the results are correct, but inference is slower.
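One common cause of NaN that appears only in fp16 mode (and only on backends that actually compute in fp16, like ARM NEON) is an intermediate activation overflowing float16's range: the largest finite float16 value is 65504, which ops like MatMul, LayerNorm, or Softmax logits can easily exceed. A minimal numpy sketch of the failure mode (illustrative values, not taken from this model):

```python
import numpy as np

# float16 has a far smaller range than float32: max finite value is 65504.
x = np.float32(70000.0)          # fine in float32
h = np.float16(x)                # overflows to inf when stored as float16
print(h)                         # inf

# Once an inf appears, operations like (inf - inf) or (inf / inf) --
# common inside LayerNorm/Softmax -- produce NaN, which then propagates
# through every downstream layer to the final output.
print(h - h)                     # nan
```

If this is the cause, keeping only the sensitive layers in fp32 (or rescaling inputs/weights so intermediates stay below 65504) can restore correctness without giving up fp16 everywhere.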
As an alternative, I tested onnxruntime on ARM; it runs as fast as ncnn with fp16 and produces correct results.
Does onnxruntime optimize the model better, or could there be another reason? I'm aiming to make ncnn work with fp16 but am unsure how to debug the issue. Any suggestions would be greatly appreciated.
Thank you for your assistance.
Hi, please provide the problematic model files (param and bin).
You can also extract the intermediate blobs and observe which operator first produces the NaN result.
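Once each intermediate blob has been dumped (for example via ncnn::Extractor::extract on each blob name, saved out as .npy files; the filenames below are hypothetical), a small script can locate the first blob containing NaN or Inf, which pinpoints the offending operator:

```python
import numpy as np

def first_bad_blob(blob_files):
    """Return the first dumped blob file containing NaN/Inf, else None.

    blob_files should be ordered by network execution order, so the
    first hit corresponds to the first operator that went wrong.
    """
    for path in blob_files:
        arr = np.load(path)
        if not np.all(np.isfinite(arr)):
            return path
    return None
```

Comparing that blob against the fp32 run of the same layer usually shows whether the problem is an fp16 range overflow or a genuine kernel bug worth reporting.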
For the various problems that arise in onnx model conversion, it is recommended to use the latest pnnx tool to convert your model to ncnn:
pip install pnnx
pnnx model.onnx inputshape=[1,3,224,224]
Detailed reference documentation: https://github.com/pnnx/pnnx and https://github.com/Tencent/ncnn/wiki/use-ncnn-with-pytorch-or-onnx#how-to-use-pnnx