FasterTransformer

FP16 inference of the PyTorch encoder gives different outputs for different batch sizes

Open woskii opened this issue 3 years ago • 1 comment

System and software

- FasterTransformer version: v4.0
- GPU: T4
- CUDA: 11.0
- PyTorch: 1.8

Issue description

I have an FP16 BERT model built from 12 FasterTransformer encoders. When I run inference on the same input, the model's outputs differ across batch sizes. The difference is on the order of 1e-3.

batch size = 1, part of the 12th CustomEncoder's output: 0.4175 -0.4004 1.26 -0.03006 0.1311 1.4375

batch size = 3, part of the 12th CustomEncoder's output: 0.416 -0.3975 1.264 -0.0305 0.1283 1.439

How can I get deterministic results across different batch sizes?
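For context, a minimal sketch of the kind of comparison that surfaces this. A plain `torch.nn.TransformerEncoder` is used here as a hypothetical stand-in for the FasterTransformer CustomEncoder stack (which is not reproduced), and the sizes (`d_model=768`, `seq_len=128`) are assumed example values:

```python
import torch

# Hypothetical stand-in for the 12-layer CustomEncoder stack; the real
# FasterTransformer module is assumed here, not reproduced.
model = torch.nn.TransformerEncoder(
    torch.nn.TransformerEncoderLayer(d_model=768, nhead=12),
    num_layers=12,
).half().cuda().eval()

# (seq_len, batch, hidden) layout; seq_len=128 is an assumed example size
x = torch.randn(128, 1, 768, device="cuda", dtype=torch.float16)

with torch.no_grad():
    out_bs1 = model(x)                          # batch size 1
    out_bs3 = model(x.repeat(1, 3, 1))[:, 0:1]  # same input inside batch size 3

# The two outputs may differ in the low bits, typically around 1e-3 in fp16
print((out_bs1 - out_bs3).abs().max().item())
```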

woskii avatar Aug 11 '22 02:08 woskii

This is expected behavior: cuBLAS may choose a different algorithm to compute the GEMM when the shapes are different.
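A minimal sketch of the underlying effect (not FasterTransformer code; plain PyTorch fp16 matmuls standing in for the encoder's GEMMs):

```python
import torch

torch.manual_seed(0)
w = torch.randn(768, 768, device="cuda", dtype=torch.float16)
x = torch.randn(3, 128, 768, device="cuda", dtype=torch.float16)

# One GEMM over the whole batch: cuBLAS sees an m = 3*128 = 384 problem.
full = (x.reshape(-1, 768) @ w).reshape(3, 128, 768)

# The same rows as an m = 128 problem: cuBLAS may pick a different
# algorithm (different tiling / accumulation order), so the fp16
# results can differ in the low-order bits.
single = x[0] @ w

print((full[0] - single).abs().max().item())  # may be nonzero, ~1e-3 scale
```

Because the algorithm choice depends on the problem shape, run-to-run determinism settings do not make batch size 1 and batch size 3 bitwise identical; the usual approach is to compare with a tolerance (e.g. `torch.allclose` with an `atol` suited to fp16) rather than expecting exact equality.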

byshiue avatar Aug 11 '22 02:08 byshiue

Closing this bug because it is inactive. Feel free to re-open it if you still have any problems.

byshiue avatar Dec 02 '22 14:12 byshiue