FP16 inference of the PyTorch encoder gives different outputs for different batch sizes
System and software
FasterTransformer version: v4.0
GPU: T4
CUDA: 11.0
PyTorch: 1.8
Issue description
I have an fp16 BERT model built from 12 FasterTransformer encoder layers. When I run inference on the same input, the model outputs differ depending on the batch size. The difference is on the order of 1e-3.
batch size = 1, part of the output after the 12 CustomEncoder layers: 0.4175 -0.4004 1.26 -0.03006 0.1311 1.4375
batch size = 3, part of the output after the 12 CustomEncoder layers: 0.416 -0.3975 1.264 -0.0305 0.1283 1.439
How can I get deterministic results across different batch sizes?
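For reference, a minimal sketch of how the difference can be quantified. The model constructor and input shapes below are illustrative placeholders, not the actual code:

```python
import torch

# Hypothetical fp16 encoder stack; replace with the actual CustomEncoder-based model.
model = build_fp16_bert_encoder().cuda().eval()

seq_len, hidden = 32, 768
sample = torch.randn(1, seq_len, hidden, dtype=torch.float16, device="cuda")

with torch.no_grad():
    out_bs1 = model(sample)                      # same input at batch size 1
    out_bs3 = model(sample.repeat(3, 1, 1))[:1]  # same input inside a batch of 3

# Compare the rows that correspond to the identical input.
diff = (out_bs1.float() - out_bs3.float()).abs()
print(f"max abs diff: {diff.max().item():.6f}")  # observed to be around 1e-3
```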
This is expected behavior: cuBLAS may choose a different algorithm to compute the GEMM when the problem shapes differ, and in fp16 a different accumulation order produces small rounding differences, on the order of 1e-3 as you observe.
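The effect can be reproduced even without FasterTransformer: a plain fp16 matmul on the GPU is not guaranteed to give bit-identical results for the same row once it is part of a differently shaped batch, because a different kernel (and thus summation order) may be selected. A small sketch under that assumption:

```python
import torch

torch.manual_seed(0)
x = torch.randn(1, 128, 768, dtype=torch.float16, device="cuda")
w = torch.randn(768, 768, dtype=torch.float16, device="cuda")

y1 = x @ w                  # GEMM with batch size 1
y3 = x.repeat(3, 1, 1) @ w  # the same data as part of a batch of 3

# The first sample is mathematically identical in both cases, but cuBLAS may pick
# a different kernel for the larger problem size, so fp16 rounding can differ.
print(torch.equal(y1[0], y3[0]))                    # may be False
print((y1[0].float() - y3[0].float()).abs().max())  # small but possibly nonzero
```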
Closing this issue because it is inactive. Feel free to reopen it if you still have any problems.