Not able to observe any speedup on an NVIDIA T4 (Turing arch)
I trained a model with mixed precision using Apex amp in PyTorch on an RTX 2070. I am now trying to run inference with the network on an NVIDIA T4 and I am not able to observe any speedup. On checking the state_dict of the checkpoint, the weights all appear to be full-precision torch.Tensors and not torch.HalfTensor, although I can see that the saved model is smaller in the mixed-precision version than in FP32. To run inference I just loaded the amp state dict and called amp.initialize. Is there anything else I need to do for it to run in FP16?
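For reference, this is roughly my inference setup (a minimal sketch: `MyModel`, the checkpoint path, the `"state_dict"` key, and the `O2` opt level are placeholders for my actual code, and I'm assuming Apex's `amp.initialize` called without an optimizer, which returns just the model):

```python
import torch
from apex import amp

# Hypothetical model class and checkpoint path, for illustration only
model = MyModel().cuda().eval()

# Load the weights saved from the amp training run (the checkpoint
# layout is a placeholder; load_state_dict copies values into the
# existing parameters)
checkpoint = torch.load("checkpoint.pt")
model.load_state_dict(checkpoint["state_dict"])

# Let amp cast/patch the model; with no optimizer passed,
# initialize() returns only the model
model = amp.initialize(model, opt_level="O2")

with torch.no_grad():
    x = torch.randn(1, 3, 224, 224, device="cuda")
    out = model(x)
```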
I used the built-in PyTorch profiler on a forward pass. All operations consistently take longer with FP16 (Apex) than with FP32; the quantity I compared was the average CUDA time per op. Given that the model size has shrunk but the runtime has increased, I suspect there is some other step needed to make the T4 actually execute these operations in FP16. Any help on that front would be greatly appreciated. Thank you.
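For context, this is roughly how I collected those numbers (a sketch using `torch.autograd.profiler`; the input shape is a placeholder, and I sort the table by total CUDA time, though the printed table also shows the per-call CUDA time average I compared):

```python
import torch
from torch.autograd import profiler

x = torch.randn(1, 3, 224, 224, device="cuda")

# Warm up so one-time costs (context creation, cudnn autotuning)
# don't end up in the measurement
with torch.no_grad():
    for _ in range(10):
        model(x)
torch.cuda.synchronize()

with torch.no_grad(), profiler.profile(use_cuda=True) as prof:
    model(x)
torch.cuda.synchronize()

# Per-op summary table; includes a "CUDA time avg" column
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=20))
```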
Second that. I am seeing the Apex O3 setting come out a few ms slower at inference time than simply calling .half() on the model.
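Roughly how I timed the two paths, in case it matters (a sketch: `MyModel` and the input shape are placeholders; timing is done with CUDA events after a warmup):

```python
import torch
from apex import amp

def time_forward(model, x, iters=100):
    # Warm up before timing
    with torch.no_grad():
        for _ in range(10):
            model(x)
    torch.cuda.synchronize()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    with torch.no_grad():
        for _ in range(iters):
            model(x)
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / iters  # avg ms per forward pass

x = torch.randn(1, 3, 224, 224, device="cuda").half()

# Plain FP16: cast the weights by hand
half_model = MyModel().cuda().eval().half()
print(".half():", time_forward(half_model, x), "ms")

# Apex O3: "pure" FP16 through amp
o3_model = amp.initialize(MyModel().cuda().eval(), opt_level="O3")
print("apex O3:", time_forward(o3_model, x), "ms")
```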
@mkolod is there anything I am missing here?
Any update on this issue? I am still seeing the same behavior in 2024.