Not able to observe any speedup on an NVIDIA T4 (Turing arch)
I trained a model with mixed precision using Apex amp in PyTorch on an RTX 2070. I am now trying to run inference with the network on an NVIDIA T4 and I am not able to observe any speedup. On checking the state_dict of the checkpoint, the weights all appear to be full-precision torch.Tensors and not torch.HalfTensor, although I can see that the saved model is smaller in the mixed-precision version than in FP32. To run inference I just loaded the amp state dict and called amp.initialize. Is there anything else I need to do for it to run in FP16?
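For reference, this is roughly my inference setup (a minimal sketch: `MyModel`, the checkpoint path, the `"state_dict"` key, and the `O2` opt level are placeholders for my actual code, and I'm assuming Apex's `amp.initialize` called without an optimizer, which returns just the model):

```python
import torch
from apex import amp

# Hypothetical model class and checkpoint path, for illustration only
model = MyModel().cuda().eval()

# Load the weights saved from the amp training run (the checkpoint
# layout is a placeholder; load_state_dict copies values into the
# existing parameters)
checkpoint = torch.load("checkpoint.pt")
model.load_state_dict(checkpoint["state_dict"])

# Let amp cast/patch the model; with no optimizer passed,
# initialize() returns only the model
model = amp.initialize(model, opt_level="O2")

with torch.no_grad():
    x = torch.randn(1, 3, 224, 224, device="cuda")
    out = model(x)
```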
I used the built-in PyTorch profiler on a forward pass. All operations consistently take longer with FP16 (Apex) than with FP32; the quantity I compared was the average CUDA time per op. Given that the model size has shrunk but the runtime has increased, I suspect there is some other step needed to make the T4 actually execute these operations in FP16. Any help on that front would be greatly appreciated. Thank you.
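For context, this is roughly how I collected those numbers (a sketch using `torch.autograd.profiler`; the input shape is a placeholder, and I sort the table by total CUDA time, though the printed table also shows the per-call CUDA time average I compared):

```python
import torch
from torch.autograd import profiler

x = torch.randn(1, 3, 224, 224, device="cuda")

# Warm up so one-time costs (context creation, cudnn autotuning)
# don't end up in the measurement
with torch.no_grad():
    for _ in range(10):
        model(x)
torch.cuda.synchronize()

with torch.no_grad(), profiler.profile(use_cuda=True) as prof:
    model(x)
torch.cuda.synchronize()

# Per-op summary table; includes a "CUDA time avg" column
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=20))
```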
Second that. I am seeing the Apex O3 setting come out a few ms slower at inference time than simply calling .half() on the model.
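Roughly how I timed the two paths, in case it matters (a sketch: `MyModel` and the input shape are placeholders; timing is done with CUDA events after a warmup):

```python
import torch
from apex import amp

def time_forward(model, x, iters=100):
    # Warm up before timing
    with torch.no_grad():
        for _ in range(10):
            model(x)
    torch.cuda.synchronize()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    with torch.no_grad():
        for _ in range(iters):
            model(x)
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / iters  # avg ms per forward pass

x = torch.randn(1, 3, 224, 224, device="cuda").half()

# Plain FP16: cast the weights by hand
half_model = MyModel().cuda().eval().half()
print(".half():", time_forward(half_model, x), "ms")

# Apex O3: "pure" FP16 through amp
o3_model = amp.initialize(MyModel().cuda().eval(), opt_level="O3")
print("apex O3:", time_forward(o3_model, x), "ms")
```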
@mkolod is there anything I am missing here?
Any update on this issue? I am still seeing the same behavior in 2024.