PyTorch with mixed precision training is much slower than native float32?
My hardware is: RTX 3090 + AMD 3900X + 128 GB RAM. The software is: graphics driver 455.23.05 with CUDA 11.1, PyTorch 1.7, Python 3.8 on Debian 10. I installed apex from a GitHub clone with the command line "pip install --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" ./". PyTorch is installed in Anaconda.
I followed the tutorial that adds the initialization code:
from apex import amp

opt_level = 'O1'
model, optimizer = amp.initialize(model, optimizer, opt_level=opt_level)
and replaced loss.backward() with:
with amp.scale_loss(loss_gpu, optimizer) as scaled_loss:
    scaled_loss.backward()
The code runs, with a warning 'Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to …', and memory usage does drop to about half. But the training time per epoch is even longer than with native float32 training: about 80 s per epoch in native mode versus 120+ s with mixed precision.
So please tell me where it goes wrong: is it my usage of apex, or a hardware issue? Many thanks!
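For comparison, PyTorch 1.6+ also ships a native AMP API (torch.cuda.amp) that does not require apex. Below is a minimal sketch of that pattern, not the poster's code; the tiny model, optimizer, and dummy data are invented placeholders just to make it self-contained and runnable.

import torch
import torch.nn as nn

# Hypothetical tiny model and dummy data, only so the pattern runs end to end.
model = nn.Sequential(nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(),
                      nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, 10)).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
criterion = nn.CrossEntropyLoss()
scaler = torch.cuda.amp.GradScaler()

for step in range(10):
    inputs = torch.randn(32, 3, 224, 224, device="cuda")
    targets = torch.randint(0, 10, (32,), device="cuda")
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():    # forward pass runs in mixed precision
        loss = criterion(model(inputs), targets)
    scaler.scale(loss).backward()      # scaled backward to avoid fp16 underflow
    scaler.step(optimizer)             # unscales grads; skips the step on inf/nan
    scaler.update()                    # adjusts the loss scale for the next step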
I came across the same problem! Puzzled!
Same problem, on two setups.
Hardware: V100. Software: graphics driver 455.23.05 with CUDA 10.1, PyTorch 1.4, Python 3.6 on Ubuntu 18.04.
Hardware: A100. Software: graphics driver 455.23.05 with CUDA 11, PyTorch 1.7, Python 3.6 on Ubuntu 18.04.
Same problem! DDP + mixed precision, RTX 3090, graphics driver 455.38, CUDA 11, PyTorch 1.7, Python 3.8 on Ubuntu 18.04.
Same problem with mixed precision here, with a CNN on a Linux system: Nvidia 4060 Ti, CUDA 12.8, Python 3.11, torch 2.6.0, latest Nvidia driver.
Mixed precision with PyTorch is slower than native float32 by a factor of 1.35.
Just for comparison: Keras 3.9.1, TensorFlow 2.19, same CNN model, same input data => a performance improvement (!!) by a factor of 2.8 with mixed precision, enabled by just one Keras command.
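For reference, and assuming the "one Keras command" above refers to the global dtype policy, the switch looks roughly like this (model definition and training code omitted):

import keras

# Set the global policy so layers compute in float16 while keeping float32 variables.
keras.mixed_precision.set_global_policy("mixed_float16")

# It is common to keep the final output layer in float32 for numerical stability, e.g.:
# outputs = keras.layers.Dense(10, activation="softmax", dtype="float32")(x)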