PyTorch with mixed precision training is much slower than native float32?
My hardware is: RTX 3090 + AMD 3900X + 128 GB RAM. The software is: graphics driver 455.23.05 with CUDA 11.1, PyTorch 1.7, Python 3.8 on Debian 10. I installed apex from a GitHub clone with the command line "pip install --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" ./". PyTorch is installed in Anaconda.
I followed the tutorial that adds the initialization code:
from apex import amp

opt_level = 'O1'
model, optimizer = amp.initialize(model, optimizer, opt_level=opt_level)
and replaced loss.backward() with:
with amp.scale_loss(loss_gpu, optimizer) as scaled_loss:
    scaled_loss.backward()
The code runs, with a warning 'Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to …', and memory usage does drop to about half. But the training time per epoch is even longer than with native float32 training: about 80 s per epoch in native mode versus 120+ s with mixed precision.
So please tell me where it goes wrong: is it my usage of apex, or a hardware issue? Many thanks!
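For comparison, PyTorch 1.6+ also ships a native AMP API (torch.cuda.amp) that does not require apex. Below is a minimal sketch of that pattern, not the poster's code; the tiny model, optimizer, and dummy data are invented placeholders just to make it self-contained and runnable.

import torch
import torch.nn as nn

# Hypothetical tiny model and dummy data, only so the pattern runs end to end.
model = nn.Sequential(nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(),
                      nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, 10)).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
criterion = nn.CrossEntropyLoss()
scaler = torch.cuda.amp.GradScaler()

for step in range(10):
    inputs = torch.randn(32, 3, 224, 224, device="cuda")
    targets = torch.randint(0, 10, (32,), device="cuda")
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():    # forward pass runs in mixed precision
        loss = criterion(model(inputs), targets)
    scaler.scale(loss).backward()      # scaled backward to avoid fp16 underflow
    scaler.step(optimizer)             # unscales grads; skips the step on inf/nan
    scaler.update()                    # adjusts the loss scale for the next step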
I came across the same problem! Puzzled!
Same problem, on two setups.
Hardware: V100. Software: graphics driver 455.23.05 with CUDA 10.1, PyTorch 1.4, Python 3.6 on Ubuntu 18.04.
Hardware: A100. Software: graphics driver 455.23.05 with CUDA 11, PyTorch 1.7, Python 3.6 on Ubuntu 18.04.
Same problem! DDP + mixed precision, RTX 3090, graphics driver 455.38, CUDA 11, PyTorch 1.7, Python 3.8 on Ubuntu 18.04.
Same problem with mixed precision here, with a CNN on a Linux system: Nvidia 4060 Ti, CUDA 12.8, Python 3.11, torch 2.6.0, latest Nvidia driver.
Mixed precision with PyTorch is slower than native float32 by a factor of 1.35.
Just for comparison: Keras 3.9.1, TensorFlow 2.19, same CNN model, same input data => a performance improvement (!!) by a factor of 2.8 with mixed precision, enabled by just one Keras command.
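For reference, and assuming the "one Keras command" above refers to the global dtype policy, the switch looks roughly like this (model definition and training code omitted):

import keras

# Set the global policy so layers compute in float16 while keeping float32 variables.
keras.mixed_precision.set_global_policy("mixed_float16")

# It is common to keep the final output layer in float32 for numerical stability, e.g.:
# outputs = keras.layers.Dense(10, activation="softmax", dtype="float32")(x)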