Training loss gradients slow down with multiple calls in quick succession
Computing the gradients of a model's training loss repeatedly in a loop leads to a significant drop in performance compared to when the gradient calls are spaced out in time. This seems to be specific to GPflow: a similar test on a pure TensorFlow loss function didn't show the same behaviour.
To reproduce
Minimal, reproducible example
import time

import numpy as np
import tensorflow as tf

import gpflow

# Setup model
inputs = np.random.randn(1000, 5)
targets = np.random.randn(1000, 1)
kernel = gpflow.kernels.SquaredExponential()
model = gpflow.models.GPR((inputs, targets), kernel=kernel)

# Evaluate loss once to initialise TensorFlow
model.training_loss()

# Repeated calls with minimal wait between - SLOW
print('Calls with minimal wait between:')
for _ in range(10):
    t0 = time.time()
    with tf.GradientTape() as g:
        loss = model.training_loss()
    dldp = g.gradient(loss, model.trainable_variables)
    print('\tComputation took %.3f milliseconds' % ((time.time() - t0) * 1000))
    time.sleep(0.02)

time.sleep(1)

# Repeated calls with longer wait between - FAST
print('\nCalls with longer wait between:')
for _ in range(10):
    t0 = time.time()
    with tf.GradientTape() as g:
        loss = model.training_loss()
    dldp = g.gradient(loss, model.trainable_variables)
    print('\tComputation took %.3f milliseconds' % ((time.time() - t0) * 1000))
    time.sleep(0.2)
Output

Calls with minimal wait between:
    Computation took 28.922 milliseconds
    Computation took 146.606 milliseconds
    Computation took 135.637 milliseconds
    Computation took 128.655 milliseconds
    Computation took 129.651 milliseconds
    Computation took 128.657 milliseconds
    Computation took 127.656 milliseconds
    Computation took 129.652 milliseconds
    Computation took 130.649 milliseconds
    Computation took 128.654 milliseconds

Calls with longer wait between:
    Computation took 23.936 milliseconds
    Computation took 23.937 milliseconds
    Computation took 23.935 milliseconds
    Computation took 24.935 milliseconds
    Computation took 24.935 milliseconds
    Computation took 23.935 milliseconds
    Computation took 24.934 milliseconds
    Computation took 24.934 milliseconds
    Computation took 24.933 milliseconds
    Computation took 24.933 milliseconds
Expected behaviour
Subsequent gradient calls should take (roughly) the same length of time.
System information
- GPflow version: 2.0.5
- GPflow installed from: pip install gpflow
- TensorFlow version: v2.2.0
- Python version: 3.7.6
- Operating system: Windows 10
That is very odd. Can't think of anything off the top of my head that could cause it though, given that you're evaluating a sequence of tf operations...
For speed, it's recommended to use tf.function() to avoid doing everything in eager mode. Does the same thing happen then?
Thanks for the suggestion. I have tried extracting
@tf.function
def compute_gradient(model_):
    with tf.GradientTape() as g:
        loss_flow = model_.training_loss()
    return g.gradient(loss_flow, model_.trainable_variables)
and calling this function in the loop, but that doesn't make any difference (other than a minor speed-up on the quick evaluations).
I have tried the same code on three separate machines, each with a different GPU, and got the same results, so it does seem to be reproducible. Interestingly, if I increase the size of the training set then I have to increase the sleep time between loop iterations to see the unaffected times; perhaps the data is being moved/cleared in the background, which is affecting subsequent calls.
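For illustration, a rough sketch of the kind of sweep I mean (the dataset sizes and sleep times below are just example values, not exactly what I ran):

import itertools
import time

import numpy as np
import tensorflow as tf

import gpflow

# Sketch: vary the training set size and the sleep between gradient calls
# to see when the slowdown appears.
for n, sleep_s in itertools.product([500, 1000, 2000], [0.02, 0.2]):
    X = np.random.randn(n, 5)
    Y = np.random.randn(n, 1)
    model = gpflow.models.GPR((X, Y), kernel=gpflow.kernels.SquaredExponential())
    model.training_loss()  # warm-up call to initialise TensorFlow
    times = []
    for _ in range(5):
        t0 = time.time()
        with tf.GradientTape() as g:
            loss = model.training_loss()
        g.gradient(loss, model.trainable_variables)
        times.append((time.time() - t0) * 1000)
        time.sleep(sleep_s)
    print('N = %d, sleep = %.2f s: %s' % (n, sleep_s, ['%.1f ms' % t for t in times]))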
@a-mch I've just tried to reproduce this but couldn't (TensorFlow 2.2.0, but on Linux and on the CPU). I had to slightly adjust your minimal example (and updated your issue accordingly). Can you check again by copying that into a fresh file? Can you confirm whether this is an issue only on the GPU, or on the CPU as well?
(NB: a function you wrap in tf.function() should not take complex objects such as the model as input!)
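For illustration, a minimal sketch of one way to follow that advice: close over the model instead of passing it in, so only tensors cross the tf.function boundary.

# Sketch: assumes `model` is the GPR instance from the example above.
@tf.function
def compute_gradient_closed_over():
    with tf.GradientTape() as g:
        loss = model.training_loss()
    return g.gradient(loss, model.trainable_variables)

GPflow models also provide a training_loss_closure() helper that serves a similar purpose.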
So it looks like this is a GPU-only issue, and it still seems to be present with TF 2.3 and GPflow 2.1/develop. @a-mch what did you use as a "native TF" example for comparison where you say this issue isn't present? I suspect it's got something to do with large matrices (GPR constructs an N x N matrix) getting shuffled to/from the GPU...
Hi, thanks for looking into it. I tried computing derivatives of a squared-exponential covariance function written with plain TensorFlow ops, like so:
import tensorflow as tf
import time


def compute(aa, bb, ll):
    dist = ((aa[:, None, :] - bb[None, :, :]) / ll[None, None, :]) ** 2
    return tf.reduce_sum(tf.exp(-tf.reduce_sum(dist, 2)))


def compute_gradient(a, b, l):
    with tf.GradientTape() as g:
        g.watch(l)
        y = compute(a, b, l)
    return g.gradient(y, l)


def run_test(num_repeats):
    a = tf.random.normal((1000, 5))
    b = tf.random.normal((1000, 5))
    l = tf.random.normal((5,))

    # Initial call to initialise TensorFlow
    dydl = compute_gradient(a, b, l)

    # Call gradient tape in loop
    for _ in range(num_repeats):
        t0 = time.time()
        dydl = compute_gradient(a, b, l)
        len(dydl)  # sanity check that it has actually computed the derivative
        print('Derivative took %.3f milliseconds' % ((time.time() - t0) * 1000))


if __name__ == '__main__':
    run_test(10)
Each iteration of the loop takes roughly the same time, i.e. no slowdown after the first iteration:

Derivative took 2.991 milliseconds
Derivative took 2.991 milliseconds
Derivative took 2.993 milliseconds
Derivative took 2.991 milliseconds
Derivative took 3.991 milliseconds
Derivative took 2.988 milliseconds
Derivative took 2.032 milliseconds
Derivative took 2.991 milliseconds
Derivative took 2.000 milliseconds
Derivative took 2.951 milliseconds
You are also correct that the issue only appears on the GPU: if I run the code on the CPU, each loop iteration takes a similar amount of time.
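For reference, a minimal sketch of how the CPU comparison can be done by pinning the ops to the CPU (the device string is an assumption and may need adjusting for a given setup):

# Sketch: run the same gradient computation with ops placed on the CPU,
# to compare against the default GPU placement.
with tf.device('/CPU:0'):
    with tf.GradientTape() as g:
        loss = model.training_loss()
    dldp = g.gradient(loss, model.trainable_variables)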