Problem when using `NCPNormalOutput` with mini-batches
Environment
I installed Edward2 with tf-nightly as a dependency:
```
pip install edward2[tf-nightly]@"git+https://github.com/google/edward2.git#egg=edward2"
```
which set up the following dependencies:
```
tensorflow==2.4.0.dev20200926
tensorflow-probability==0.12.0.dev20200926
edward2==0.0.2
...
```
Python version is 3.7.9.
Problem
I tried to run the NCP example from the documentation in `noise.py` (with minor additions to make it a runnable program):
```python
import edward2 as ed
import tensorflow as tf

batch_size, dataset_size = 128, 1000

# Some random data.
features = tf.random.normal((dataset_size, 25))
labels = tf.random.normal((dataset_size, 1))

inputs = tf.keras.layers.Input(shape=(25,))
x = ed.layers.NCPNormalPerturb()(inputs)  # double input batch
x = tf.keras.layers.Dense(64, activation='relu')(x)
x = tf.keras.layers.Dense(64, activation='relu')(x)
means = ed.layers.DenseVariationalDropout(1, activation=None)(x)  # get mean
means = ed.layers.NCPNormalOutput(labels)(means)  # halve input batch
stddevs = tf.keras.layers.Dense(1, activation='softplus')(x[:batch_size])
outputs = tf.keras.layers.Lambda(lambda x: ed.Normal(x[0], x[1]))([means, stddevs])
model = tf.keras.Model(inputs=inputs, outputs=outputs)

optimizer = tf.optimizers.Adam(learning_rate=1e-3)

# Run training loop.
num_steps = 1000
for _ in range(num_steps):
  with tf.GradientTape() as tape:
    predictions = model(features)
    loss = -tf.reduce_mean(predictions.distribution.log_prob(labels))
    loss += model.losses[0] / dataset_size  # KL regularizer for output layer
    loss += model.losses[-1]
  trainable_vars = model.trainable_variables
  gradients = tape.gradient(loss, trainable_vars)
  optimizer.apply_gradients(zip(gradients, trainable_vars))
```
and ran into:
```
ValueError: Arguments `loc` and `scale` must have compatible shapes; loc.shape=(1000, 1), scale.shape=(128, 1).
```
That error is clear: the training loop runs full-batch updates, while `stddevs = tf.keras.layers.Dense(1, activation='softplus')(x[:batch_size])` only uses the first `batch_size` elements. But that is not my main question. Changing the training loop to use mini-batches
```python
...
ds = tf.data.Dataset.from_tensor_slices((features, labels)).batch(batch_size)

# Run training loop.
num_steps = 1000
for i in range(num_steps):
  print(i)
  for features_batch, labels_batch in ds:
    with tf.GradientTape() as tape:
      predictions = model(features_batch)
      loss = -tf.reduce_mean(predictions.distribution.log_prob(labels_batch))
      ...
```
fixes the above problem but introduces a new problem in `NCPNormalOutput`:
```
tensorflow.python.framework.errors_impl.InvalidArgumentError: Incompatible shapes: [1000,1] vs. [128,1] [Op:SquaredDifference]
```
That is, the shape of the `labels` tensor passed to the `NCPNormalOutput` constructor is incompatible with the shape of the mini-batch. Does `NCPNormalOutput` (when centering the output prior at the labels) currently not support mini-batch updates?
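For context, here is the kind of workaround I have been considering: drop `NCPNormalOutput` from the graph, have the model return both halves of the doubled batch, and add an output-prior term per mini-batch in the training loop. This is only a sketch of my reading of the NCP idea, not of what `NCPNormalOutput` actually computes internally; in particular the KL direction and the unit scale of the output prior are my own assumptions:

```python
import edward2 as ed
import tensorflow as tf
import tensorflow_probability as tfp

batch_size, dataset_size = 128, 1000
features = tf.random.normal((dataset_size, 25))
labels = tf.random.normal((dataset_size, 1))

# Same network as above, but without NCPNormalOutput; the model returns means
# and stddevs for the full doubled (clean + perturbed) batch.
inputs = tf.keras.layers.Input(shape=(25,))
x = ed.layers.NCPNormalPerturb()(inputs)  # double input batch
x = tf.keras.layers.Dense(64, activation='relu')(x)
x = tf.keras.layers.Dense(64, activation='relu')(x)
means = ed.layers.DenseVariationalDropout(1, activation=None)(x)
stddevs = tf.keras.layers.Dense(1, activation='softplus')(x)
model = tf.keras.Model(inputs=inputs, outputs=[means, stddevs])

optimizer = tf.optimizers.Adam(learning_rate=1e-3)
ds = tf.data.Dataset.from_tensor_slices((features, labels)).batch(batch_size)

num_epochs = 100
for _ in range(num_epochs):
  for features_batch, labels_batch in ds:
    n = tf.shape(features_batch)[0]
    with tf.GradientTape() as tape:
      batch_means, batch_stddevs = model(features_batch)
      # First half of the doubled batch: predictions on the clean inputs.
      predictive = tfp.distributions.Normal(batch_means[:n], batch_stddevs[:n])
      nll = -tf.reduce_mean(predictive.log_prob(labels_batch))
      # Second half: predictions on the perturbed inputs, pulled towards an
      # output prior centered at *this mini-batch's* labels (scale 1. is a placeholder).
      perturbed = tfp.distributions.Normal(batch_means[n:], batch_stddevs[n:])
      output_prior = tfp.distributions.Normal(labels_batch, 1.)
      ncp_term = tf.reduce_mean(output_prior.kl_divergence(perturbed))
      loss = nll + ncp_term + model.losses[0] / dataset_size  # variational dropout KL
    gradients = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(gradients, model.trainable_variables))
```

If `NCPNormalOutput` already supports this and I am simply using it wrong, I would much rather use the layer directly.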
Semi-related: the layers `NCPNormalPerturb` and `NCPNormalOutput` are only needed during training; at test/prediction time they seem to have no influence on the result. So why is the NCP-related functionality designed as layers at all? Shouldn't this be a concern of the loss function only? Edit: I see now that `NCPNormalOutput` creates a distribution from its input and samples from that distribution, so it does influence the result. Still, that behavior doesn't seem specific to NCPs, so my question stands.
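To make that question concrete: the only way I currently see to keep the NCP machinery out of prediction is plain Keras layer sharing between a training graph (with the NCP layers) and a prediction graph (without them). This is just a sketch of that workaround; nothing edward2-specific is assumed beyond the layers already used above:

```python
import edward2 as ed
import tensorflow as tf

batch_size = 128
labels = tf.random.normal((1000, 1))  # stand-in for the dataset labels, as above

# Shared layers: trained once, reused by both graphs.
hidden1 = tf.keras.layers.Dense(64, activation='relu')
hidden2 = tf.keras.layers.Dense(64, activation='relu')
mean_head = ed.layers.DenseVariationalDropout(1, activation=None)
stddev_head = tf.keras.layers.Dense(1, activation='softplus')

# Training graph: batch doubling and the NCP output prior are present.
train_inputs = tf.keras.layers.Input(shape=(25,))
h = hidden2(hidden1(ed.layers.NCPNormalPerturb()(train_inputs)))
train_means = ed.layers.NCPNormalOutput(labels)(mean_head(h))
train_stddevs = stddev_head(h[:batch_size])
train_outputs = tf.keras.layers.Lambda(
    lambda t: ed.Normal(t[0], t[1]))([train_means, train_stddevs])
train_model = tf.keras.Model(train_inputs, train_outputs)

# Prediction graph: same weights, no perturbation, no NCP output layer.
pred_inputs = tf.keras.layers.Input(shape=(25,))
h = hidden2(hidden1(pred_inputs))
pred_outputs = tf.keras.layers.Lambda(
    lambda t: ed.Normal(t[0], t[1]))([mean_head(h), stddev_head(h)])
pred_model = tf.keras.Model(pred_inputs, pred_outputs)
```

If the NCP term lived purely in the loss function, the second graph would not be necessary.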