interaction_network_pytorch

The model massively overfits

Open mys007 opened this issue 7 years ago • 2 comments

Hi,

thanks a lot for releasing a third-party implementation of the paper. Nevertheless, I'm afraid there is a problem with your code, or at least the hyperparameters are not chosen well. This can be seen by looking at the validation error as follows:

import numpy as np
import matplotlib.pyplot as plt
from IPython.display import clear_output

n_epoch = 100
batches_per_epoch = 100

data_test = gen(n_objects, True)

losses = []
losses_test = []
for epoch in range(n_epoch):
    for _ in range(batches_per_epoch):
        objects, sender_relations, receiver_relations, relation_info, target = get_batch(data, 30)
        predicted = interaction_network(objects, sender_relations, receiver_relations, relation_info)
        loss = criterion(predicted, target)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        # .item() replaces the pre-0.4 loss.data[0] idiom
        losses.append(np.sqrt(loss.item()))

    objects, sender_relations, receiver_relations, relation_info, target = get_batch(data_test, 30)
    predicted = interaction_network(objects, sender_relations, receiver_relations, relation_info)
    losses_test.append(np.sqrt(criterion(predicted, target).item()))

    clear_output(True)
    plt.figure(figsize=(20, 5))
    plt.subplot(131)
    plt.title('Epoch %s RMS Train Error %s' % (epoch, np.mean(losses[-100:])))
    plt.plot(losses)
    plt.subplot(132)
    plt.title('Epoch %s RMS Test Error %s' % (epoch, np.mean(losses_test[-100:])))
    plt.plot(losses_test)
    plt.show()

I got train RMS 2.3 but validation RMS 209.6. Update: This is mostly because you train on just a single "scene", so that the network actually never sees any other masses and thus cannot generalise to those.
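To illustrate the single-scene point, here is a minimal sketch of how one could build a training pool from many scenes with resampled masses. `make_scene` is my own hypothetical stand-in for the repo's `gen()`, with an assumed column layout of `[mass, x, y, vx, vy]` per object:

```python
import numpy as np

# Hypothetical stand-in for the repo's gen(): each object row is
# [mass, x, y, vx, vy]; masses are resampled for every scene.
def make_scene(n_objects, rng):
    masses = rng.uniform(0.1, 10.0, size=(n_objects, 1))
    state = rng.normal(size=(n_objects, 4))  # positions and velocities
    return np.concatenate([masses, state], axis=1)

rng = np.random.default_rng(0)
n_objects = 6
# 50 scenes with different masses; training batches would then be
# sampled across all of them, not from one fixed trajectory
scene_pool = np.stack([make_scene(n_objects, rng) for _ in range(50)])
```

Sampling batches across such a pool exposes the network to varied masses, which is exactly what the single-scene setup never does.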

mys007 avatar Sep 12 '18 15:09 mys007

Hi @mys007 and @higgsfield. By chance, did you get any further with this?

I wonder how this network is used once trained. This is my attempt:

n_steps = len(data)

# Preallocate space for position and velocity predictions and seed with the first state.
speed_predictions = torch.from_numpy(np.zeros_like(data[:, :, 3:]))
pos_predictions = torch.from_numpy(np.zeros_like(data[:, :, 1:3]))
prev_state = torch.Tensor(data[0:1]).cuda()

with torch.no_grad():
    for ii in range(n_steps):
        speed_prediction = interaction_network(prev_state, sender_relations_1, receiver_relations_1, relation_info_1)
        pos_prediction = prev_state[0, :, 1:3] + speed_prediction * diff_t

        speed_predictions[ii] = speed_prediction
        pos_predictions[ii] = pos_prediction

        # Feed the integrated state back in as the next input
        prev_state[0, :, 1:3] = pos_prediction
        prev_state[0, :, 3:] = speed_prediction

ii = 1
plt.plot(data[:, ii, 1], data[:, ii, 2], label='real')
plt.plot(pos_predictions[:, ii, 0], pos_predictions[:, ii, 1], label='predicted')
plt.legend()

In a few words, since we are predicting the speed, I assume we need to build the next state by adding speed * time to the previous position. But like this, the result is awful...

Since I'm using the same data that was used for training, and the network is completely overfitting, this rollout should at least reproduce the training trajectory well.
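For intuition, a toy scalar example (not the repo's model): even a small constant velocity error compounds once positions are built by Euler integration, so a rollout can diverge from the true trajectory even when each single-step prediction looks fine:

```python
import numpy as np

diff_t = 0.001
n_steps = 1000
true_v = 1.0
pred_v = true_v + 0.05  # small constant velocity error

# Euler-integrate both velocities and compare the resulting positions
true_pos = np.cumsum(np.full(n_steps, true_v * diff_t))
pred_pos = np.cumsum(np.full(n_steps, pred_v * diff_t))
drift = np.abs(pred_pos - true_pos)  # grows linearly with step count
```

Here the position error after 1000 steps is 1000 times the per-step error, and a learned model's errors are state-dependent, so the drift is usually worse than linear.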

rpicatoste avatar Aug 14 '19 15:08 rpicatoste

Hi @rpicatoste. I haven't worked on this since. But just to give you two tips for debugging:

  • Check whether even the very first prediction is correct. If it is only approximately right, then from the second step onwards the integrated states you feed back into the network are likely no longer part of the training distribution.
  • You can try to modify the code to predict positions instead of speed.
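The first tip can be sketched on a toy 1-D system (my own stand-in for the trained network): measure the teacher-forced one-step error separately from the free-rollout error; if only the latter explodes, the problem is integration drift rather than the network itself.

```python
import numpy as np

true_factor = 0.95           # assumed ground-truth dynamics: x_next = 0.95 * x
def model(x):
    return 0.9 * x           # stand-in for a slightly-wrong trained network

truth = np.array([true_factor ** t for t in range(21)])

# One-step (teacher-forced): always feed the ground-truth state
one_step_err = np.abs(model(truth[:-1]) - truth[1:])

# Rollout: feed the model its own previous prediction
roll = [truth[0]]
for _ in range(20):
    roll.append(model(roll[-1]))
rollout_err = np.abs(np.array(roll) - truth)
```

In this toy setup the one-step error stays bounded by the per-step model bias, while the rollout error keeps compounding, which is the signature to look for.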

mys007 avatar Aug 21 '19 23:08 mys007