
New task for adding scalar values (0 or 1)

Open · Zeta36 opened this issue · 6 comments

Common Settings

The model is trained on 2-layer feedforward controller (with hidden sizes 128 and 256 respectively) with the following set of hyperparameters:

  • RMSProp Optimizer with learning rate of 10⁻⁴, momentum of 0.9.
  • Memory word size of 10, with a single read head.
  • A batch size of 1.
  • input_size = 3.
  • output_size = 1.
  • sequence_max_length = 100.
  • words_count = 15.
  • word_size = 10.
  • read_heads = 1.

A square loss function of the form (y - y_)**2 is used, where both 'y' and 'y_' are scalar values.

The input is a (1, random_length, 3) tensor, where the 3 corresponds to a one-hot encoding vector of size 3:

  • 010 is a '0'
  • 100 is a '1'
  • 001 is the end mark
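As an illustration, this encoding could be sketched in plain Python like so (`ENCODING` and `encode_sequence` are hypothetical names for this sketch, not part of the repo):

```python
# Hypothetical sketch of the one-hot encoding described above; names are illustrative.
ENCODING = {
    0: [0.0, 1.0, 0.0],      # 010 is a '0'
    1: [1.0, 0.0, 0.0],      # 100 is a '1'
    'end': [0.0, 0.0, 1.0],  # 001 is the end mark
}

def encode_sequence(bits):
    """Encode a list of 0/1 values, appending the end mark, as a (1, len+1, 3) nested list."""
    steps = [ENCODING[b] for b in bits] + [ENCODING['end']]
    return [steps]  # leading batch dimension of 1
```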

So, an example of an input of length 10 would be the following 3D tensor:

```
[[[ 0. 1. 0.]
  [ 0. 1. 0.]
  [ 0. 1. 0.]
  [ 1. 0. 0.]
  [ 0. 1. 0.]
  [ 0. 1. 0.]
  [ 1. 0. 0.]
  [ 0. 1. 0.]
  [ 0. 1. 0.]
  [ 0. 0. 1.]]]
```

This input is a representation of a sequence of adding 0 and 1 values, in the form of:

0 + 0 + 1 + 0 + 0 + 1 + 0 + 0 + (end_mark)

The target output is a 3D tensor with the result of this adding task. In the example above:

[[[2.0]]]

The DNC output is a 3D-tensor of shape (1, random_length, 1). For example:

```
[[[ 0.45]
  [-0.11]
  [ 1.3 ]
  [ 5.0 ]
  [ 0.5 ]
  [ 0.1 ]
  [ 1.0 ]
  [-0.5 ]
  [ 0.33]
  [ 0.12]]]
```

The target output and the DNC output are both then reduced with tf.reduce_sum(), so we end up with two scalar values. For example:

Target_output: 2.0 DNC_output: 5.89

And then we apply the square loss function:

loss = (Target_o - DNC_o)**2

and finally the gradient update.
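The reduce-then-square pipeline above can be sketched in plain Python (a minimal illustration with hypothetical names; the real code does this in the TensorFlow graph with tf.reduce_sum):

```python
def adding_loss(dnc_output, target_output):
    """Reduce both (1, length, 1) nested-list tensors to scalars,
    then take the squared error, mirroring the loss described above."""
    dnc_scalar = sum(step[0] for step in dnc_output[0])
    target_scalar = sum(step[0] for step in target_output[0])
    return (target_scalar - dnc_scalar) ** 2
```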

Results

The model is going to receive as input a random-length sequence of 0 and 1 values like:

Input: 1 + 0 + 1 + 1 + 0 + 0 + 0 + 0 + 1

Then it will return a scalar value for this input adding process. For example, the DNC will output something like 3.98824. This value is the predicted result for the input adding sequence (we truncate it to its integer part):

DNC prediction: 1 + 0 + 1 + 1 + 0 + 0 + 0 + 0 + 1 = 3 [3.98824]
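A one-line sketch of that final step (an assumption on my part: the logs below show -0.0118587 reported as -1, which suggests flooring rather than truncation toward zero):

```python
import math

def predicted_result(dnc_scalar):
    """Take the integer part of the DNC's scalar output.
    Flooring matches the logged behaviour (e.g. -0.0118587 -> -1)."""
    return math.floor(dnc_scalar)
```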

Once we train the model with:

$ python tasks/copy/train.py --iterations=50000

we can see that the model learns to compute this adding function in fewer than 1000 iterations, and the loss drops from:

Iteration 0/1000 Avg. Logistic Loss: 24.9968

to:

Iteration 1000/1000 Avg. Logistic Loss: 0.0076

It seems like the DNC model is able to learn this pseudo-code:

```
function(x):
    if x == [ 1. 0. 0.]:
        return (near) 1.0   # float values
    else:
        return (near) 0.0   # float values
```
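That pseudo-code can be written as a short runnable sketch (illustrative only), which also shows why the final reduce_sum recovers the count:

```python
def learned_step(x):
    """Roughly what the controller seems to learn: ~1.0 for the '1' symbol, else ~0.0."""
    return 1.0 if x == [1.0, 0.0, 0.0] else 0.0

def add_sequence(steps):
    # The reduce_sum over the per-step outputs yields the count of '1' symbols.
    return sum(learned_step(x) for x in steps)
```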

Generalization test

We set sequence_max_length = 100 for the model, but in the training process we use only random-length sequences of up to 10 steps (sequence_max_length/10). Once training is finished, we let the trained model generalize to random-length sequences of up to 100 steps (sequence_max_length).
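The train/test length split could be sketched like this (`sample_length` is an illustrative helper, not the repo's code):

```python
import random

SEQUENCE_MAX_LENGTH = 100

def sample_length(testing=False):
    """Random sequence length: up to max/10 during training,
    up to the full max when testing generalization."""
    upper = SEQUENCE_MAX_LENGTH if testing else SEQUENCE_MAX_LENGTH // 10
    return random.randint(1, upper)
```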

Results show that the model successfully generalizes the adding task, even with sequences 10 times longer than the training ones.

These are real data outputs:

Building Computational Graph ... Done! Initializing Variables ... Done!

Iteration 0/1000 Avg. Logistic Loss: 24.9968 Real value: 0 + 1 + 0 + 0 + 0 + 1 + 0 + 1 + 1 + 1 = 5 Predicted: 0 + 1 + 0 + 0 + 0 + 1 + 0 + 1 + 1 + 1 = 0 [0.000319847]

Iteration 100/1000 Avg. Logistic Loss: 5.8042 Real value: 0 + 1 + 0 + 0 + 1 + 0 + 1 + 0 + 1 + 1 = 5 Predicted: 0 + 1 + 0 + 0 + 1 + 0 + 1 + 0 + 1 + 1 = 6 [6.1732]

Iteration 200/1000 Avg. Logistic Loss: 0.7492 Real value: 1 + 1 + 1 + 1 + 1 + 1 + 1 + 0 + 1 + 1 = 9 Predicted: 1 + 1 + 1 + 1 + 1 + 1 + 1 + 0 + 1 + 1 = 8 [8.91952]

Iteration 300/1000 Avg. Logistic Loss: 0.0253 Real value: 0 + 1 + 1 = 2 Predicted: 0 + 1 + 1 = 2 [2.0231]

Iteration 400/1000 Avg. Logistic Loss: 0.0089 Real value: 0 + 1 + 0 + 0 + 0 + 1 + 1 = 3 Predicted: 0 + 1 + 0 + 0 + 0 + 1 + 1 = 2 [2.83419]

Iteration 500/1000 Avg. Logistic Loss: 0.0444 Real value: 1 + 0 + 1 + 1 = 3 Predicted: 1 + 0 + 1 + 1 = 2 [2.95937]

Iteration 600/1000 Avg. Logistic Loss: 0.0093 Real value: 1 + 0 + 1 + 1 + 0 + 0 + 0 + 0 + 1 = 4 Predicted: 1 + 0 + 1 + 1 + 0 + 0 + 0 + 0 + 1 = 3 [3.98824]

Iteration 700/1000 Avg. Logistic Loss: 0.0224 Real value: 0 + 1 + 1 + 0 + 1 + 1 + 1 + 1 + 0 + 0 = 6 Predicted: 0 + 1 + 1 + 0 + 1 + 1 + 1 + 1 + 0 + 0 = 5 [5.93554]

Iteration 800/1000 Avg. Logistic Loss: 0.0115 Real value: 0 + 0 = 0 Predicted: 0 + 0 = -1 [-0.0118587]

Iteration 900/1000 Avg. Logistic Loss: 0.0023 Real value: 1 + 1 + 0 + 0 + 1 + 1 + 1 + 0 + 0 = 5 Predicted: 1 + 1 + 0 + 0 + 1 + 1 + 1 + 0 + 0 = 4 [4.97147]

Iteration 1000/1000 Avg. Logistic Loss: 0.0076 Real value: 1 + 0 + 0 + 1 + 1 + 0 + 0 + 1 = 4 Done!

Testing generalization...

Iteration 0/1000 Predicted: 1 + 0 + 0 + 1 + 1 + 0 + 0 + 1 = 4 [4.123]

Saving Checkpoint ... Real value: 1 + 1 + 0 + 0 + 0 + 1 + 0 + 0 + 1 + 1 + 0 + 0 + 0 + 0 + 0 + 0 + 0 + 1 + 0 + 0 = 6 Predicted: 1 + 1 + 0 + 0 + 0 + 1 + 0 + 0 + 1 + 1 + 0 + 0 + 0 + 0 + 0 + 0 + 0 + 1 + 0 + 0 = 6 [6.24339]

Iteration 1/1000 Real value: 1 + 0 + 1 + 0 + 0 + 1 + 1 + 0 + 1 + 1 + 1 + 0 + 1 + 0 + 0 + 0 + 0 + 1 + 0 + 0 + 1 + 1 = 11 Predicted: 1 + 0 + 1 + 0 + 0 + 1 + 1 + 0 + 1 + 1 + 1 + 0 + 1 + 0 + 0 + 0 + 0 + 1 + 0 + 0 + 1 + 1 = 11 [11.1931]

Iteration 2/1000 Real value: 0 + 0 + 0 + 1 + 0 + 0 + 0 + 1 + 1 + 1 + 0 + 1 + 1 + 0 + 1 + 0 + 1 + 0 + 0 + 1 + 0 + 1 + 1 + 0 + 0 + 0 + 1 + 0 + 1 + 0 + 0 + 1 + 0 + 1 + 1 + 0 + 1 + 0 + 1 + 1 + 0 + 1 + 1 + 1 + 0 + 1 + 0 + 0 + 1 + 1 + 0 + 0 + 1 + 0 + 0 + 0 + 1 + 0 + 1 + 0 + 0 + 0 + 1 + 0 + 0 + 0 + 1 + 1 + 1 + 1 = 33 Predicted: 0 + 0 + 0 + 1 + 0 + 0 + 0 + 1 + 1 + 1 + 0 + 1 + 1 + 0 + 1 + 0 + 1 + 0 + 0 + 1 + 0 + 1 + 1 + 0 + 0 + 0 + 1 + 0 + 1 + 0 + 0 + 1 + 0 + 1 + 1 + 0 + 1 + 0 + 1 + 1 + 0 + 1 + 1 + 1 + 0 + 1 + 0 + 0 + 1 + 1 + 0 + 0 + 1 + 0 + 0 + 0 + 1 + 0 + 1 + 0 + 0 + 0 + 1 + 0 + 0 + 0 + 1 + 1 + 1 + 1 = 32 [32.9866]

Iteration 3/1000 Real value: 1 + 0 + 1 + 1 + 0 + 1 + 0 + 0 + 1 + 1 + 0 + 0 + 0 + 1 + 0 + 0 + 1 + 1 + 0 + 1 + 1 + 1 + 0 + 1 + 0 + 0 + 1 + 0 + 0 + 0 + 1 + 1 = 16 Predicted: 1 + 0 + 1 + 1 + 0 + 1 + 0 + 0 + 1 + 1 + 0 + 0 + 0 + 1 + 0 + 0 + 1 + 1 + 0 + 1 + 1 + 1 + 0 + 1 + 0 + 0 + 1 + 0 + 0 + 0 + 1 + 1 = 16 [16.1541]

Iteration 4/1000 Real value: 1 + 0 + 0 + 1 + 1 + 1 + 0 + 0 + 1 + 0 + 0 + 0 + 0 + 1 + 1 + 1 + 0 + 1 + 1 + 1 + 1 + 0 + 0 + 0 + 0 + 0 + 1 + 0 + 1 + 0 + 1 + 1 + 0 + 0 + 0 + 1 + 0 + 1 + 1 + 1 + 1 + 0 + 0 + 1 + 0 + 1 + 1 + 1 + 1 + 0 + 1 + 0 + 0 + 1 + 0 + 1 + 1 + 0 + 1 + 0 + 0 + 0 + 0 + 1 + 1 + 1 + 1 + 0 + 0 + 0 + 0 + 0 + 0 + 0 + 0 + 1 + 1 + 0 + 1 + 0 + 1 + 0 + 1 + 1 + 1 + 1 + 0 + 1 + 0 + 0 = 44 Predicted: 1 + 0 + 0 + 1 + 1 + 1 + 0 + 0 + 1 + 0 + 0 + 0 + 0 + 1 + 1 + 1 + 0 + 1 + 1 + 1 + 1 + 0 + 0 + 0 + 0 + 0 + 1 + 0 + 1 + 0 + 1 + 1 + 0 + 0 + 0 + 1 + 0 + 1 + 1 + 1 + 1 + 0 + 0 + 1 + 0 + 1 + 1 + 1 + 1 + 0 + 1 + 0 + 0 + 1 + 0 + 1 + 1 + 0 + 1 + 0 + 0 + 0 + 0 + 1 + 1 + 1 + 1 + 0 + 0 + 0 + 0 + 0 + 0 + 0 + 0 + 1 + 1 + 0 + 1 + 0 + 1 + 0 + 1 + 1 + 1 + 1 + 0 + 1 + 0 + 0 = 43 [43.5211]

Zeta36 avatar Jan 07 '17 16:01 Zeta36

Impressive work! I'm certainly curious about how it was able to generalize with the same number of memory locations!

What do you think about taking it up a notch? Let's remove that reduce_sum and see if it can learn to add on its own. Here's how I think it could go: your input sequence would be something like 1 + 0 + 0 + 1 + 1 + 1 + 0 + 1 + 0 = -, and your target output would be the scalar 5. Instead of attempting to copy the sequence via adding, we make the task such that at the step containing '-' the model should output the value of the summation! Your loss would be the squared difference between the output at that step and your target output; the loss at all previous steps is omitted (you can find the technique of omitting the loss on specific steps in the recently pushed bAbI task).
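The per-step loss masking described here could look roughly like this in plain Python (an illustrative sketch; the repo's bAbI task implements it with TensorFlow weight tensors):

```python
def weighted_step_loss(outputs, targets, weights):
    """Mean squared error per step, weighted so that only the '-' step
    (weight 1.0) contributes; all earlier steps (weight 0.0) are omitted."""
    n = len(outputs)
    return sum(w * (o - t) ** 2 for o, t, w in zip(outputs, targets, weights)) / n
```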

I've just pushed new updates to the code that include optimizations in both memory and execution-time performance, so you'll be able to leave it training for more iterations, and more quickly, while trying this!

I'm looking forward to seeing your results with this!

Mostafa-Samir avatar Jan 14 '17 23:01 Mostafa-Samir

Hello, @Mostafa-Samir.

You can get the code of the adding task without the tf.reduce_sum() here: https://github.com/Zeta36/DNC-tensorflow/blob/master/tasks/adding/train_v2.py.

But I'm afraid that removing the tf.reduce_sum() makes the model unable to generalize successfully with a fixed memory size, as it did before. In this new version of the code, the model is still able to learn to resolve any sequence of 0 and 1 sums, but it fails when we apply the learned model to sequences larger than those used in the training process.

I think that's because the original version I published here makes use of tf.reduce_sum() as a kind of accumulator. I think the model learns an algorithm like this:

```
function(X):
    for each x in X:
        if x == [ 1. 0. 0.]:
            output (near) 1.0   # float value
        else:
            output (near) 0.0   # float value
```

And later, tf.reduce_sum() performs the correct sum over the whole sequence output. The output will be nearly 1 for each [ 1. 0. 0.] input vector and nearly 0 otherwise, so tf.reduce_sum() gives the correct answer no matter how long the input is. And I think it is because this little "if/else" f(x) algorithm is so easy to learn that the model is able to generalize to arbitrarily long input sequences X with a fixed memory size.

As soon as we remove the tf.reduce_sum(), as in the version I made following your instructions, this trick no longer works, and the model has to learn another algorithm that is more complex and less generalizable than the f(x) I described above.

What do you think, @Mostafa-Samir?

Regards, Samu.

Zeta36 avatar Jan 15 '17 15:01 Zeta36

Here you have a little excerpt of a real training result of the new version (https://github.com/Zeta36/DNC-tensorflow/blob/master/tasks/adding/train_v2.py):

```
Iteration 800/1001
Avg. Cross-Entropy: 0.0231753
Avg. 100 iterations time: 0.03 minutes
Approx. time to completion: 0.00 hours
DNC input [[[ 0. 1. 0.] [ 1. 0. 0.] [ 0. 1. 0.] [ 1. 0. 0.] [ 0. 1. 0.] [ 0. 1. 0.] [ 0. 0. 1.]]]
Text input: 1 + 0 + 1 + 0 + 1 + 1 = -
Target_output [[[ 0.] [ 0.] [ 0.] [ 0.] [ 0.] [ 0.] [ 4.]]]
DNC output [[[ 0. ] [ 0. ] [ 0. ] [ 0. ] [ 0. ] [ 0. ] [ 3.52538943]]]
Real operation: 1 + 0 + 1 + 0 + 1 + 1 = 4
Predicted result: 1 + 0 + 1 + 0 + 1 + 1 = 4 [3.52539]
...
Iteration 1000/1001
Avg. Cross-Entropy: 0.0046492
Avg. 100 iterations time: 0.03 minutes
Approx. time to completion: 0.00 hours
DNC input [[[ 0. 1. 0.] [ 1. 0. 0.] [ 1. 0. 0.] [ 1. 0. 0.] [ 1. 0. 0.] [ 1. 0. 0.] [ 1. 0. 0.] [ 0. 0. 1.]]]
Text input: 1 + 0 + 0 + 0 + 0 + 0 + 0 = -
Target_output [[[ 0.] [ 0.] [ 0.] [ 0.] [ 0.] [ 0.] [ 0.] [ 1.]]]
DNC output [[[ 0. ] [ 0. ] [ 0. ] [ 0. ] [ 0. ] [ 0. ] [ 0. ] [ 0.86268544]]]
Real operation: 1 + 0 + 0 + 0 + 0 + 0 + 0 = 1
Predicted result: 1 + 0 + 0 + 0 + 0 + 0 + 0 = 1 [0.862685]

Iteration 1001/1001
Saving Checkpoint ... Done!
```

Testing generalization...

Iteration 0/1000 Real operation: 1 + 0 + 1 + 0 + 0 + 1 + 0 + 0 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 0 + 1 + 0 + 0 + 0 + 0 + 1 + 1 + 0 + 0 + 1 + 1 + 0 + 0 + 1 + 0 + 1 + 1 + 1 + 1 + 0 + 0 + 1 + 1 + 0 + 1 + 0 + 1 + 1 + 0 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 0 + 0 + 1 + 0 + 1 + 1 + 1 + 1 + 1 + 0 + 1 + 1 + 0 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 0 + 1 + 0 + 1 + 1 + 1 + 0 + 0 + 0 = 56 Predicted result: 1 + 0 + 1 + 0 + 0 + 1 + 0 + 0 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 0 + 1 + 0 + 0 + 0 + 0 + 1 + 1 + 0 + 0 + 1 + 1 + 0 + 0 + 1 + 0 + 1 + 1 + 1 + 1 + 0 + 0 + 1 + 1 + 0 + 1 + 0 + 1 + 1 + 0 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 0 + 0 + 1 + 0 + 1 + 1 + 1 + 1 + 1 + 0 + 1 + 1 + 0 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 0 + 1 + 0 + 1 + 1 + 1 + 0 + 0 + 0 = 9316 [[ 9316.20117188]]

Iteration 1/1000 Real operation: 1 + 0 = 1 Predicted result: 1 + 0 = 1 [[ 0.853342]]

Iteration 2/1000 Real operation: 1 + 0 + 1 + 1 + 1 + 1 + 0 + 0 + 1 + 0 + 1 + 0 + 0 + 0 + 0 + 0 + 0 + 1 + 1 + 0 + 0 + 0 + 0 + 1 + 1 + 1 + 1 + 0 + 1 + 1 + 0 + 0 + 0 + 0 + 1 + 0 + 1 = 17 Predicted result: 1 + 0 + 1 + 1 + 1 + 1 + 0 + 0 + 1 + 0 + 1 + 0 + 0 + 0 + 0 + 0 + 0 + 1 + 1 + 0 + 0 + 0 + 0 + 1 + 1 + 1 + 1 + 0 + 1 + 1 + 0 + 0 + 0 + 0 + 1 + 0 + 1 = 74 [[ 73.88546753]]

Zeta36 avatar Jan 15 '17 15:01 Zeta36

@Mostafa-Samir, thanks to the great improvement in the core of your DNC implementation, I've developed another task for testing the project. I've made a model that is able to successfully learn an argmax function over an input.

The model is fed a vector of one-hot encoded integer values, and the target output is the index of the maximum value inside the vector. I'm glad to say that your DNC is able to learn this function using just a feedforward controller, and even better, it is able to generalize to larger vectors than those used in the training process!

You can see my code here: https://github.com/Zeta36/DNC-tensorflow/blob/master/tasks/argmax/train_v2.py.

And here you can see some results:

```
...
Iteration 9900/10001
Avg. Cross-Entropy: 0.1064857
Avg. 100 iterations time: 0.16 minutes
Approx. time to completion: 0.00 hours
DNC input [[[ 0. 0. 1. 0. 0. 0. 0. 0. 0. 0.] [ 0. 0. 0. 0. 0. 0. 0. 1. 0. 0.] [ 0. 0. 0. 1. 0. 0. 0. 0. 0. 0.] [ 0. 0. 0. 0. 0. 1. 0. 0. 0. 0.] [ 1. 0. 0. 0. 0. 0. 0. 0. 0. 0.] [ 0. 0. 0. 0. 0. 0. 0. 1. 0. 0.] [ 0. 0. 0. 1. 0. 0. 0. 0. 0. 0.] [ 0. 0. 1. 0. 0. 0. 0. 0. 0. 0.] [ 0. 0. 0. 0. 0. 1. 0. 0. 0. 0.] [ 0. 0. 0. 0. 0. 0. 0. 1. 0. 0.] [ 0. 0. 0. 0. 0. 0. 1. 0. 0. 0.] [ 0. 1. 0. 0. 0. 0. 0. 0. 0. 0.] [ 0. 0. 0. 0. 0. 0. 0. 0. 0. 1.]]]
Target_output [[[ 0.] [ 0.] [ 0.] [ 0.] [ 0.] [ 0.] [ 0.] [ 0.] [ 0.] [ 0.] [ 0.] [ 0.] [ 1.]]]
DNC output [[[ 0. ] [-0. ] [ 0. ] [ 0. ] [ 0. ] [ 0. ] [ 0. ] [ 0. ] [ 0. ] [ 0. ] [ 0. ] [ 0. ] [ 1.44688594]]]
Real argmax(X): 1
Predicted f(X): 1

Iteration 10000/10001
Avg. Cross-Entropy: 0.0603415
Avg. 100 iterations time: 0.16 minutes
Approx. time to completion: 0.00 hours
DNC input [[[ 0. 0. 1. 0. 0. 0. 0. 0. 0. 0.] [ 0. 0. 0. 1. 0. 0. 0. 0. 0. 0.] [ 0. 0. 1. 0. 0. 0. 0. 0. 0. 0.] [ 0. 0. 0. 0. 0. 1. 0. 0. 0. 0.] [ 1. 0. 0. 0. 0. 0. 0. 0. 0. 0.] [ 0. 0. 0. 0. 0. 0. 0. 0. 1. 0.] [ 0. 0. 0. 1. 0. 0. 0. 0. 0. 0.] [ 0. 0. 0. 0. 0. 1. 0. 0. 0. 0.] [ 0. 0. 0. 1. 0. 0. 0. 0. 0. 0.] [ 0. 0. 0. 0. 1. 0. 0. 0. 0. 0.] [ 0. 0. 0. 0. 0. 0. 0. 0. 0. 1.]]]
Target_output [[[ 0.] [ 0.] [ 0.] [ 0.] [ 0.] [ 0.] [ 0.] [ 0.] [ 0.] [ 0.] [ 5.]]]
DNC output [[[ 0. ] [ 0. ] [ 0. ] [ 0. ] [ 0. ] [ 0. ] [ 0. ] [ 0. ] [ 0. ] [ 0. ] [ 4.93556786]]]
Real argmax(X): 5
Predicted f(X): 5
```

Saving Checkpoint ... Done!

Testing generalization...

Iteration 0/10000 Real argmax(X): 3 Predicted f(X): 3

Iteration 1/10000 Real argmax(X): 2 Predicted f(X): 2

Iteration 2/10000 Real argmax(X): 4 Predicted f(X): 3

Iteration 3/10000 Real argmax(X): 0 Predicted f(X): 0

Iteration 4/10000 Real argmax(X): 1 Predicted f(X): 1

Iteration 5/10000 Real argmax(X): 3 Predicted f(X): 3

Iteration 6/10000 Real argmax(X): 1 Predicted f(X): 2

Iteration 7/10000 Real argmax(X): 3 Predicted f(X): 2

Iteration 8/10000 Real argmax(X): 6 Predicted f(X): 6

Iteration 9/10000 Real argmax(X): 5 Predicted f(X): 4

Iteration 10/10000 Real argmax(X): 2 Predicted f(X): 2

Iteration 11/10000 Real argmax(X): 5 Predicted f(X): 4

Iteration 12/10000 Real argmax(X): 2 Predicted f(X): 2

Iteration 13/10000 Real argmax(X): 0 Predicted f(X): 2

Iteration 14/10000 Real argmax(X): 2 Predicted f(X): 5

Iteration 15/10000 Real argmax(X): 0 Predicted f(X): 0

Iteration 16/10000 Real argmax(X): 1 Predicted f(X): 1

Iteration 17/10000 Real argmax(X): 2 Predicted f(X): 2

Iteration 18/10000 Real argmax(X): 2 Predicted f(X): 2

Iteration 19/10000 Real argmax(X): 2 Predicted f(X): 2

Iteration 20/10000 Real argmax(X): 1 Predicted f(X): 1

Iteration 21/10000 Real argmax(X): 4 Predicted f(X): 4

Iteration 22/10000 Real argmax(X): 10 Predicted f(X): 10

Iteration 23/10000 Real argmax(X): 6 Predicted f(X): 5

Iteration 24/10000 Real argmax(X): 1 Predicted f(X): 2

Iteration 25/10000 Real argmax(X): 4 Predicted f(X): 3

Iteration 26/10000 Real argmax(X): 1 Predicted f(X): 3

Iteration 27/10000 Real argmax(X): 2 Predicted f(X): 2

Iteration 28/10000 Real argmax(X): 0 Predicted f(X): 0

Iteration 29/10000 Real argmax(X): 3 Predicted f(X): 3

Iteration 30/10000 Real argmax(X): 0 Predicted f(X): 0

Iteration 31/10000 Real argmax(X): 0 Predicted f(X): 0

Iteration 32/10000 Real argmax(X): 0 Predicted f(X): 0

Iteration 33/10000 Real argmax(X): 0 Predicted f(X): 0

Iteration 34/10000 Real argmax(X): 6 Predicted f(X): 6

Iteration 35/10000 Real argmax(X): 0 Predicted f(X): 0

Iteration 36/10000 Real argmax(X): 5 Predicted f(X): 4

Iteration 37/10000 Real argmax(X): 0 Predicted f(X): 0

Iteration 38/10000 Real argmax(X): 2 Predicted f(X): 2

Iteration 39/10000 Real argmax(X): 3 Predicted f(X): 3

Iteration 40/10000 Real argmax(X): 4 Predicted f(X): 4

Iteration 41/10000 Real argmax(X): 6 Predicted f(X): 6

Iteration 42/10000 Real argmax(X): 15 Predicted f(X): 14

Iteration 43/10000 Real argmax(X): 1 Predicted f(X): 1

Iteration 44/10000 Real argmax(X): 2 Predicted f(X): 2

Iteration 45/10000 Real argmax(X): 11 Predicted f(X): 10

Iteration 46/10000 Real argmax(X): 3 Predicted f(X): 3

Iteration 47/10000 Real argmax(X): 1 Predicted f(X): 1

Iteration 48/10000 Real argmax(X): 1 Predicted f(X): 1

Iteration 49/10000 Real argmax(X): 13 Predicted f(X): 13 ... ...

I don't know how the model is able to figure out where the highest value appeared in the sequence of one-hot encoded input values, but it does, and it is even able to generalize this learned method to sequences double the size of those used in the training process, with no additional memory use. DeepMind has found something big with this DNC, and they are improving it with a sparse version that uses fewer resources: https://arxiv.org/pdf/1610.09027v1.pdf

Regards, Samu.

Zeta36 avatar Jan 15 '17 20:01 Zeta36

Great work Samu @Zeta36 !

Regarding the adding task, I have a comment about how you apply the weights to the loss. You use the following:

loss = tf.reduce_mean(tf.square((loss_weights * output) - ncomputer.target_output))

while you should be using:

loss = tf.reduce_mean(loss_weights * tf.square(output - ncomputer.target_output))

Remember, you're weighting the contribution of the loss of each step, not the significance of each step on its own. Mathematically, it's written as

loss = mean_t[ w_t * (output_t - target_t)^2 ]

not

loss = mean_t[ (w_t * output_t - target_t)^2 ]

I don't really know how you generate the output vector, but the 1st formulation can easily overestimate your loss value.
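A toy numerical check of the difference (made-up values): when a masked step has a nonzero target, putting the weight inside the square adds a spurious penalty at that step.

```python
w     = [0.0, 1.0]   # loss weights: only the last step counts
y_hat = [1.9, 3.5]   # model outputs (illustrative values)
y     = [2.0, 4.0]   # targets

n = len(w)
# weight inside the square: the masked step contributes (0 - 2.0)^2 = 4.0
inside  = sum((wi * oi - ti) ** 2 for wi, oi, ti in zip(w, y_hat, y)) / n
# weight on the squared error: the masked step contributes nothing
outside = sum(wi * (oi - ti) ** 2 for wi, oi, ti in zip(w, y_hat, y)) / n
```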

Try to adopt this change and see if it has any effect on the model. You should also try to test the generalization of the adding task by using the same trained model with a larger memory matrix (more locations), just as in the visualization notebook of the copy task. It'd also be a good idea to separate the generalization tests into different scripts from the training one, and to use a single descriptive statistic (like the percentage of correct answers, or the percentage of error, or whatever you decide) to describe your results. Then, instead of dumping the entire log in the README, you can add one or two examples from the log and summarize your results with that statistic!

I'll be happy then to merge your contributions into the repo!

Mostafa-Samir avatar Jan 18 '17 18:01 Mostafa-Samir

Hi @Zeta36 and @Mostafa-Samir, I am really excited about the results of your tasks and about the DNC's potential.

For this reason, I am trying to implement a further task by myself. I am interested in understanding whether a DNC can solve it. I would really appreciate any feedback from you, thanks.

Task description

The task is to count the total number of repeated numbers in a list.

For example:

Input: [ 1, 2, 3] 
Output: [0]

Input: [ 1, 2, 3, 2, 4, 1, 5]
                  X     X     : Repetitions
Output: [2]

The pseudo code the DNC should learn is:

function(x, seenNumbers):
   if x in seenNumbers:
       return 1
   else:
       return 0
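For reference, a plain-Python version of that pseudo-code applied over a whole list could look like this (`count_repetitions` is just an illustrative name):

```python
def count_repetitions(xs):
    """Total number of repeated elements, applying the per-element
    pseudo-code above and summing its 0/1 results."""
    seen = set()
    total = 0
    for x in xs:
        if x in seen:
            total += 1  # x was seen before: counts as a repetition
        else:
            seen.add(x)
    return total
```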

I am wondering if the DNC can manage the seenNumbers list by itself.

Settings

Assuming that the DNC can solve the task (I suppose a simple LSTM net can), I would structure the data as follows:

  • Input: a (1, random length, 1) tensor
  • Output: either a (1, random length, 1) tensor or a scalar containing the sum of the repetitions
  • Loss: depending on the output structure, a square loss function applied element by element or between two scalars
  • DNC parameters: currently it is unclear to me how to set the memory parameters (word size, number of words)

What do you think? Would it be feasible for the DNC to solve this task?

Thanks, Alessandro

cornagli avatar Sep 10 '19 13:09 cornagli