a-PyTorch-Tutorial-to-Image-Captioning

multi layer RNN

Open WoodySG2018 opened this issue 6 years ago • 4 comments

I coded a multi-layer RNN. It uses 2 layers by default; you can change `num_layers` to adjust the number of layers when calling `DecoderWithAttention`. Please check.
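A minimal sketch of what exposing a `num_layers` knob on a decoder could look like. Note this is a hypothetical, simplified module (`MultiLayerDecoder`) for illustration only; the actual `DecoderWithAttention` in this repo uses an `LSTMCell` plus attention and takes different arguments.

```python
import torch
import torch.nn as nn

class MultiLayerDecoder(nn.Module):
    """Simplified caption decoder with a configurable number of LSTM layers."""

    def __init__(self, embed_dim, decoder_dim, vocab_size, num_layers=2):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        # num_layers stacks LSTM layers: layer k consumes layer k-1's outputs
        self.lstm = nn.LSTM(embed_dim, decoder_dim,
                            num_layers=num_layers, batch_first=True)
        self.fc = nn.Linear(decoder_dim, vocab_size)

    def forward(self, captions, states=None):
        out, states = self.lstm(self.embedding(captions), states)
        return self.fc(out), states

# Switch from the default 2 layers to 4 by changing one argument:
decoder = MultiLayerDecoder(embed_dim=512, decoder_dim=512,
                            vocab_size=1000, num_layers=4)
scores, (h, c) = decoder(torch.randint(0, 1000, (8, 20)))
print(scores.shape)  # torch.Size([8, 20, 1000])
print(h.shape)       # torch.Size([4, 8, 512]) — one state per layer
```

With `nn.LSTM`, the hidden/cell states gain a leading `num_layers` dimension, which is why the state initialization (discussed below in this thread) also has to change for a multi-layer decoder.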

WoodySG2018 avatar Jul 03 '19 09:07 WoodySG2018

Does it achieve improved results when compared to a single layer one? If yes, by what margin?

kmario23 avatar Jul 23 '19 11:07 kmario23

> Does it achieve improved results when compared to a single layer one? If yes, by what margin?

Hi kmario23, training is still ongoing. So far I can see clearly different loss curves for the 2-layer and 4-layer RNNs (I can't trace back the single-layer RNN's performance), but it's too early to tell how performance relates to the number of layers on my working dataset.

Theoretically, capturing a highly hierarchical structure with just one layer is not optimal. So hopefully, by integrating a multi-layer option here, we enable users to test the model's power along another dimension on their own datasets.

WoodySG2018 avatar Jul 24 '19 03:07 WoodySG2018

Is there any update on this issue? Will it be merged? Very nice functionality, I would say.

Just a quick question about how the initial states for the multi-layer LSTM are constructed: why do we need to initialize the hidden/cell states of the LSTM by passing the image representation through an FC layer? Is this the standard way of doing it?

nilinykh avatar Jan 06 '20 18:01 nilinykh

> Just a quick question about how the initial states for the multi-layer LSTM are constructed: why do we need to initialize the hidden/cell states of the LSTM by passing the image representation through an FC layer? Is this the standard way of doing it?

That's how we pass the representation of the image learned by the CNN (encoder) to the RNN/LSTM (decoder). The decoder dimension is a hyperparameter and usually differs from the encoder dimension, so we have to project the encoder output, which is 2048-dimensional in this case, down to the decoder dimension. Otherwise it's not possible to feed the encoder output directly into the decoder as its initial state.
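A minimal sketch of that projection, assuming the 2048-d encoder output and mean-pooled spatial features as in the tutorial's `init_hidden_state`. For a multi-layer LSTM, repeating the projected state across layers is just one possible choice, shown here for illustration; a separate FC layer per LSTM layer would be another option.

```python
import torch
import torch.nn as nn

encoder_dim, decoder_dim, num_layers = 2048, 512, 2  # dims as in the tutorial

# FC layers projecting the 2048-d CNN feature to the decoder's dimension
init_h = nn.Linear(encoder_dim, decoder_dim)
init_c = nn.Linear(encoder_dim, decoder_dim)

encoder_out = torch.randn(8, 196, encoder_dim)  # (batch, num_pixels, encoder_dim)
mean_feat = encoder_out.mean(dim=1)             # average over pixels: (batch, encoder_dim)

# One option for a multi-layer LSTM: reuse the same projection for every layer,
# giving states of shape (num_layers, batch, decoder_dim) as nn.LSTM expects
h0 = init_h(mean_feat).unsqueeze(0).repeat(num_layers, 1, 1)
c0 = init_c(mean_feat).unsqueeze(0).repeat(num_layers, 1, 1)
print(h0.shape)  # torch.Size([2, 8, 512])
```

Without such a projection, the 2048-d encoder output would not match the shape `nn.LSTM` requires for its initial `(h_0, c_0)` states.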

kmario23 avatar Jan 07 '20 15:01 kmario23