A variant of the deep LSTM?
On page 7 of the article, the authors write:
we have used the following variant of the deep LSTM architecture
But in your code, you are using the LSTM model defined in the Chain framework, so I wonder whether your model is an exact representation of the model described in the article. I am not arguing with you, I just want to discuss this with you.
I don't think the variant of the LSTM is different from the original version, except that the input X_t is designed to include the memory readings :)
If you take a look at the first page of the METHODS part, you can see that the formulation of the input gate includes three inputs: X_t, h_{t-1}^l and h_t^{l-1}. I think the input gate of the original LSTM only uses X_t and h_{t-1}^l. That is why I have this question.
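To spell out the comparison I have in mind (my own notation, hopefully matching the METHODS section, please correct me if I misread it):

i_t^l = \sigma(W_i^l [X_t; h_{t-1}^l; h_t^{l-1}] + b_i^l)    (input gate in the paper)
i_t   = \sigma(W_i [X_t; h_{t-1}] + b_i)                      (input gate of the original LSTM)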
@Seraphli I think it is just the right expression for an LSTM in a network with several hidden layers. The l-th layer cell gets input from its own past (h_{t-1}^l), the layer below it (h_t^{l-1}) and the sample (X_t). You may also refer to Eq. (11) in [Graves, A., Mohamed, A.-r. & Hinton, G. Speech recognition with deep recurrent neural networks. In IEEE International Conference on Acoustics, Speech and Signal Processing (eds Ward, R. et al.) 6645–6649 (Curran Associates, 2013).] for an explicit expression of the LSTM, which is the same as in the Nature paper :)
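To make the layer indexing concrete, here is a minimal NumPy sketch (my own names and shapes, not the code from this repository) of one step of such a per-layer cell: every gate sees the concatenation [X_t, h_{t-1}^l, h_t^{l-1}], and h_t^{l-1} is simply the output of the layer below at the same time step. With a single layer the h_t^{l-1} term drops out and the step reduces to the ordinary LSTM.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def deep_lstm_layer_step(x_t, h_below, h_prev, c_prev, W, b):
    """One step of layer l (illustrative only, hypothetical names/shapes).

    x_t     : external input at time t (per the comment above, in the paper
              this already includes the memory readings)
    h_below : h_t^{l-1}, output of the layer below at the SAME time step
              (use a zero-length array for the first layer)
    h_prev  : h_{t-1}^l, this layer's output at the previous time step
    c_prev  : this layer's previous cell state
    W, b    : gate weights, shape (4*H, len(x_t)+H+len(h_below)) and (4*H,)
    """
    z = W @ np.concatenate([x_t, h_prev, h_below]) + b
    i, f, o, g = np.split(z, 4)          # input, forget, output gates + candidate
    i, f, o = sigmoid(i), sigmoid(f), sigmoid(o)
    c = f * c_prev + i * np.tanh(g)      # new cell state
    h = o * np.tanh(c)                   # new hidden state
    return h, c

# Tiny usage example: two layers, feeding h_t^1 into layer 2 as h_below.
rng = np.random.default_rng(0)
X, H = 5, 8
W1, b1 = rng.normal(size=(4 * H, X + H)),     np.zeros(4 * H)   # layer 1: no h_below
W2, b2 = rng.normal(size=(4 * H, X + H + H)), np.zeros(4 * H)   # layer 2: gets h_t^1
x_t = rng.normal(size=X)
h1, c1 = deep_lstm_layer_step(x_t, np.zeros(0), np.zeros(H), np.zeros(H), W1, b1)
h2, c2 = deep_lstm_layer_step(x_t, h1,          np.zeros(H), np.zeros(H), W2, b2)
```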
Where is the dataset?