get the decoder hidden states after decoding
After passing an explicit output layer like here, I see that the decoder outputs after dynamic_decode are the output distribution of size |V|, where V is the vocabulary.
How can I recover the decoder hidden states?
A follow-up question: in tf.contrib.seq2seq.BasicDecoder, the output_layer parameter is optional. So, while doing greedy decoding, if I don't pass any value for that parameter, will it perform argmax on the RNN hidden states and pass the result as the next time step's decoder input (which is actually unintended)?
Need attention, please!
@rajarsheem Yes, the outputs of dynamic_decode in the NMT codebase are the vocab logits.
If you don't give BasicDecoder an output_layer and are using GreedyEmbeddingHelper, I think it will use the RNN hidden states as the logits and argmax over the hidden state to try to get a word id.
Not passing the output_layer is useful when using other helpers, such as ScheduledOutputTrainingHelper.
You may implement a custom helper that takes the output layer, generates the vocab logits within the helper, and returns both the vocab logits and the hidden states.
It's kind of strange that the hidden states are, by default, not exposed :/
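Concretely, this is roughly what you get back by default (a sketch, TF 1.x contrib APIs):

```python
outputs, final_state, _ = tf.contrib.seq2seq.dynamic_decode(decoder)
outputs.rnn_output   # vocab logits if an output_layer was passed, raw cell output otherwise
outputs.sample_id    # per-step argmax/sampled ids
final_state          # only the LAST cell state; per-step hidden states are not returned
```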
"I think it will use RNN hidden states as the logits, and argmax on the hidden state to try to get a word id." It looks very undesirable but then I don't know why output_layer is not provided in the decoded here. Is it okay to leave like this -- using the hidden state argmax to pick next time step's input ?
@rajarsheem
During training, we can apply the output_layer after all time steps have finished here, because we have the word ids in the target language. So the outputs here contain the RNN outputs (which are the h states when using an LSTM).
During inference, we have to pass the RNN outputs through the output layer at each time step to get the next word id.
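Roughly, the training path looks like this (a sketch; variable names like `decoder_emb_inputs` and `output_layer` are placeholders for the projection Dense layer and embedded targets):

```python
# Teacher forcing: feed the gold target ids, decode without an output_layer,
# then project all time steps at once after dynamic_decode.
helper = tf.contrib.seq2seq.TrainingHelper(decoder_emb_inputs, decoder_lengths)
decoder = tf.contrib.seq2seq.BasicDecoder(cell, helper, initial_state)
outputs, _, _ = tf.contrib.seq2seq.dynamic_decode(decoder)
rnn_states = outputs.rnn_output      # [batch, time, hidden] -- the h states
logits = output_layer(rnn_states)    # [batch, time, vocab], projected once at the end
```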
@oahziur I don't get your first point. How can we compute the hidden states of all steps in the first place without using the output layer and taking the argmax to feed as the next step's input?
In other words, how are we computing the outputs here (which are actually RNN states) without needing to feed the output layer argmax (we cannot feed the hidden state argmax, can we)?
@rajarsheem we don't feed the hidden state argmax because we have the target ids. See how the TrainingHelper is created: https://github.com/tensorflow/nmt/blob/master/nmt/model.py#L373.
Yeah, I get your point. But if I am not using teacher forcing (i.e., using GreedyEmbeddingHelper), I would want my predicted ids to be used. And for that to happen, I would need the output layer to be part of the decoder.
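I.e., the usual greedy inference wiring (sketch; names like `embedding`, `start_tokens`, `max_len` are placeholders):

```python
helper = tf.contrib.seq2seq.GreedyEmbeddingHelper(embedding, start_tokens, end_token)
decoder = tf.contrib.seq2seq.BasicDecoder(
    cell, helper, initial_state,
    output_layer=tf.layers.Dense(vocab_size, use_bias=False))
outputs, _, _ = tf.contrib.seq2seq.dynamic_decode(decoder, maximum_iterations=max_len)
predicted_ids = outputs.sample_id   # fed back through the helper at each step
# but here outputs.rnn_output is the projected logits, not the hidden states
```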
@rajarsheem
Yes, the code you referenced in the last comment is only for teacher forcing during training, so that's why the output_layer is not being used.
So I need to hack my way in to use the output_layer as part of the decoder and also make dynamic_decode return the hidden states. Any suggestions on what the flow should be?
@rajarsheem
Yes, I think you can implement a custom GreedyEmbeddingHelper (one which accepts an output layer), so you don't need to pass the output layer to the BasicDecoder.
For example, you can insert code before here to convert the rnn_outputs to logits.
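Something along these lines (an untested sketch; the class name is made up, and `output_layer` is assumed to be a Dense projection to vocab size):

```python
import tensorflow as tf

class GreedyEmbeddingHelperWithProjection(tf.contrib.seq2seq.GreedyEmbeddingHelper):
  """Helper that owns the projection, so BasicDecoder needs no output_layer
  and outputs.rnn_output stays the raw cell (hidden) output."""

  def __init__(self, embedding, start_tokens, end_token, output_layer):
    super(GreedyEmbeddingHelperWithProjection, self).__init__(
        embedding, start_tokens, end_token)
    self._output_layer = output_layer  # e.g. tf.layers.Dense(vocab_size, use_bias=False)

  def sample(self, time, outputs, state, name=None):
    # `outputs` is the raw cell output here; project to vocab logits before the
    # argmax so the word id fed back comes from the vocabulary distribution.
    logits = self._output_layer(outputs)
    return tf.argmax(logits, axis=-1, output_type=tf.int32)
```

next_inputs can stay inherited, since it only uses the sampled ids for the end-token check and the embedding lookup.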
This is what I did: added a new attribute final_output to the BasicDecoderOutput namedtuple that stores the projected outputs whenever BasicDecoder has an output_layer.
In BasicDecoder's step(), final_outputs (the linearly transformed cell_outputs) is what goes into sample() and is also passed into the outputs namedtuple, which is essentially a BasicDecoderOutput and is returned. A few other changes were needed as well.
Consequently, when dynamic_decode returns a BasicDecoderOutput, it already has a final_output attribute holding the unnormalized logits, while rnn_output remains the cell output.
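Roughly, the same idea can also be done as a subclass instead of editing the TF source (untested sketch; the class/field names are illustrative and it pokes at BasicDecoder's private attributes):

```python
import collections
import tensorflow as tf

class DecoderOutputWithLogits(
    collections.namedtuple("DecoderOutputWithLogits",
                           ("rnn_output", "sample_id", "final_output"))):
  pass


class BasicDecoderWithLogits(tf.contrib.seq2seq.BasicDecoder):
  """BasicDecoder variant returning both the raw cell output and the logits."""

  @property
  def output_size(self):
    base = super(BasicDecoderWithLogits, self).output_size
    return DecoderOutputWithLogits(
        rnn_output=self._cell.output_size,  # raw hidden state size
        sample_id=base.sample_id,
        final_output=base.rnn_output)       # projected size when output_layer is set

  @property
  def output_dtype(self):
    base = super(BasicDecoderWithLogits, self).output_dtype
    return DecoderOutputWithLogits(
        rnn_output=base.rnn_output,
        sample_id=base.sample_id,
        final_output=base.rnn_output)

  def step(self, time, inputs, state, name=None):
    cell_outputs, cell_state = self._cell(inputs, state)
    final_outputs = cell_outputs
    if self._output_layer is not None:
      final_outputs = self._output_layer(cell_outputs)
    # Sample and feed back from the projected logits, but keep both tensors.
    sample_ids = self._helper.sample(
        time=time, outputs=final_outputs, state=cell_state)
    finished, next_inputs, next_state = self._helper.next_inputs(
        time=time, outputs=final_outputs, state=cell_state, sample_ids=sample_ids)
    outputs = DecoderOutputWithLogits(cell_outputs, sample_ids, final_outputs)
    return (outputs, next_state, next_inputs, finished)
```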
@oahziur @rajarsheem Could you help with a similar issue #298 ? Thanks.
+1 because users may have use cases where the decoder's outputs are needed without being passed through the final feedforward layer.
Example: the decoder uses scheduled sampling, so the dense layer is needed, but the user wants to use sampled softmax and hence needs the RNN outputs without their being passed through the dense layer.
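E.g., something like this only works on the pre-projection outputs (sketch; the weight and placeholder names are made up):

```python
# Sampled softmax needs the raw decoder outputs plus the projection's own
# weights; it cannot consume already-projected logits.
rnn_states = outputs.rnn_output                      # [batch, time, hidden], no output_layer
loss = tf.nn.sampled_softmax_loss(
    weights=proj_weights,                            # [vocab_size, hidden]
    biases=proj_biases,                              # [vocab_size]
    labels=tf.reshape(target_ids, [-1, 1]),          # [batch * time, 1]
    inputs=tf.reshape(rnn_states, [-1, hidden_size]),
    num_sampled=num_sampled,
    num_classes=vocab_size)
```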
@rajarsheem I had to create an adapted version of BasicDecoder for similar reasons.
Often one wants an RNN to output not only logits but also some additional loss terms or summary metrics. The function of the output_layer is to extract logits from a cell output, which is necessary for the mechanics of the step function. However, if an output_layer is present, step currently returns only the logits (unfortunately also called cell_outputs) rather than the original outputs.