
Question about coverage mechanism implementation

Open iamxpy opened this issue 5 years ago • 0 comments

I am trying to figure out the implementation of the coverage mechanism, and after debugging for a while I still cannot understand why the procedure for producing the coverage vector in decode mode is NOT the same as in training/eval mode.

Related code is here: this line

Note that this attention decoder passes each decoder input through a linear layer with the previous step's context vector to get a modified version of the input. If initial_state_attention is False, on the first decoder step the "previous context vector" is just a zero vector. If initial_state_attention is True, we use initial_state to (re)calculate the previous step's context vector. We set this to False for train/eval mode (because we call attention_decoder once for all decoder steps) and True for decode mode (because we call attention_decoder once for each decoder step).
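The input-feeding step described in that comment can be sketched in a few lines of NumPy. This is a toy stand-in (hypothetical names `input_feed`, `W`, `b`; random weights) for the model's `linear([inp] + [prev_context], ...)` call, just to show what "a zero vector as the previous context" does on the first step when initial_state_attention is False:

```python
import numpy as np

def input_feed(x, prev_context, W, b):
    """Concatenate the raw decoder input with the previous step's context
    vector and project through a linear layer (toy version of the
    decoder's input-feeding)."""
    return np.concatenate([x, prev_context]) @ W + b

emb_dim, ctx_dim = 4, 3
rng = np.random.default_rng(0)
W = rng.standard_normal((emb_dim + ctx_dim, emb_dim))
b = np.zeros(emb_dim)
x0 = rng.standard_normal(emb_dim)

# First decoder step with initial_state_attention=False:
# the "previous context vector" is just zeros, so only the
# input's half of the weight matrix contributes.
modified_x0 = input_feed(x0, np.zeros(ctx_dim), W, b)
```

With initial_state_attention=True, that zero vector would instead be replaced by a context recomputed from initial_state, which is exactly the extra attention call discussed below.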

IMHO, the training and decode procedures would mismatch to some extent in such an implementation (please correct me if I am wrong).

For example:

Let H be all encoder hidden states (a list of tensors), then,

In training/eval mode, every decode step uses the attention network only once:

Input: H, current_decoder_hidden_state, previous_coverage (None for the first decode step)

Output: next coverage, next context and attention weights (i.e. attn_dist in the code).
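The training/eval step above can be sketched as a single attention call. This is a minimal NumPy sketch with a toy dot-product-plus-coverage score standing in for the model's learned Bahdanau-style attention; `attention_step` is a hypothetical name, not the repo's API:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attention_step(H, dec_state, coverage=None):
    """One attention call per decode step, as in training/eval mode.

    H: (T, d) encoder states; dec_state: (d,); coverage: (T,) or None.
    """
    T = H.shape[0]
    if coverage is None:          # first decode step in train/eval mode
        coverage = np.zeros(T)
    scores = H @ dec_state + coverage   # toy score; real model uses v^T tanh(...)
    attn_dist = softmax(scores)         # attention weights over encoder steps
    context = attn_dist @ H             # next context vector
    new_coverage = coverage + attn_dist # coverage accumulates attention mass
    return attn_dist, context, new_coverage
```

Note that coverage is updated exactly once per step, and the attn_dist that is returned is the same one that was just added into the coverage.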

In decode mode, every step applies the attention mechanism twice:

(1) The first time:

Input: H, previous_decoder_hidden_state, previous_coverage (0s for the first decode step)

Output: modified previous context and next coverage (discard attention weights here)

(2) The second time:

Input: H, current_decoder_hidden_state, next coverage

Output: next context, attention weights (DO NOT update next coverage here)
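The two-call decode step described above can be sketched as follows. Again this is a toy NumPy sketch (hypothetical names `attention`, `decode_step`; simplified scoring), not the repo's actual code, but it mirrors the flow: the first call updates coverage and discards its attention weights, the second produces attn_dist without touching coverage:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attention(H, s, coverage):
    # toy score; the real model uses v^T tanh(W_h h_i + W_s s + w_c c_i)
    scores = H @ s + coverage
    attn_dist = softmax(scores)
    context = attn_dist @ H
    return attn_dist, context, coverage + attn_dist

def decode_step(H, prev_state, cur_state, prev_coverage=None):
    """One decode-mode step (initial_state_attention=True)."""
    T = H.shape[0]
    if prev_coverage is None:
        prev_coverage = np.zeros(T)  # zeros, not None, on the first step
    # (1) first call: recompute the previous step's context from prev_state;
    #     coverage IS updated here, attention weights are discarded
    _, prev_context, next_coverage = attention(H, prev_state, prev_coverage)
    # (2) second call: compute this step's context and attn_dist from
    #     cur_state; coverage is NOT updated again
    attn_dist, context, _ = attention(H, cur_state, next_coverage)
    return attn_dist, context, next_coverage
```

The mismatch the issue points out is visible here: in decode mode the coverage fed into the second (scoring) call comes from the *previous* decoder state, whereas in train/eval mode the single call both scores and updates coverage with the *current* state, so the returned attn_dist and the coverage increment come from different distributions across the two modes.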

iamxpy · Apr 14 '20 17:04