Encoder-decoder attention extraction in ASR transcribe
Is your feature request related to a problem? Please describe.
There are many studies showing that the encoder-decoder attention can be used for auxiliary tasks (e.g. with DTW to get word-level timestamps, or for simultaneous/streaming translation, such as with the StreamAtt method). However, there is currently no way to get the encoder-decoder attention when running beam search. This kind of feature is already available in other repositories, e.g. HF transformers (see the sketch below).
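For reference, a minimal sketch of the equivalent capability in HF transformers, using Whisper as an example (the dummy `audio_array` just stands in for a real 16 kHz mono waveform):

```python
# Not NeMo code -- a sketch of the equivalent feature in HF transformers.
import numpy as np
from transformers import WhisperForConditionalGeneration, WhisperProcessor

processor = WhisperProcessor.from_pretrained("openai/whisper-tiny")
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-tiny")

audio_array = np.zeros(16000, dtype=np.float32)  # placeholder: 1 s of silence
inputs = processor(audio_array, sampling_rate=16000, return_tensors="pt")

out = model.generate(
    inputs.input_features,
    return_dict_in_generate=True,
    output_attentions=True,  # adds cross_attentions to the generate output
)
# out.cross_attentions: one tuple per generated token, each holding per-layer
# tensors of shape (batch * beams, heads, step_len, encoder_frames) -- which is
# what DTW-based word-level timestamp methods consume.
```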
Describe the solution you'd like
It would be great if beam search generation also returned the encoder-decoder attention weights and included them in the returned Hypothesis, along the lines of the sketch below.
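A possible shape for the returned data. This is only an illustration: the `cross_attentions` field and the `return_cross_attentions` flag are hypothetical, not existing NeMo APIs.

```python
# Hedged sketch of the requested API; field and flag names are hypothetical.
from dataclasses import dataclass
from typing import List, Optional

import torch


@dataclass
class Hypothesis:                 # simplified stand-in for NeMo's Hypothesis
    score: float
    y_sequence: torch.Tensor      # decoded token ids
    text: Optional[str] = None
    # Proposed addition: one tensor per decoder layer with shape
    # (heads, decoded_tokens, encoder_frames), filled during beam search.
    cross_attentions: Optional[List[torch.Tensor]] = None


# Desired usage (hypothetical flag):
# hyps = asr_model.transcribe(["audio.wav"], return_cross_attentions=True)
# attn = hyps[0].cross_attentions  # e.g. feed to DTW for word-level timestamps
```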
Describe alternatives you've considered
An alternative might be an additional pass through the decoder at the end of generation solely to compute the attention, but this would add unnecessary computational cost.
Hi Marco, thanks for the feature request. We'll consider adding it in an upcoming release.
You should be able to access the cross-attention values here: https://github.com/NVIDIA-NeMo/NeMo/blob/0b1be8d1165f49ee2ef1e74f72f2ff07350f6798/nemo/collections/asr/modules/transformer/transformer_decoders.py#L288
Is this what you're looking for?
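If you just need to inspect those values externally for now, a forward hook is one quick way to do it. This is only a rough sketch: the submodule name filter is an assumption, so check `named_modules()` for the actual cross-attention module names in your NeMo version.

```python
# Sketch: capture decoder cross-attention activity during transcribe() with
# forward hooks. The "second_sub_layer" name is an assumption -- inspect
# asr_model.named_modules() to find the cross-attention submodules.
import torch

# asr_model: any attention-encoder-decoder NeMo ASR model, e.g. loaded with
# ASRModel.from_pretrained(...) or restore_from(...).

captured = {}

def make_hook(name):
    def hook(module, inputs, output):
        # Note: if the module's forward returns only the attended context,
        # the attention probabilities stay internal and the module would need
        # a small patch to also return or stash them.
        captured.setdefault(name, []).append(
            output.detach().cpu() if torch.is_tensor(output) else output
        )
    return hook

handles = [
    module.register_forward_hook(make_hook(name))
    for name, module in asr_model.named_modules()
    if "second_sub_layer" in name  # assumed cross-attention submodule name
]

# hyps = asr_model.transcribe(["audio.wav"])  # run decoding as usual

for h in handles:
    h.remove()
```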