Encoder-decoder attention extraction in ASR transcribe
Is your feature request related to a problem? Please describe.
There are many studies showing that the encoder-decoder attention can be used for auxiliary tasks (e.g. with DTW to get word-level timestamps, or for simultaneous/streaming translation, such as with the StreamAtt method). However, there is currently no way to get the encoder-decoder attention when running beam search. This kind of feature is already available in other repositories, e.g. HF transformers (see the sketch below).
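For reference, a minimal sketch of the equivalent capability in HF transformers, using Whisper as an example (the dummy `audio_array` just stands in for a real 16 kHz mono waveform):

```python
# Not NeMo code -- a sketch of the equivalent feature in HF transformers.
import numpy as np
from transformers import WhisperForConditionalGeneration, WhisperProcessor

processor = WhisperProcessor.from_pretrained("openai/whisper-tiny")
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-tiny")

audio_array = np.zeros(16000, dtype=np.float32)  # placeholder: 1 s of silence
inputs = processor(audio_array, sampling_rate=16000, return_tensors="pt")

out = model.generate(
    inputs.input_features,
    return_dict_in_generate=True,
    output_attentions=True,  # adds cross_attentions to the generate output
)
# out.cross_attentions: one tuple per generated token, each holding per-layer
# tensors of shape (batch * beams, heads, step_len, encoder_frames) -- which is
# what DTW-based word-level timestamp methods consume.
```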
Describe the solution you'd like
It would be great if beam search generation also returned the encoder-decoder attention weights and included them in the returned Hypothesis, along the lines of the sketch below.
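A possible shape for the returned data. This is only an illustration: the `cross_attentions` field and the `return_cross_attentions` flag are hypothetical, not existing NeMo APIs.

```python
# Hedged sketch of the requested API; field and flag names are hypothetical.
from dataclasses import dataclass
from typing import List, Optional

import torch


@dataclass
class Hypothesis:                 # simplified stand-in for NeMo's Hypothesis
    score: float
    y_sequence: torch.Tensor      # decoded token ids
    text: Optional[str] = None
    # Proposed addition: one tensor per decoder layer with shape
    # (heads, decoded_tokens, encoder_frames), filled during beam search.
    cross_attentions: Optional[List[torch.Tensor]] = None


# Desired usage (hypothetical flag):
# hyps = asr_model.transcribe(["audio.wav"], return_cross_attentions=True)
# attn = hyps[0].cross_attentions  # e.g. feed to DTW for word-level timestamps
```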
Describe alternatives you've considered
An alternative might be an additional pass through the decoder at the end of generation solely to compute the attention, but this would add unnecessary computational cost.
Hi Marco, thanks for the feature request. We'll consider adding it in an upcoming release.
You should be able to access the cross-attention values here: https://github.com/NVIDIA-NeMo/NeMo/blob/0b1be8d1165f49ee2ef1e74f72f2ff07350f6798/nemo/collections/asr/modules/transformer/transformer_decoders.py#L288
Is this what you're looking for?
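If you just need to inspect those values externally for now, a forward hook is one quick way to do it. This is only a rough sketch: the submodule name filter is an assumption, so check `named_modules()` for the actual cross-attention module names in your NeMo version.

```python
# Sketch: capture decoder cross-attention activity during transcribe() with
# forward hooks. The "second_sub_layer" name is an assumption -- inspect
# asr_model.named_modules() to find the cross-attention submodules.
import torch

# asr_model: any attention-encoder-decoder NeMo ASR model, e.g. loaded with
# ASRModel.from_pretrained(...) or restore_from(...).

captured = {}

def make_hook(name):
    def hook(module, inputs, output):
        # Note: if the module's forward returns only the attended context,
        # the attention probabilities stay internal and the module would need
        # a small patch to also return or stash them.
        captured.setdefault(name, []).append(
            output.detach().cpu() if torch.is_tensor(output) else output
        )
    return hook

handles = [
    module.register_forward_hook(make_hook(name))
    for name, module in asr_model.named_modules()
    if "second_sub_layer" in name  # assumed cross-attention submodule name
]

# hyps = asr_model.transcribe(["audio.wav"])  # run decoding as usual

for h in handles:
    h.remove()
```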