
Encoder-decoder attention extraction in ASR transcribe

Open · mgaido91 opened this issue 4 months ago · 1 comment

Is your feature request related to a problem? Please describe.

There are many studies showing that the encoder-decoder attention can be used for auxiliary tasks (e.g. with DTW to get word-level timestamps, or for simultaneous/streaming translation, such as with the StreamAtt method). However, there is currently no way to obtain the encoder-decoder attention when running beam search. This kind of feature is available in other repositories (e.g. HF Transformers).
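For reference, this is roughly how the HF Transformers API mentioned above exposes it during generation. A minimal sketch; the Whisper checkpoint, beam size, and placeholder audio are only illustrative:

```python
import numpy as np
from transformers import WhisperForConditionalGeneration, WhisperProcessor

processor = WhisperProcessor.from_pretrained("openai/whisper-tiny")
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-tiny")

# Placeholder input: 1 second of silence at 16 kHz.
audio = np.zeros(16000, dtype=np.float32)
inputs = processor(audio, sampling_rate=16000, return_tensors="pt")

out = model.generate(
    inputs.input_features,
    num_beams=4,
    return_dict_in_generate=True,
    output_attentions=True,  # also populates decoder self- and cross-attentions
)

# out.cross_attentions: one entry per generation step; each entry is a tuple over
# decoder layers of tensors shaped roughly
# (batch * num_beams, num_heads, query_len, num_encoder_frames).
print(len(out.cross_attentions), out.cross_attentions[0][0].shape)
```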

Describe the solution you'd like

It would be great if the beam search generation also returned the encoder-decoder attention weights and included them in the returned Hypothesis.
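A minimal sketch of what that could look like; the class and field names below are hypothetical and not NeMo's actual Hypothesis:

```python
from dataclasses import dataclass, field
from typing import List, Optional

import torch


@dataclass
class HypothesisWithAttention:
    """Hypothetical extension of a beam-search hypothesis that carries attention."""
    score: float
    y_sequence: List[int]
    text: Optional[str] = None
    # One tensor per decoding step, e.g. shaped (num_layers, num_heads, num_encoder_frames),
    # so downstream tools (DTW timestamps, StreamAtt-style policies) can consume it directly.
    cross_attentions: List[torch.Tensor] = field(default_factory=list)
```

The decoder would append the per-step weights while it scores each hypothesis, so no extra forward pass would be needed.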

Describe alternatives you've considered

An alternative might be an additional pass through the decoder at the end of generation, only to compute the attention, but this would add unnecessary computational cost.
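For illustration, a toy, self-contained sketch of what such an extra pass would compute; the module, shapes, and random inputs are made up and not NeMo's:

```python
# After beam search has produced token IDs, re-run cross-attention once in
# teacher-forcing mode only to recover the weights.
import torch
import torch.nn as nn

d_model, n_heads, enc_len, dec_len = 256, 4, 120, 12
cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

encoder_out = torch.randn(1, enc_len, d_model)     # acoustic encoder states
decoder_states = torch.randn(1, dec_len, d_model)  # states for the final hypothesis tokens

# need_weights=True makes the module return the attention matrix alongside the
# output, which is exactly the information the feature request wants from beam
# search itself, without this second pass.
_, attn_weights = cross_attn(decoder_states, encoder_out, encoder_out,
                             need_weights=True, average_attn_weights=False)
print(attn_weights.shape)  # (1, n_heads, dec_len, enc_len)
```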


mgaido91 · Oct 21 '25

Hi Marco, thanks for the feature request. We'll consider adding it in an upcoming release.

nithinraok · Dec 09 '25

You should be able to access the cross-attention values here: https://github.com/NVIDIA-NeMo/NeMo/blob/0b1be8d1165f49ee2ef1e74f72f2ff07350f6798/nemo/collections/asr/modules/transformer/transformer_decoders.py#L288
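For example, one possible way to capture those values from the outside is with forward hooks. A rough sketch; the submodule and attribute names (`layers`, `second_sub_layer`, `attn_probs`) are assumptions about that code, not a confirmed API, so adjust them to the actual modules:

```python
import torch

captured = []

def make_hook(layer_idx):
    def hook(module, inputs, output):
        # If the module keeps its last attention probabilities on an attribute,
        # grab them here; otherwise the module needs a small patch to expose them.
        probs = getattr(module, "attn_probs", None)
        captured.append((layer_idx, probs.detach().cpu() if probs is not None else None))
    return hook

def attach_cross_attention_hooks(transformer_decoder):
    handles = []
    for i, layer in enumerate(transformer_decoder.layers):
        # Assumed name of the cross-attention sub-layer in each decoder layer.
        cross_attn = getattr(layer, "second_sub_layer", None)
        if cross_attn is not None:
            handles.append(cross_attn.register_forward_hook(make_hook(i)))
    return handles
```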

Is this what you are looking for?

nithinraok · Dec 11 '25

Hi @mgaido91,

You can find an example of this functionality in the streaming Canary decoding PR: link. We use the cross-attention outputs to guide the streaming decoding process for the AlignAtt policy. You can extract the cross-attention weights as is done here: link1, link2.
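For a rough idea of how those weights drive the policy, here is a hedged sketch of an AlignAtt-style emission check; the frame threshold and the averaging over layers/heads are illustrative choices, not the PR's exact code:

```python
import torch

def alignatt_should_emit(cross_attn: torch.Tensor, f: int = 8) -> bool:
    """cross_attn: (num_layers, num_heads, num_encoder_frames) attention of the
    candidate token. Returns False ("wait for more audio") when the most-attended
    frame falls inside the last `f` available frames."""
    avg = cross_attn.mean(dim=(0, 1))      # average over layers and heads
    most_attended_frame = int(avg.argmax())
    num_frames = avg.shape[0]
    return most_attended_frame < num_frames - f
```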

andrusenkoau · Dec 11 '25