transformers Returning n-best hypotheses from Wav2Vec2ProcessorWithLM decoder

Feature request

Currently, the Wav2Vec2ProcessorWithLM decode function returns only the best hypothesis. Shall we extend it to return the n-best hypotheses together with their logit_scores, lm_scores, and word_offsets, so that people can rescore these hypotheses with a larger LM?

For example, take a look at the NeMo article on rescoring n-best hypotheses.

Motivation

I suppose many people use n-gram models during the shallow fusion stage; n-gram models are a good fit for beam search because they are fast. A larger LM is too slow to use during decoding, so it makes sense to apply it afterwards, to rescore the n-best hypotheses that come out of the ASR system. The score from the ASR system is fused with the perplexity-like score from the LM. If this external model is trained on in-domain data, it can drastically improve the WER of the resulting system.
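To make the fusion step concrete, here is a minimal sketch of n-best rescoring in plain Python. It is only an illustration, not the transformers or NeMo implementation: the `Hypothesis` class, the `rescore_nbest` function, the `alpha`/`beta` weights, and the toy LM are all hypothetical names and values chosen for the example. It assumes each hypothesis carries a log-domain score from the ASR beam search and that the larger LM exposes a function returning a log-probability for a sentence.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Hypothesis:
    """One entry of the n-best list (hypothetical structure)."""
    text: str
    asr_score: float  # log-domain score from the ASR beam search


def rescore_nbest(
    hypotheses: List[Hypothesis],
    lm_log_prob: Callable[[str], float],
    alpha: float = 0.5,
    beta: float = 0.0,
) -> List[Hypothesis]:
    """Re-rank hypotheses by a linear fusion of ASR and LM scores.

    fused = asr_score + alpha * lm_log_prob(text) + beta * word_count
    alpha weights the external LM; beta is a word-count bonus or penalty.
    """
    def fused(h: Hypothesis) -> float:
        n_words = len(h.text.split())
        return h.asr_score + alpha * lm_log_prob(h.text) + beta * n_words

    # Highest fused score first.
    return sorted(hypotheses, key=fused, reverse=True)


def toy_lm(text: str) -> float:
    """Stand-in for a large LM: a lookup table of made-up log-probabilities."""
    table = {"ice cream": -2.0, "i scream": -6.0}
    return table.get(text, -10.0)


# Made-up n-best list in which the ASR system slightly prefers "i scream".
nbest = [Hypothesis("i scream", -1.0), Hypothesis("ice cream", -1.2)]
reranked = rescore_nbest(nbest, toy_lm, alpha=0.5)
print(reranked[0].text)
```

In practice `alpha` and `beta` would be tuned on a held-out set, and `toy_lm` would be replaced by a call into the actual rescoring model.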

Your contribution

If this sounds like a good feature that could potentially be adopted, let me know and I'll prepare the PR 😃

Mar 14 '23 08:03 vsokolovskii

cc @sanchit-gandhi

Mar 14 '23 13:03 sgugger

Please tell me your opinion on this feature :)

Mar 15 '23 18:03 vsokolovskii