Whisper: Decode with condition_on_previous_text=False
Feature request
Support Whisper speech recognition without conditioning on previous text, as in https://github.com/openai/whisper/blob/7858aa9c08d98f75575035ecd6481f462d66ca27/whisper/transcribe.py#L278
Motivation
The Whisper implementation is great; however, conditioning the decoding on previous text can cause significant hallucination and repetitive text, e.g.:
"Do you have malpractice? Do you have malpractice? Do you have malpractice? Do you have malpractice? Do you have malpractice? Do you have malpractice? Do you have malpractice? Do you have malpractice? Do you have malpractice? Do you have malpractice? Do you have malpractice? Do you have malpractice? Do you have malpractice? Do you have malpractice? Do you have malpractice? Do you have malpractice?"
Running OpenAI's model with --condition_on_previous_text False drastically reduces hallucination.
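For reference, a minimal example with the original openai-whisper package (the Python API equivalent of the CLI flag above; "audio.mp3" is just a placeholder file):

```python
import whisper

# Any of the official checkpoints works here, e.g. "base" or "large-v2".
model = whisper.load_model("base")

# condition_on_previous_text=False stops the decoder from being primed with the
# previous window's output, which is what triggers the repetition loops above.
result = model.transcribe("audio.mp3", condition_on_previous_text=False)
print(result["text"])
```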
@ArthurZucker
cc @sanchit-gandhi we could add this as a generation config argument, and in prepare_inputs_for_generation we can just remove all the input_ids if asked. WDYT?
My question is more about the usage/quality tradeoff, but it doesn't seem like something hard to maintain.
Quality is much better without conditioning on previous text: https://github.com/openai/whisper/discussions/679#discussioncomment-4449150. Similarly, whisperx requires this because there's just too much hallucination otherwise.
just remove all the input_ids if asked
Yes, I'm trying this. It's fairly straightforward, but not when batch_size > 1, since each sample in the batch resets at a different index (when there is a pair of consecutive timestamps). A lot of the methods are nested quite deep, so it's taking me a while to sift through, but it seems like the best approach, given this variable-length prompt per sample, would be to supply an attention mask to the decoder?
Or just pad according to the variable length
I think the attention mask is the best way to get what you want, indeed. Padding can also work, as it should create the attention mask for the padding and pass it to the network. I think it makes sense to add this; we just have to keep it to Whisper, so either in the modeling file or a new logit processor 😉 I won't have time to do this for at least a week, so do you want to open a PR and ping me for pointers and reviews? 🤗
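For illustration, here is a rough sketch of the padding idea (hypothetical token ids, not the PR's actual code): left-pad each sample's variable-length prompt and build a matching decoder attention mask so the padded positions are ignored.

```python
import torch

# Hypothetical per-sample prompt token ids of different lengths, e.g. everything
# decoded since the last pair of consecutive timestamps in each sample.
prompts = [[101, 102, 103], [101, 104], [101]]
pad_id = 0  # placeholder pad token id

max_len = max(len(p) for p in prompts)
decoder_input_ids = torch.full((len(prompts), max_len), pad_id, dtype=torch.long)
decoder_attention_mask = torch.zeros((len(prompts), max_len), dtype=torch.long)

for i, p in enumerate(prompts):
    # Left-pad so the real tokens sit at the end, right before generation continues.
    decoder_input_ids[i, max_len - len(p):] = torch.tensor(p)
    decoder_attention_mask[i, max_len - len(p):] = 1

# decoder_input_ids and decoder_attention_mask can then be passed to the decoder
# so the padded positions are masked out instead of attended to.
```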
Hacked attempt here, seems to work on my end -- can now run very fast whisper without hallucination :') https://github.com/huggingface/transformers/pull/21491/commits/cf2ad49fae43e8355655c5392d4dca0bdd1a733e
Super cool feature! Thanks for the PR @m-bain! Reviewed directly there!
Hi there, I was looking into this issue in some detail and I'm not sure this is relevant for the 🤗 Transformers implementation of Whisper, since it never actually conditions on the previous text.
It's true that the OpenAI implementation does this, but the Transformers ASR pipeline treats all 30-second chunks of audio independently. It never passes in the previous context when doing the predictions.
The chunks do overlap partially, so they do have some duplicate tokens, but this overlap is pretty small — not nearly as large as the context provided by the OpenAI implementation. And even if condition_on_previous_text = False in the OpenAI code, they still add the output from the previous chunk to the context, which is actually longer than the small bit of overlap used by our pipeline.
In any case, I will look a bit closer at your PR in the coming days to see exactly what it does. Perhaps it is still an improvement that we could use. 😃
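For context, the chunked behaviour described above corresponds to the pipeline's chunking arguments, roughly like this (the stride values here are just an example):

```python
from transformers import pipeline

asr = pipeline(
    "automatic-speech-recognition",
    model="openai/whisper-small",
    chunk_length_s=30,       # each 30-second chunk is decoded independently
    stride_length_s=(5, 5),  # small overlap on each side, merged afterwards
)
print(asr("audio.mp3")["text"])
```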
So this would mean we would support conditioning on previous text by adding sequential processing in the PR 😄
Great point @hollance, shall we keep this open if we have sequential processing on the roadmap?
Leaving this open as it could be relevant for https://github.com/huggingface/transformers/issues/23231#issuecomment-1545684559
Doing this only makes sense if we decide to support a sequential pipeline, and I think we weren't really in favor of this?
Right now, there is no conditioning on previous text going on (except for explicit prompting, once that PR is merged, which you have to enable manually by passing in prompt tokens).
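Once that PR lands, the manual prompting would look roughly like this (a sketch assuming the API is WhisperProcessor.get_prompt_ids plus a prompt_ids argument to generate()):

```python
import torch
from transformers import WhisperProcessor, WhisperForConditionalGeneration

processor = WhisperProcessor.from_pretrained("openai/whisper-small")
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small")

# Dummy log-mel features standing in for
# processor(audio, sampling_rate=16000, return_tensors="pt").input_features of a real clip.
input_features = torch.randn(1, 80, 3000)

# The prompt has to be passed in explicitly; nothing is carried over between chunks.
prompt_ids = processor.get_prompt_ids("medical terminology, malpractice", return_tensors="pt")
predicted_ids = model.generate(input_features, prompt_ids=prompt_ids)
print(processor.batch_decode(predicted_ids, skip_special_tokens=True))
```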
I don't think so, no. Running the sequential pipeline is just the same as the original repo, so I struggle to see what the appeal to the user is here vs our batched method (feels like a step backwards).
Let's close this one then?
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.
@hollance if all of the chunks are processed independently as you say, why does it happen quite often that the model starts to repeat itself after short segments are processed? E.g. the actual audio is "and who"
@vsokolovskii Do you have a code snippet and audio file that can reproduce this problem?
Leaving this one closed unless we identify a use case that suggests a need for the sequential pipeline. Note that adding a stride between chunks should alleviate any mismatch between them (you can read more about this here).
I stumbled upon this thread after benchmarking insanely-fast-whisper, seamless-m4t-v2, faster-whisper, and the Hugging Face implementation of Whisper based on the transformers pipeline with BetterTransformer. I found a bug related to this thread, or rather to this parameter.
I was able to reproduce the paper results with the FLEURS dataset; however, I went a step further and wanted to benchmark longer texts. So I concatenated the FLEURS files and transcriptions accordingly, deleted any duplicate sentences, and tested 30 s, 1 min, 5 min, and 30 min chunks on the models. I used English.
Long story short: in terms of scoring, faster-whisper was the only model that was able to reproduce the scores from the original FLEURS dataset; every other model degraded.
Insanely Fast Whisper and HF Whisper had a WER of 0.14 (0.7 on the original FLEURS), and seamless-m4t-v2 completely broke down to a WER of 0.5. Why?
Because faster-whisper was the only model that supported condition_on_previous_text=False as a parameter. I think it would be very beneficial to give users the option to use this, especially for the new seamless-m4t-v2 model, because even with 30-60 s chunks the WER is already at 0.5 for English.
I know that one reason for this is that these sentences are concatenated without sharing any semantic background, but models shouldn't be falling apart that severely because of an implementation decision.
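For comparison, this is roughly how the flag is exposed in faster-whisper, which is why it was the only model in the benchmark above that held up on the concatenated files:

```python
from faster_whisper import WhisperModel

model = WhisperModel("large-v2")

# Each window is decoded without being primed with the previous window's text,
# so a single bad segment cannot drag the rest of the file into a repetition loop.
segments, info = model.transcribe("audio.mp3", condition_on_previous_text=False)
print(" ".join(segment.text.strip() for segment in segments))
```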