Whisper: Decode with condition_on_previous_text=False
Feature request
Support Whisper speech recognition without conditioning on previous text, as in https://github.com/openai/whisper/blob/7858aa9c08d98f75575035ecd6481f462d66ca27/whisper/transcribe.py#L278
Motivation
The Whisper implementation is great; however, conditioning the decoding on previous text can cause significant hallucination and repetitive text, e.g.:
"Do you have malpractice? Do you have malpractice? Do you have malpractice? Do you have malpractice? Do you have malpractice? Do you have malpractice? Do you have malpractice? Do you have malpractice? Do you have malpractice? Do you have malpractice? Do you have malpractice? Do you have malpractice? Do you have malpractice? Do you have malpractice? Do you have malpractice? Do you have malpractice?"
Running OpenAI's model with --condition_on_previous_text False drastically reduces hallucination.
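For reference, a minimal example with the original openai-whisper package (the Python API equivalent of the CLI flag above; "audio.mp3" is just a placeholder file):

```python
import whisper

# Any of the official checkpoints works here, e.g. "base" or "large-v2".
model = whisper.load_model("base")

# condition_on_previous_text=False stops the decoder from being primed with the
# previous window's output, which is what triggers the repetition loops above.
result = model.transcribe("audio.mp3", condition_on_previous_text=False)
print(result["text"])
```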
@ArthurZucker
cc @sanchit-gandhi we could add this as a generation config argument, and in prepare_inputs_for_generation we can just remove all the input_ids if asked. WDYT?
My question is more about the usage/quality tradeoff, but it doesn't seem like something hard to maintain.
Quality is much better without conditioning on previous text: https://github.com/openai/whisper/discussions/679#discussioncomment-4449150. Similarly, whisperx requires this because there's just too much hallucination otherwise.
just remove all the input_ids if asked
Yes, I'm trying this. It's fairly straightforward, but not when batch_size > 1, since each sample in the batch resets at a different index (when there is a pair of consecutive timestamps). A lot of the methods are nested quite deep, so it's taking me a while to sift through, but it seems like the best approach, given this variable-length prompt per sample, would be to supply an attention mask to the decoder?
Or just pad according to the variable length
I think the attention mask is the best way to get what you want, indeed. Padding can also work, as it should create the attention mask for the padding and pass it to the network. I think it makes sense to add this; we just have to keep it to Whisper, so either in the modeling file or a new logit processor 😉 I won't have time to do this for at least a week, so do you want to open a PR and ping me for pointers and reviews? 🤗
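For illustration, here is a rough sketch of the padding idea (hypothetical token ids, not the PR's actual code): left-pad each sample's variable-length prompt and build a matching decoder attention mask so the padded positions are ignored.

```python
import torch

# Hypothetical per-sample prompt token ids of different lengths, e.g. everything
# decoded since the last pair of consecutive timestamps in each sample.
prompts = [[101, 102, 103], [101, 104], [101]]
pad_id = 0  # placeholder pad token id

max_len = max(len(p) for p in prompts)
decoder_input_ids = torch.full((len(prompts), max_len), pad_id, dtype=torch.long)
decoder_attention_mask = torch.zeros((len(prompts), max_len), dtype=torch.long)

for i, p in enumerate(prompts):
    # Left-pad so the real tokens sit at the end, right before generation continues.
    decoder_input_ids[i, max_len - len(p):] = torch.tensor(p)
    decoder_attention_mask[i, max_len - len(p):] = 1

# decoder_input_ids and decoder_attention_mask can then be passed to the decoder
# so the padded positions are masked out instead of attended to.
```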
Hacked attempt here, seems to work on my end -- can now run very fast whisper without hallucination :') https://github.com/huggingface/transformers/pull/21491/commits/cf2ad49fae43e8355655c5392d4dca0bdd1a733e
Super cool feature! Thanks for the PR @m-bain! Reviewed directly there!
Hi there, I was looking into this issue in some detail and I'm not sure this is relevant for the 🤗 Transformers implementation of Whisper, since it never actually conditions on the previous text.
It's true that the OpenAI implementation does this, but the Transformers ASR pipeline treats all 30-second chunks of audio independently. It never passes in the previous context when doing the predictions.
The chunks do overlap partially, so they do have some duplicate tokens, but this overlap is pretty small — not nearly as large as the context provided by the OpenAI implementation. And even if condition_on_previous_text = False in the OpenAI code, they still add the output from the previous chunk to the context, which is actually longer than the small bit of overlap used by our pipeline.
In any case, I will look a bit closer at your PR in the coming days to see exactly what it does. Perhaps it is still an improvement that we could use. 😃
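For context, the chunked behaviour described above corresponds to the pipeline's chunking arguments, roughly like this (the stride values here are just an example):

```python
from transformers import pipeline

asr = pipeline(
    "automatic-speech-recognition",
    model="openai/whisper-small",
    chunk_length_s=30,       # each 30-second chunk is decoded independently
    stride_length_s=(5, 5),  # small overlap on each side, merged afterwards
)
print(asr("audio.mp3")["text"])
```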
So this would mean we would support conditioning on previous text by adding sequential processing in the PR 😄
Great point @hollance, shall we keep this open if we have sequential processing on the roadmap?
Leaving this open as it could be relevant for https://github.com/huggingface/transformers/issues/23231#issuecomment-1545684559
Doing this only makes sense if we decide to support a sequential pipeline, and I think we weren't really in favor of this?
Right now, there is no conditioning on previous text going on (except for explicit prompting, once that PR is merged, which you have to enable manually by passing in prompt tokens).
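Once that PR lands, the manual prompting would look roughly like this (a sketch assuming the API is WhisperProcessor.get_prompt_ids plus a prompt_ids argument to generate()):

```python
import torch
from transformers import WhisperProcessor, WhisperForConditionalGeneration

processor = WhisperProcessor.from_pretrained("openai/whisper-small")
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small")

# Dummy log-mel features standing in for
# processor(audio, sampling_rate=16000, return_tensors="pt").input_features of a real clip.
input_features = torch.randn(1, 80, 3000)

# The prompt has to be passed in explicitly; nothing is carried over between chunks.
prompt_ids = processor.get_prompt_ids("medical terminology, malpractice", return_tensors="pt")
predicted_ids = model.generate(input_features, prompt_ids=prompt_ids)
print(processor.batch_decode(predicted_ids, skip_special_tokens=True))
```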
I don't think so, no. Running the sequential pipeline is just the same as the original repo, so I struggle to see what the appeal to the user is here vs our batched method (feels like a step backwards).
Let's close this one then?
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.
@hollance if all of the chunks are processed independently as you say, why does it happen quite often that the model starts to repeat itself after short segments are processed? E.g. the actual audio is "and who"
@vsokolovskii Do you have a code snippet and audio file that can reproduce this problem?
Leaving this one closed unless we identify a use case that suggests a need for the sequential pipeline. Note that adding a stride between chunks should alleviate any mismatch between them (you can read more about this here).
I stumbled upon this thread after benchmarking insanely-fast-whisper, seamless-m4t-v2, faster-whisper, and the Hugging Face implementation of Whisper based on the transformers pipeline with BetterTransformer. I found a bug related to this thread, or rather to this parameter.
I was able to reproduce the paper results with the FLEURS dataset; however, I went a step further and wanted to benchmark longer texts. So I concatenated the FLEURS files and transcriptions accordingly, deleted any duplicate sentences, and tested 30 s, 1 min, 5 min, and 30 min chunks on the models. I used English.
Long story short: in terms of scoring, faster-whisper was the only model that was able to reproduce the scores from the original FLEURS dataset; every other model degraded.
Insanely Fast Whisper and HF Whisper had a WER of 0.14 (0.7 on the original FLEURS), and seamless-m4t-v2 completely broke down to a WER of 0.5. Why?
Because faster-whisper was the only model that supported condition_on_previous_text=False as a parameter. I think it would be very beneficial to give users the option to use this, especially for the new seamless-m4t-v2 model, because even with 30-60 s chunks the WER is already at 0.5 for English.
I know that one reason for this is that these sentences are concatenated without sharing any semantic background, but models shouldn't be falling apart that severely because of an implementation decision.
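For comparison, this is roughly how the flag is exposed in faster-whisper, which is why it was the only model in the benchmark above that held up on the concatenated files:

```python
from faster_whisper import WhisperModel

model = WhisperModel("large-v2")

# Each window is decoded without being primed with the previous window's text,
# so a single bad segment cannot drag the rest of the file into a repetition loop.
segments, info = model.transcribe("audio.mp3", condition_on_previous_text=False)
print(" ".join(segment.text.strip() for segment in segments))
```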