[WIP] - Using assistant in AutomaticSpeechRecognitionPipeline with different encoder size
This PR aims at fixing issue #30611:
-
First: an error will be thrown if the assistant and main models encoders don't have the same size, and the assistant is loaded using
AutoModelForCausalLM. -
Second: This PR makes the pipeline work when using an assistant with a different encoder size (loaded with
AutoModelForCausalLM) than the main model:
When using AutomaticSpeechRecognitionPipeline, If we use an assistant with a different encoder size then the main model , the pipeline is broken and we get the following error message:
ValueError: Whisper expects the mel input features to be of length 3000, but found 1500. Make sure to pad the input mel features to 3000.
Explanation of the solution
When doing short form generation with the pipeline, input_features aren't passed to the generate method, which instead takes the output of the main model's encoder.
If the main model and the assistant don't share the same encoder, the encoder_output passed to generate cannot be used by the assistant for generation, and we get an error.
The solution here is to also pass the input_features to the generate method to be used by the assistant.
Who can review?
@sanchit-gandhi
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
Is there a test that confirms correctness after the fix? There's likely a relevant slow pipeline test that was either failing, or was not rigorous enough
Please have a look at #30726 for an alternative fix -- IMO, the root source of problems is the ASR pipeline doing a redundant operation, and not in generate :)
I hope you don't mind me crashing into the issue 🙌 (I only noticed this PR after opening #30726, when trying to link all related issues)
I think this PR is ready to be merged!
cc @amyeroberts @gante if you want to have a look ;)
Thanks @gante for the review :) @amyeroberts could you please merge this PR ?
@kamilakesbi It still needs a final core maintainer review and approval before merge (+ resolution of conflicts) :). I'll review now
hi @amyeroberts, I've added a slow test which pass :) I think quality checks fails are unrelated to this PR, and will be fixed by #30932.
Quick checklist before merge:
- [ ] Resolve all comment threads that have been addressed
- [ ] Fix the merge conflict in
automatic_speech_recognition.py - [ ] Rebase onto main to get the style fixes from #30932 (once the PR is merged)
- [ ] Ping me to get this merged!
Nice work @kamilakesbi!