transformers [WIP] - Using assistant in AutomaticSpeechRecognitionPipeline with different encoder size

This PR aims at fixing issue #30611:

First: an error will be thrown if the assistant and main models encoders don't have the same size, and the assistant is loaded using AutoModelForCausalLM.
Second: This PR makes the pipeline work when using an assistant with a different encoder size (loaded with AutoModelForCausalLM) than the main model:

When using AutomaticSpeechRecognitionPipeline, If we use an assistant with a different encoder size then the main model , the pipeline is broken and we get the following error message:

ValueError: Whisper expects the mel input features to be of length 3000, but found 1500. Make sure to pad the input mel features to 3000.

Explanation of the solution

When doing short form generation with the pipeline, input_features aren't passed to the generate method, which instead takes the output of the main model's encoder.

If the main model and the assistant don't share the same encoder, the encoder_output passed to generate cannot be used by the assistant for generation, and we get an error.

The solution here is to also pass the input_features to the generate method to be used by the assistant.

Who can review?

@sanchit-gandhi

May 03 '24 14:05 kamilakesbi

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

May 03 '24 15:05 HuggingFaceDocBuilderDev

Is there a test that confirms correctness after the fix? There's likely a relevant slow pipeline test that was either failing, or was not rigorous enough

May 08 '24 10:05 sanchit-gandhi

Please have a look at #30726 for an alternative fix -- IMO, the root source of problems is the ASR pipeline doing a redundant operation, and not in generate :)

I hope you don't mind me crashing into the issue 🙌 (I only noticed this PR after opening #30726, when trying to link all related issues)

May 09 '24 12:05 gante

I think this PR is ready to be merged!

cc @amyeroberts @gante if you want to have a look ;)

May 14 '24 09:05 kamilakesbi

Thanks @gante for the review :) @amyeroberts could you please merge this PR ?

May 20 '24 12:05 kamilakesbi

@kamilakesbi It still needs a final core maintainer review and approval before merge (+ resolution of conflicts) :). I'll review now

May 20 '24 16:05 amyeroberts

hi @amyeroberts, I've added a slow test which pass :) I think quality checks fails are unrelated to this PR, and will be fixed by #30932.

May 21 '24 15:05 kamilakesbi

Quick checklist before merge:

[ ] Resolve all comment threads that have been addressed
[ ] Fix the merge conflict in automatic_speech_recognition.py
[ ] Rebase onto main to get the style fixes from #30932 (once the PR is merged)
[ ] Ping me to get this merged!

May 21 '24 16:05 sanchit-gandhi

Nice work @kamilakesbi!

May 23 '24 08:05 sanchit-gandhi