Add TensorFlow Whisper model for audio classification
What does this PR do?
Adds support for audio classification to the TensorFlow Whisper model.
Fixes #21777
Before submitting
- [ ] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
- [X] Did you read the contributor guideline, Pull Request section?
- [X] Was this discussed/approved via a GitHub issue or the forum? Please add a link to it if that's the case.
- [ ] Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
- [ ] Did you write any new necessary tests?
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR.
@sanchit-gandhi
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint.
I just had a few questions on how to proceed with adding the TensorFlow Whisper model, just to make sure I'm on the right track.
(1) Just so that I am clear on what the task is asking for, I need to recreate what is being done in PR #21754, except in TensorFlow. So, more specifically recreate the WhisperForAudioClassification class in TensorFlow, within the modeling_tf_whisper.py file.
(2) I see that there are a lot of additional lines of code within PR #21754 in various files that seem to be "registering" that the Whisper model now supports audio classification. Would I have to add any lines of code similar to this within my PR? Is there any documentation I can take a look at to learn more about this? (or anything that would help me understand more about this task in general)
@sanchit-gandhi
Hi @adit299 Thanks for opening this PR - excited to have this implemented in TF!
Regarding your questions:
- Yes, exactly.
- Yes, the other (equivalent TF) additions will also need to be added. Some of the additions in #21754 are automatically generated e.g. those in
dummy_pt_objects.py. There's an in-depth guide to adding TensorFlow models here which should cover the process. Let us know if there's anything missing or unclear.
Super cool @adit299! Feel free to ping us if you have any more questions / queries! More than happy to help with the integration here!
Hello,
Just wanted to check in and provide an update. I have finished adding the TFWhisperForAudioClassification class within the modeling_tf_whisper.py file. One question regarding this:
(1) Within the modeling_tf_auto.py file I don't see any OrderedDict named TF_MODEL_FOR_AUDIO_CLASSIFICATION_MAPPING_NAMES (or any OrderedDict that is equivalent to the MODEL_FOR_AUDIO_CLASSIFICATION_MAPPING_NAMES present within the modeling_auto.py file). I was wondering where the TFWhisperForAudioClassification class should go within the modeling_tf_auto.py file.
I will continue work on developing the model tester, and will post any issues I run into here.
@sanchit-gandhi
@adit299 - that's great news on the update!
For the auto mapping, if the tensorflow equivalent TF_MODEL_FOR_XXX doesn't exist, then it can be added to modeling_tf_auto.py. This means this is the first audio classification model to be added for TensorFlow 🔥🔥🔥
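For reference, the PyTorch mapping entries in modeling_auto.py are plain OrderedDicts from model type to class-name strings, so the new TF mapping would presumably look much the same. A framework-free sketch (the OrderedDict name follows the convention discussed above; treat the exact placement as an assumption):

```python
from collections import OrderedDict

# Hypothetical sketch of the new mapping to add to modeling_tf_auto.py,
# mirroring MODEL_FOR_AUDIO_CLASSIFICATION_MAPPING_NAMES in modeling_auto.py.
# Entries are strings; the auto machinery resolves them to classes lazily.
TF_MODEL_FOR_AUDIO_CLASSIFICATION_MAPPING_NAMES = OrderedDict(
    [
        ("whisper", "TFWhisperForAudioClassification"),
    ]
)

print(TF_MODEL_FOR_AUDIO_CLASSIFICATION_MAPPING_NAMES["whisper"])
```

A matching `TF_MODEL_FOR_AUDIO_CLASSIFICATION_MAPPING` and `TFAutoModelForAudioClassification` class would then be built from this dict, following the pattern of the existing TF auto classes.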
Recently, we merged TensorFlow Wav2Vec2 For Sequence Classification: https://github.com/huggingface/transformers/pull/22073
You could propagate the modelling code changes from this PR onto Whisper as a quick way of getting this working @adit299 (as we do for the PyTorch code)
By propagate, do you mean just looking at that PR and using the code written for that task as help for this current task? If so, I have already been doing that. If you are referring to some other procedure please do let me know about this as I am not aware. That would certainly help!
Questions I had:
(1) I noticed that the PyTorch implementation of the Whisper tests refers to a class GenerationTesterMixin which does not seem to have a similarly named TensorFlow equivalent. Would I have to add this class? I am also confused about what these classes are doing (e.g. what is TFModelTesterMixin doing?), so any clarification you can provide is appreciated!
https://github.com/huggingface/transformers/blob/d204aea7314217fa8b47e7418ead0d9973f50ccd/tests/models/whisper/test_modeling_tf_whisper.py#L926
(2) I was having trouble with translating the test_encoder_outputs method in TensorFlow. Mainly these lines:
https://github.com/huggingface/transformers/blob/d204aea7314217fa8b47e7418ead0d9973f50ccd/tests/models/whisper/test_modeling_tf_whisper.py#L963-L966
Again, I am a bit confused about what model.to(torch_device) is doing. I will look into this a bit more, but any clarification about what this method does would help.
Thanks again for the speedy responses! @sanchit-gandhi @amyeroberts
@adit299 By propagate, we mean apply the equivalent changes from the Wav2Vec2 PR to this PR - it won't be a direct copy-paste, but there will be large proportions in common. It sounds like this is what you're doing, which is great :)
With respect to your questions:
- GenerationTesterMixin
Yes, I don't think this class exists yet and you wouldn't have to add this class as part of this PR. Is there anything that should be added for the TF model tests @gante ?
In terms of what these classes are doing, the mixin classes group together related functionality e.g. common tests that should be added to all models. For example, TFModelTesterMixin contains tests for the TensorFlow models. This way we can create other classes using a composition of mixins.
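The composition idea above can be shown with a tiny framework-free sketch (the class names here are illustrative stand-ins, not the real transformers test classes):

```python
# Each mixin contributes shared test methods; a concrete test class
# picks them all up via multiple inheritance, so common tests are
# written once and reused across every model's test suite.
class CommonTestsMixin:
    def test_config(self):
        return "ran common config test"

class AudioTestsMixin:
    def test_feature_extractor(self):
        return "ran audio test"

class WhisperModelTestSketch(CommonTestsMixin, AudioTestsMixin):
    """Inherits every test_* method from both mixins."""
    pass

t = WhisperModelTestSketch()
print(t.test_config())             # from CommonTestsMixin
print(t.test_feature_extractor())  # from AudioTestsMixin
```

In the real repo, TFModelTesterMixin plays the role of CommonTestsMixin: each model's TF test class mixes it in and customises behaviour via attributes like all_model_classes.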
- .to and .eval methods
model.to(...) is a PyTorch-specific method. See docs here: https://pytorch.org/docs/stable/generated/torch.nn.Module.html#torch.nn.Module.to. It moves the model onto the specified torch device. model.eval() is also a PyTorch method: https://pytorch.org/docs/stable/generated/torch.nn.Module.html#torch.nn.Module.eval.
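A minimal sketch of what those two PyTorch calls do (TF models need neither: Keras handles device placement automatically, and there is no global train/eval flag on the model object):

```python
import torch

# Toy module standing in for a full model.
model = torch.nn.Linear(4, 2)

model.to("cpu")  # moves all parameters and buffers to the given device
model.eval()     # switches modules like Dropout/BatchNorm to inference behaviour

print(model.training)  # False after .eval()
```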
@amyeroberts there is no generation-specific test mixin for TF. TFModelTesterMixin has some basic generate checks :)
Looks cool already @adit299! Let us know if you need a hand with the integration or when you'd like a PR review 🤗
Thanks for the follow-up @sanchit-gandhi. Currently, I am debugging some of the test failures that I am getting. I also see that 7 more tests within TFModelTesterMixin are failing, but I thought I would resolve the tests failing within the TFWhisperEncoderModelTest class first before moving on to that.
This is the error occurring when test_encoder_outputs is run:
self = <tests.models.whisper.test_modeling_tf_whisper.TFWhisperEncoderModelTest testMethod=test_encoder_outputs>

    def test_encoder_outputs(self):
        config, inputs_dict = self.model_tester.prepare_config_and_inputs_for_common()
        for model_class in self.all_model_classes:
            model = model_class(config)
            inputs = copy.deepcopy(self._prepare_for_class(inputs_dict, model_class))
>           with tf.stop_gradient:
E           AttributeError: __enter__

tests/models/whisper/test_modeling_tf_whisper.py:975: AttributeError
I believe this error is occurring because TensorFlow's stop_gradient has no __enter__ method defined (https://stackoverflow.com/questions/51427729/python-error-attributeerror-enter). I used it because I figured it was the closest equivalent to torch.no_grad, which the PyTorch implementation uses. If you could tell me a bit more about what this method is testing and how it works, I think I will be able to solve the error.
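The failure is indeed a context-manager problem: tf.stop_gradient is a plain function, and a `with` statement requires an object implementing __enter__/__exit__. A framework-free illustration of the same failure mode (the stand-in function below is not the real TF API):

```python
# Stand-in for tf.stop_gradient: a function, not a context manager.
def stop_gradient(x):
    return x

caught = False
try:
    with stop_gradient:  # `with` needs __enter__/__exit__, which functions lack
        pass
except (AttributeError, TypeError):  # AttributeError on Py<=3.10, TypeError on 3.11+
    caught = True

print(caught)  # True
```

In TF the equivalent of "run without autograd" is simply executing the forward pass outside a tf.GradientTape (gradients are only recorded inside a tape), so no torch.no_grad analogue is needed; tf.stop_gradient(tensor) is instead applied to a tensor inside a tape to block gradient flow through it.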
On a sidenote, I also see the methods freeze_encoder, get_input_embeddings, and set_input_embeddings within the PyTorch implementation. Would I have to implement these as well? What are these methods doing? @amyeroberts
@adit299 Yes, these methods should also be implemented for the TF model. You can look at similar TF implementations to see how this was done e.g. here for freezing a module.
I would say probably we don't need freezing since this is only relevant for fine-tuning, and we don't have a seq2seq ASR fine-tuning script in TF (related https://github.com/huggingface/transformers/pull/22109#discussion_r1194040076)
Hey @adit299 - feel free to comment here when this PR is ready for review and we can take a look! Seems to be close to completion
Hey @sanchit-gandhi, apologies for the delay! Yes, this PR is ready for review. I haven't had much luck in getting some tests to pass however. I appreciate any help you guys can provide by taking a look.
@adit299 Unfortunately, diving into people's PRs to debug isn't something we can do as it's just not a scalable solution with a repo of this size. If you need help from us, then please share a detailed description of the issue, what you've tried already and ideally highlighting any relevant pieces of code.
Understandable, @amyeroberts. There are 5 tests failing right now. Here is all the information requested (to the best of my knowledge):
FAILED test_modeling_tf_whisper.py::TFWhisperEncoderModelTest::test_compile_tf_model
Error -
E TypeError: Exception encountered when calling layer 'tf_whisper_for_audio_classification_4' (type TFWhisperForAudioClassification).
E
E call() got an unexpected keyword argument 'decoder_input_ids'
E
E Call arguments received by layer 'tf_whisper_for_audio_classification_4' (type TFWhisperForAudioClassification):
E • input_features={'input_features': 'tf.Tensor(shape=(2, 80, 59), dtype=float32)', 'decoder_input_ids': 'tf.Tensor(shape=(1, 2), dtype=int32)'}
E • head_mask=None
E • encoder_outputs=None
E • labels=None
E • output_attentions=None
E • output_hidden_states=None
E • return_dict=None
../../../src/transformers/modeling_tf_utils.py:434: TypeError
What I tried -
I suspected it had something to do with:
https://github.com/adit299/transformers/blob/3d3c7d4213e08d69254edb9c04ac28b3dfbd40f4/tests/test_modeling_tf_common.py#L739C4-L819
But that doesn't seem to be the case. Maybe the Whisper decoder is being mistakenly invoked? I am just not sure.
FAILED test_modeling_tf_whisper.py::TFWhisperEncoderModelTest::test_hidden_states_output - AssertionError: Lists differ: [30, 16] != [60, 16]
Error -
../../test_modeling_tf_common.py:1002: in check_hidden_states_output
self.assertListEqual(
E AssertionError: Lists differ: [30, 16] != [60, 16]
E
E First differing element 0:
E 30
E 60
E
E - [30, 16]
E ? ^
E
E + [60, 16]
E ? ^
The assertion failing is:
self.assertListEqual(
list(hidden_states[0].shape[-2:]),
[self.model_tester.seq_length, self.model_tester.hidden_size],
)
What I tried - Not sure about this one.
FAILED test_modeling_tf_whisper.py::TFWhisperEncoderModelTest::test_pt_tf_model_equivalence - AttributeError: tf_whisper_encoder_17.conv1.weight not found in PyTorch model
Error -
E AttributeError: tf_whisper_encoder_17.conv1.weight not found in PyTorch model
../../../src/transformers/modeling_tf_pytorch_utils.py:322: AttributeError
What I tried - Not sure about this one as well
FAILED test_modeling_tf_whisper.py::TFWhisperEncoderModelTest::test_resize_token_embeddings - NotImplementedError
Error -
../../../src/transformers/modeling_tf_utils.py:1343: NotImplementedError
What I tried - I think this one is out of my control
FAILED test_modeling_tf_whisper.py::TFWhisperEncoderModelTest::test_save_load - TypeError: Exception encountered when calling layer 'tf_whisper_for_audio_classification_20' (type TFWhisperForAudioClassification
What I tried - connected to the first error, solving that should solve this.
Please do let me know if any other clarification is needed! Apologies for the long post!
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.
Hi @adit299, thanks for giving more details about debugging the tests and apologies for the delay in my response.
I suggest looking through the artefacts from the CI run, specifically failure_long.txt, as they will give you a more detailed error message and traceback to help figure out the issues.
test_modeling_tf_whisper.py::TFWhisperEncoderModelTest::test_compile_tf_model I think your suspicions are correct. You'll need to add a new branch in the if/else logic to create the correct inputs for this model.
test_modeling_tf_whisper.py::TFWhisperEncoderModelTest::test_hidden_states_output In this case it seems the sequence length of the hidden size doesn't match what's expected. I would create a model using the test config and check its architecture and the hidden states outputs when passed a dummy input.
test_modeling_tf_whisper.py::TFWhisperEncoderModelTest::test_pt_tf_model_equivalence
It looks like a weight is in the TF model and not in the PT model. I'd check the params in each model - looking at tf_model.trainable_variables and pt_model.state_dict() to see whether this is a case of a weight not being loaded, or a name not properly matched.
If you create the TF whisper model with pytorch weights, do you get any warnings about weights being randomly initialized?
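The kind of cross-framework name diff suggested above can be sketched without either framework (the names and the normalisation rule below are illustrative assumptions; real TF/PT name mapping has more cases, e.g. transposed kernels):

```python
# Toy parameter-name sets, e.g. from [w.name for w in tf_model.trainable_variables]
# and pt_model.state_dict().keys() respectively.
tf_names = {"encoder/conv1/kernel", "encoder/conv1/bias"}
pt_names = {"encoder.conv1.weight", "encoder.conv1.bias"}

def normalize(name):
    # TF uses "/" separators and "kernel"; PyTorch uses "." and "weight".
    return name.replace("/", ".").replace("kernel", "weight")

missing_in_pt = {normalize(n) for n in tf_names} - pt_names
print(sorted(missing_in_pt))  # [] when every TF weight has a PT counterpart
```

A non-empty result points at exactly the sort of mismatch behind the "conv1.weight not found in PyTorch model" error.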
test_modeling_tf_whisper.py::TFWhisperEncoderModelTest::test_resize_token_embeddings - NotImplementedError
This is raised because the model doesn't have a get_input_embeddings method implemented
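The inheritance pattern behind that NotImplementedError can be sketched framework-free (class names below are illustrative, not the real transformers classes):

```python
class TFPreTrainedModelSketch:
    # Mirrors the base-class behaviour the failing resize test hits:
    # without an override, asking for the embeddings raises.
    def get_input_embeddings(self):
        raise NotImplementedError

class EncoderModelSketch(TFPreTrainedModelSketch):
    def __init__(self):
        self._embed = object()  # stand-in for a keras Embedding layer
    def get_input_embeddings(self):
        return self._embed
    def set_input_embeddings(self, value):
        self._embed = value

base_fails = False
try:
    TFPreTrainedModelSketch().get_input_embeddings()
except NotImplementedError:
    base_fails = True

model = EncoderModelSketch()
print(base_fails, model.get_input_embeddings() is model._embed)  # True True
```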
test_modeling_tf_whisper.py::TFWhisperEncoderModelTest::test_save_load
From the CI artefacts, it looks like this is failing because of decoder_input_ids being in the input
Hello,
Apologies for the delay. I am attempting to instantiate an instance of the TFWhisperForAudioClassification model to debug some of the issues I'm having. So, I try to run this:
>>> from transformers import TFWhisperForAudioClassification
I end up getting this error:
RecursionError: maximum recursion depth exceeded while calling a Python object
Which stems from these lines of code:
https://github.com/huggingface/transformers/blob/080a97119c0dabfd0fb5c3e26a872ad2958e4f77/src/transformers/models/auto/auto_factory.py#L701-L707
When I run a debugger, the problematic statement is:
https://github.com/huggingface/transformers/blob/080a97119c0dabfd0fb5c3e26a872ad2958e4f77/src/transformers/models/auto/auto_factory.py#L705
Just executing self._model_mapping.keys() on its own results in the RecursionError.
I have been trying to see what is causing this, but I'm at a loss. Is this why you suggest creating the model using a test config? Could you show how to do that if it is relevant to avoiding this error? I contemplated increasing the recursion depth on my machine (it's currently at 1000), but I doubt that would solve it.
Thanks again for your patience, I realize I'm quite the n00b :sweat_smile:
@amyeroberts @sanchit-gandhi
Hello,
I am currently attempting to resolve the error:
Error -
E TypeError: Exception encountered when calling layer 'tf_whisper_for_audio_classification_4' (type TFWhisperForAudioClassification).
E
E call() got an unexpected keyword argument 'decoder_input_ids'
E
E Call arguments received by layer 'tf_whisper_for_audio_classification_4' (type TFWhisperForAudioClassification):
E • input_features={'input_features': 'tf.Tensor(shape=(2, 80, 59), dtype=float32)', 'decoder_input_ids': 'tf.Tensor(shape=(1, 2), dtype=int32)'}
E • head_mask=None
E • encoder_outputs=None
E • labels=None
E • output_attentions=None
E • output_hidden_states=None
E • return_dict=None
../../../src/transformers/modeling_tf_utils.py:434: TypeError
This error is the root cause of several of the failing tests. I think the issue is that TFWhisperForAudioClassification inherits from the class TFWhisperPreTrainedModel, which has the following methods:
https://github.com/huggingface/transformers/blob/50573c648ae953dcc1b94d663651f07fb02268f4/src/transformers/models/whisper/modeling_tf_whisper.py#L464-L498
I believe the dummy_inputs method is introducing decoder_input_ids into the input. By commenting out a couple of lines:
    @property
    def dummy_inputs(self) -> Dict[str, tf.Tensor]:
        """
        Dummy inputs to build the network.

        Returns:
            `Dict[str, tf.Tensor]`: The dummy inputs.
        """
        return {
            self.main_input_name: tf.random.uniform(
                [1, self.config.num_mel_bins, self.config.max_source_positions * 2 - 1], dtype=tf.float32
            ),
            # "decoder_input_ids": tf.constant([[1, 3]], dtype=tf.int32),
        }

    @property
    def input_signature(self):
        return {
            "input_features": tf.TensorSpec((None, self.config.num_mel_bins, None), tf.float32, name="input_features"),
            # "decoder_input_ids": tf.TensorSpec((None, None), tf.int32, name="decoder_input_ids"),
            "decoder_attention_mask": tf.TensorSpec((None, None), tf.int32, name="decoder_attention_mask"),
        }
The number of failing tests drops to 4, although this obviously introduces new errors (attached at the bottom for reference). The PyTorch equivalent does not contain the dummy_inputs and input_signature methods:
https://github.com/huggingface/transformers/blob/50573c648ae953dcc1b94d663651f07fb02268f4/src/transformers/models/whisper/modeling_whisper.py#L654-L682
My questions are:
(1) Should I attempt to change the TensorFlow PreTrainedMethod to be similar to the Pytorch implementation?
or
(2) Is there some better way to proceed?
Once this is resolved, I am very close to finishing with this pull request. Thanks again for your patience! @amyeroberts @sanchit-gandhi
New Errors:
FAILED tests/models/whisper/test_modeling_tf_whisper.py::TFWhisperEncoderModelTest::test_resize_token_embeddings - ValueError: Attempt to convert a value (None) with an unsupported type (<class 'NoneType'>) to a Tensor.
FAILED tests/models/whisper/test_modeling_tf_whisper.py::TFWhisperEncoderModelTest::test_save_load - AssertionError: 5.524128 not less than or equal to 1e-05
@adit299 dummy_inputs and input_signature are methods unique to the TensorFlow models and aren't needed in the PyTorch implementation.
TFWhisperForAudioClassification should implement its own dummy_inputs and input_signature which override the methods it inherits from TFWhisperPreTrainedModel.
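The override pattern described above can be illustrated without TensorFlow (the classes below are sketches of the idea, not the real implementations; tensor construction is replaced by string placeholders):

```python
class TFWhisperPreTrainedModelSketch:
    # Base class builds dummy inputs for the full encoder-decoder model.
    @property
    def dummy_inputs(self):
        return {"input_features": "audio-tensor", "decoder_input_ids": "ids-tensor"}

class TFWhisperForAudioClassificationSketch(TFWhisperPreTrainedModelSketch):
    # Encoder-only classifier: override so no decoder inputs are fed in,
    # which removes the unexpected `decoder_input_ids` keyword argument.
    @property
    def dummy_inputs(self):
        return {"input_features": "audio-tensor"}

model = TFWhisperForAudioClassificationSketch()
print("decoder_input_ids" in model.dummy_inputs)  # False
```

The real override would do the same for input_signature: keep the input_features TensorSpec and drop the decoder-side entries.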
I'm going to be away mid-September to mid-October. If you have any other TensorFlow-specific questions, or questions about the differences between the TF and PT models, please ping @Rocketknight1 in my absence.