
Evaluate trainer on Code-Switched Speech fails with "ValueError: Multiple languages detected when trying to predict the most likely target language for transcription."

sproocht opened this issue

System Info

  • transformers version: 4.41.0.dev0
  • Platform: Linux-6.5.0-28-generic-x86_64-with-glibc2.35
  • Python version: 3.10.12
  • Huggingface_hub version: 0.23.0
  • Safetensors version: 0.4.3
  • Accelerate version: 0.30.1.dev0
  • Accelerate config: not found
  • PyTorch version (GPU?): 2.3.0+cu121 (True)
  • Tensorflow version (GPU?): 2.13.1 (True)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using GPU in script?: Yes
  • Using distributed or parallel set-up in script?: No

Who can help?

@sanchit-gandhi @ArthurZucker @muellerzr

This issue relates to fine-tuning Whisper on datasets that may contain switches from a base language to other languages, or simply low-resource languages for which language identification by the pre-trained model is not accurate enough. The issue can be reproduced by mixing a few French audio utterances into a German dataset, for example, and running `trainer.evaluate()` on it.

Up until transformers version 4.37.2, fine-tuning and evaluating on these types of datasets did not raise any issues, and the fine-tuning results were very acceptable. In more recent versions, starting with 4.38.0, model evaluation systematically fails on such datasets (in transformers/models/whisper/generation_whisper.py).

I can understand the idea of forcing a single language in a batch, but in real-life situations people use many languages concurrently in their daily interactions, and this is reflected in the datasets. This restriction prohibits fine-tuning for languages such as Luxembourgish, where it is common to mix Luxembourgish with English, French or German within the same utterance. Many other cases concern Spanglish or Hinglish, or low-resource languages borrowing words or phrases from high-resource languages. So it could prevent using the transformers library to fine-tune for such languages.

The only workaround I have at the moment is to stick to version 4.37.2. Please have a look at this regression.

Thank you in advance!

Here is the full traceback:

```
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
/tmp/ipykernel_12853/1263219524.py in <module>
      1 # Get initial evaluation results
----> 2 trainer.evaluate()

~/.local/lib/python3.10/site-packages/transformers/trainer_seq2seq.py in evaluate(self, eval_dataset, ignore_keys, metric_key_prefix, **gen_kwargs)
    178         self.gather_function = self.accelerator.gather
    179         self._gen_kwargs = gen_kwargs
--> 180         return super().evaluate(eval_dataset, ignore_keys=ignore_keys, metric_key_prefix=metric_key_prefix)
    181
    182     def predict(

~/.local/lib/python3.10/site-packages/transformers/trainer.py in evaluate(self, eval_dataset, ignore_keys, metric_key_prefix)
   3513
   3514         eval_loop = self.prediction_loop if self.args.use_legacy_prediction_loop else self.evaluation_loop
-> 3515         output = eval_loop(
   3516             eval_dataloader,
   3517             description="Evaluation",

~/.local/lib/python3.10/site-packages/transformers/trainer.py in evaluation_loop(self, dataloader, description, prediction_loss_only, ignore_keys, metric_key_prefix)
   3696
   3697             # Prediction step
-> 3698             loss, logits, labels = self.prediction_step(model, inputs, prediction_loss_only, ignore_keys=ignore_keys)
   3699             main_input_name = getattr(self.model, "main_input_name", "input_ids")
   3700             inputs_decode = self._prepare_input(inputs[main_input_name]) if args.include_inputs_for_metrics else None

~/.local/lib/python3.10/site-packages/transformers/trainer_seq2seq.py in prediction_step(self, model, inputs, prediction_loss_only, ignore_keys, **gen_kwargs)
    308             k: v for k, v in inputs.items() if k not in ("decoder_input_ids", "decoder_attention_mask")
    309         }
--> 310         generated_tokens = self.model.generate(**generation_inputs, **gen_kwargs)
    311
    312         # Temporary hack to ensure the generation config is not initialized for each iteration of the evaluation loop

~/.local/lib/python3.10/site-packages/transformers/models/whisper/generation_whisper.py in generate(self, input_features, generation_config, logits_processor, stopping_criteria, prefix_allowed_tokens_fn, synced_gpus, return_timestamps, task, language, is_multilingual, prompt_ids, prompt_condition_type, condition_on_prev_tokens, temperature, compression_ratio_threshold, logprob_threshold, no_speech_threshold, num_segment_frames, attention_mask, time_precision, return_token_timestamps, return_segments, return_dict_in_generate, **kwargs)
    528
    529         # pass self.config for backward compatibility
--> 530         init_tokens = self._retrieve_init_tokens(
    531             input_features,
    532             generation_config=generation_config,

~/.local/lib/python3.10/site-packages/transformers/models/whisper/generation_whisper.py in _retrieve_init_tokens(self, input_features, generation_config, config, num_segment_frames, kwargs)
   1167
   1168         if torch.unique(lang_ids).shape[0] > 1:
-> 1169             raise ValueError(
   1170                 "Multiple languages detected when trying to predict the most likely target language for transcription. It is currently not supported to transcribe to different languages in a single batch. Please make sure to either force a single language by passing language='...' or make sure all input audio is of the same language."
   1171             )

ValueError: Multiple languages detected when trying to predict the most likely target language for transcription. It is currently not supported to transcribe to different languages in a single batch. Please make sure to either force a single language by passing language='...' or make sure all input audio is of the same language.
```

Information

  • [X] The official example scripts
  • [X] My own modified scripts

Tasks

  • [ ] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • [X] My own task or dataset (give details below)

Reproduction

Run `trainer.evaluate()` on a dataset containing a mix of languages; a minimal sketch follows below.
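The following is an illustrative sketch of a setup that triggers the error, not the reporter's actual configuration: the checkpoint name, `mixed_language_eval_dataset`, and `data_collator` are hypothetical placeholders.

```python
# Illustrative sketch only: checkpoint, dataset, and collator are placeholders.
from transformers import (
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
    WhisperForConditionalGeneration,
    WhisperProcessor,
)

model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small")
processor = WhisperProcessor.from_pretrained("openai/whisper-small")

training_args = Seq2SeqTrainingArguments(
    output_dir="./whisper-eval",
    per_device_eval_batch_size=8,
    predict_with_generate=True,  # evaluation then calls model.generate()
)

trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    eval_dataset=mixed_language_eval_dataset,  # hypothetical: e.g. German audio with a few French utterances mixed in
    data_collator=data_collator,               # hypothetical: the usual Whisper padding collator
    tokenizer=processor.feature_extractor,
)

# No language is forced, so generate() runs per-sample language detection;
# on transformers >= 4.38.0 a batch whose samples are detected as different
# languages raises the ValueError shown above.
trainer.evaluate()
```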

Expected behavior

Evaluation should complete without error, as it does in transformers versions up to 4.37.2.

sproocht · May 04 '24

cc @kamilakesbi

amyeroberts · May 07 '24

Hi @sproocht,

Thanks for sharing this error! It will be solved with PR #29688.

kamilakesbi · May 10 '24

Hi @kamilakesbi, Perfect! Thank you for confirming and for working on this. Best regards,

sproocht · May 11 '24

Hey @sproocht - thanks for reporting! This issue was in fact closed by #29938 for the Transformers example, and https://github.com/huggingface/blog/pull/1944 for the blog post.

If you copy the latest example script and use the latest version of Transformers, you should be able to force the language token by setting the --language argument, which will bypass the automatic language detection.
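For reference, a minimal sketch of the same idea in plain Python (outside the example script); "german" is just an illustrative choice, not a value from this thread:

```python
# Sketch: pin the target language on the generation config so Whisper
# skips automatic language detection entirely during evaluation.
# "german" is illustrative; use the language of your fine-tuning data.
model.generation_config.language = "german"
model.generation_config.task = "transcribe"

# The same keywords also work when calling generate() directly:
# predicted_ids = model.generate(input_features, language="german", task="transcribe")

metrics = trainer.evaluate()
```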

Hope that helps!

sanchit-gandhi · May 16 '24

Hey @sproocht - I battle-tested this a bit and found you're indeed correct: the generation config is still not correctly updated. This PR should fix it once and for all: #30865

sanchit-gandhi · May 16 '24

Hey @sanchit-gandhi, That's great! Thank you for the updates. I look forward to testing the fix once the PR is merged.

leophill · May 16 '24

Hey @sanchit-gandhi, Nice job! Thanks for confirming. I will definitely give it a try after the PR is merged. Best regards,

sproocht · May 16 '24