
Whisper - list index out of range with word level timestamps

Open · maxkvbn opened this issue 1 year ago

System Info

  • transformers version: 4.42.2
  • Platform: Windows-10-10.0.22621-SP0
  • Python version: 3.10.14
  • Huggingface_hub version: 0.23.4
  • Safetensors version: 0.4.2
  • Accelerate version: 0.31.0
  • Accelerate config: not found
  • PyTorch version (GPU?): 2.3.1 (True)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using distributed or parallel set-up in script?: No
  • Using GPU in script?: Yes
  • GPU type: NVIDIA GeForce RTX 4070 Laptop GPU

Who can help?

No response

Information

  • [ ] The official example scripts
  • [X] My own modified scripts

Tasks

  • [ ] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • [X] My own task or dataset (give details below)

Reproduction

  1. Load the 'whisper-large-v3' AutoModelForSpeechSeq2Seq model and move it to the GPU.
  2. Set up the pipeline with return_timestamps="word", amongst other settings.
  3. Run audio through the pipe, which raises the error shown below (a minimal sketch of this setup follows the list).
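
For reference, here is a minimal sketch of the setup described in steps 1-3 (the model id, audio path, return_timestamps, and generate_kwargs are taken from the report; the dtype and exact pipeline wiring are illustrative assumptions):

import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline

model_id = "openai/whisper-large-v3"

# Step 1: load the model and move it to the GPU
model = AutoModelForSpeechSeq2Seq.from_pretrained(model_id, torch_dtype=torch.float16)
model.to("cuda")

processor = AutoProcessor.from_pretrained(model_id)

# Step 2: build the ASR pipeline around the loaded model
transcribing_pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    torch_dtype=torch.float16,
)

# Step 3: transcribe with word-level timestamps; this is the call that fails
asr_out = transcribing_pipe(
    "../SampleData/Saba_interview_short.wav",  # audio file from the traceback
    return_timestamps="word",
    generate_kwargs={"language": "danish"},
)

Running this produces the following traceback: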
---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
Cell In[8], line 1
----> 1 asr_out = transcribing_pipe(
      2     '../SampleData/Saba_interview_short.wav',
      3     return_timestamps="word",
      4     generate_kwargs={"language": "danish"}
      5     )
      7 asr_out

File c:\Users\User\miniconda3\envs\vva\lib\site-packages\transformers\pipelines\automatic_speech_recognition.py:284, in AutomaticSpeechRecognitionPipeline.__call__(self, inputs, **kwargs)
    221 def __call__(
    222     self,
    223     inputs: Union[np.ndarray, bytes, str],
    224     **kwargs,
    225 ):
    226     """
    227     Transcribe the audio sequence(s) given as inputs to text. See the [`AutomaticSpeechRecognitionPipeline`]
    228     documentation for more information.
   (...)
    282                 `"".join(chunk["text"] for chunk in output["chunks"])`.
    283     """
--> 284     return super().__call__(inputs, **kwargs)

File c:\Users\User\miniconda3\envs\vva\lib\site-packages\transformers\pipelines\base.py:1246, in Pipeline.__call__(self, inputs, num_workers, batch_size, *args, **kwargs)
   1244     return self.iterate(inputs, preprocess_params, forward_params, postprocess_params)
   1245 elif self.framework == "pt" and isinstance(self, ChunkPipeline):
-> 1246     return next(
   1247         iter(
   1248             self.get_iterator(
   1249                 [inputs], num_workers, batch_size, preprocess_params, forward_params, postprocess_params
   1250             )
   1251         )
   1252     )
   1253 else:
   1254     return self.run_single(inputs, preprocess_params, forward_params, postprocess_params)

File c:\Users\User\miniconda3\envs\vva\lib\site-packages\transformers\pipelines\pt_utils.py:125, in PipelineIterator.__next__(self)
    123 # We're out of items within a batch
    124 item = next(self.iterator)
--> 125 processed = self.infer(item, **self.params)
    126 # We now have a batch of "inferred things".
    127 if self.loader_batch_size is not None:
    128     # Try to infer the size of the batch

File c:\Users\User\miniconda3\envs\vva\lib\site-packages\transformers\pipelines\automatic_speech_recognition.py:587, in AutomaticSpeechRecognitionPipeline.postprocess(self, model_outputs, decoder_kwargs, return_timestamps, return_language)
    584             stride_right /= sampling_rate
    585             output["stride"] = chunk_len, stride_left, stride_right
--> 587     text, optional = self.tokenizer._decode_asr(
    588         model_outputs,
    589         return_timestamps=return_timestamps,
    590         return_language=return_language,
    591         time_precision=time_precision,
    592     )
    593 else:
    594     items = np.concatenate(final_items, axis=1)

File c:\Users\User\miniconda3\envs\vva\lib\site-packages\transformers\models\whisper\tokenization_whisper.py:832, in WhisperTokenizer._decode_asr(self, model_outputs, return_timestamps, return_language, time_precision)
    831 def _decode_asr(self, model_outputs, *, return_timestamps, return_language, time_precision):
--> 832     return _decode_asr(
    833         self,
    834         model_outputs,
    835         return_timestamps=return_timestamps,
    836         return_language=return_language,
    837         time_precision=time_precision,
    838     )

File c:\Users\User\miniconda3\envs\vva\lib\site-packages\transformers\models\whisper\tokenization_whisper.py:1032, in _decode_asr(tokenizer, model_outputs, return_timestamps, return_language, time_precision)
   1030 current_tokens.append(token)
   1031 if return_timestamps == "word":
-> 1032     start_time = round(token_timestamps[i] + time_offset, 2)
   1033     if i + 1 < len(token_timestamps):
   1034         end_time = round(token_timestamps[i + 1] + time_offset, 2)

IndexError: list index out of range
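
The last frame shows the failure mode: _decode_asr indexes token_timestamps[i] in lockstep with the decoded tokens, so the IndexError means token_timestamps is shorter than the token sequence being walked. Below is a hypothetical, self-contained illustration of that mismatch with a defensive bounds check (this is illustration only, not necessarily the fix from the discussion linked below):

# Hypothetical standalone sketch of the failure mode: fewer
# timestamps than tokens reproduces the IndexError pattern.
token_timestamps = [0.0, 0.42]          # only two timestamps ...
tokens = ["Hej", ",", "velkommen"]      # ... but three tokens to walk
time_offset = 0.0

for i, token in enumerate(tokens):
    # Unguarded, token_timestamps[i] raises IndexError once i >= 2;
    # a bounds check like this avoids the crash.
    if i < len(token_timestamps):
        start_time = round(token_timestamps[i] + time_offset, 2)
        print(token, start_time)
    else:
        print(token, "no timestamp available")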

I've uploaded the audio that I am trying to process here

I have opened a discussion on the Whisper hub page, where I have implemented a fix that seems to work: here

Colab link here

Expected behavior

  1. Load the 'whisper-large-v3' AutoModelForSpeechSeq2Seq model and move it to the GPU.
  2. Set up the pipeline with return_timestamps="word", amongst other settings.
  3. Run audio through the pipe, which returns the transcription together with word-level timestamp chunks (see the example output below).
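
With word-level timestamps, the output should be a dict containing the full text plus per-word chunks, roughly like this (the words and times are illustrative; the structure matches the pipeline's chunk format):

asr_out = {
    "text": " Hej, velkommen til interviewet.",
    "chunks": [
        {"text": " Hej,", "timestamp": (0.0, 0.48)},
        {"text": " velkommen", "timestamp": (0.52, 1.04)},
        {"text": " til", "timestamp": (1.04, 1.22)},
        {"text": " interviewet.", "timestamp": (1.26, 1.9)},
    ],
}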

maxkvbn avatar Jun 28 '24 08:06 maxkvbn

cc @sanchit-gandhi @kamilakesbi

amyeroberts avatar Jun 28 '24 12:06 amyeroberts

Still an issue.

Waiting on PR approval.

maxkvbn avatar Jul 29 '24 08:07 maxkvbn

Hey @maxkvbn, super sorry for the late reply here. I've been trying to reproduce the issue locally, but in every case I've tried, the current code on main works as expected. For example, here's a reproducer that uses long-form generation with a single audio sample and word-level timestamps:

from transformers import pipeline, AutoProcessor, WhisperForConditionalGeneration
from transformers.utils import is_accelerate_available
from datasets import load_dataset

# Load the model and processor explicitly so they can be wired into the pipeline
processor = AutoProcessor.from_pretrained("openai/whisper-tiny.en")
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-tiny.en", low_cpu_mem_usage=is_accelerate_available())

pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
)

# Single long-form audio sample, which exercises the long-form generation path
dataset = load_dataset("distil-whisper/librispeech_long", "clean", split="validation")
sample = dataset[0]["audio"]

text = pipe(sample, batch_size=1, return_timestamps="word")
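
The word-level chunks from the run above can then be inspected directly (the printed values here are illustrative):

print(text["chunks"][:3])
# e.g. [{'text': ' Mister', 'timestamp': (0.0, 0.54)}, ...]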

Could you please share a reproducer for the issue that I can run end-to-end? E.g. by sharing the audio file you're using, or updating the code example I've shared to trigger the error.

I've had a quick glance over your PR and the changes look reasonable, so if I can confirm it's a fix we need, I'm confident we can merge it quickly.

sanchit-gandhi avatar Aug 01 '24 03:08 sanchit-gandhi

Hey @sanchit-gandhi.

Did you try with the uploaded audio from https://github.com/huggingface/transformers/issues/31683#issue-2379903876?

maxkvbn avatar Aug 01 '24 08:08 maxkvbn

cc @ylacombe

amyeroberts avatar Aug 27 '24 11:08 amyeroberts

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

github-actions[bot] avatar Oct 17 '24 08:10 github-actions[bot]