Whisper - list index out of range with word level timestamps
System Info
- `transformers` version: 4.42.2
- Platform: Windows-10-10.0.22621-SP0
- Python version: 3.10.14
- Huggingface_hub version: 0.23.4
- Safetensors version: 0.4.2
- Accelerate version: 0.31.0
- Accelerate config: not found
- PyTorch version (GPU?): 2.3.1 (True)
- Tensorflow version (GPU?): not installed (NA)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Using distributed or parallel set-up in script?: No
- Using GPU in script?: Yes
- GPU type: NVIDIA GeForce RTX 4070 Laptop GPU
Who can help?
No response
Information
- [ ] The official example scripts
- [X] My own modified scripts
Tasks
- [ ] An officially supported task in the `examples` folder (such as GLUE/SQuAD, ...)
- [X] My own task or dataset (give details below)
Reproduction
- Load the 'whisper-large-v3' AutoModelForSpeechSeq2Seq model and move it to the GPU.
- Set up the pipeline with return_timestamps="word", among other settings.
- Run the audio through the pipeline, which raises the error (see the sketch after this list).
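For reference, a minimal sketch of the kind of setup involved (the dtype/device choices and model-loading details here are illustrative, not my exact script; the file path and call arguments match the traceback below):

```python
import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline

model_id = "openai/whisper-large-v3"
model = AutoModelForSpeechSeq2Seq.from_pretrained(model_id, torch_dtype=torch.float16)
model.to("cuda")

processor = AutoProcessor.from_pretrained(model_id)

transcribing_pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    chunk_length_s=30,  # chunked inference; the traceback goes through the stride branch
    torch_dtype=torch.float16,
    device="cuda",
)

asr_out = transcribing_pipe(
    "../SampleData/Saba_interview_short.wav",
    return_timestamps="word",
    generate_kwargs={"language": "danish"},
)
```

Running the pipeline produces the following traceback: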
```
---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
Cell In[8], line 1
----> 1 asr_out = transcribing_pipe(
      2     '../SampleData/Saba_interview_short.wav',
      3     return_timestamps="word",
      4     generate_kwargs={"language": "danish"}
      5 )
      7 asr_out

File c:\Users\User\miniconda3\envs\vva\lib\site-packages\transformers\pipelines\automatic_speech_recognition.py:284, in AutomaticSpeechRecognitionPipeline.__call__(self, inputs, **kwargs)
    221 def __call__(
    222     self,
    223     inputs: Union[np.ndarray, bytes, str],
    224     **kwargs,
    225 ):
    226     """
    227     Transcribe the audio sequence(s) given as inputs to text. See the [`AutomaticSpeechRecognitionPipeline`]
    228     documentation for more information.
   (...)
    282         `"".join(chunk["text"] for chunk in output["chunks"])`.
    283     """
--> 284     return super().__call__(inputs, **kwargs)

File c:\Users\User\miniconda3\envs\vva\lib\site-packages\transformers\pipelines\base.py:1246, in Pipeline.__call__(self, inputs, num_workers, batch_size, *args, **kwargs)
   1244     return self.iterate(inputs, preprocess_params, forward_params, postprocess_params)
   1245 elif self.framework == "pt" and isinstance(self, ChunkPipeline):
-> 1246     return next(
   1247         iter(
   1248             self.get_iterator(
   1249                 [inputs], num_workers, batch_size, preprocess_params, forward_params, postprocess_params
   1250             )
   1251         )
   1252     )
   1253 else:
   1254     return self.run_single(inputs, preprocess_params, forward_params, postprocess_params)

File c:\Users\User\miniconda3\envs\vva\lib\site-packages\transformers\pipelines\pt_utils.py:125, in PipelineIterator.__next__(self)
    123 # We're out of items within a batch
    124 item = next(self.iterator)
--> 125 processed = self.infer(item, **self.params)
    126 # We now have a batch of "inferred things".
    127 if self.loader_batch_size is not None:
    128     # Try to infer the size of the batch

File c:\Users\User\miniconda3\envs\vva\lib\site-packages\transformers\pipelines\automatic_speech_recognition.py:587, in AutomaticSpeechRecognitionPipeline.postprocess(self, model_outputs, decoder_kwargs, return_timestamps, return_language)
    584     stride_right /= sampling_rate
    585     output["stride"] = chunk_len, stride_left, stride_right
--> 587 text, optional = self.tokenizer._decode_asr(
    588     model_outputs,
    589     return_timestamps=return_timestamps,
    590     return_language=return_language,
    591     time_precision=time_precision,
    592 )
    593 else:
    594     items = np.concatenate(final_items, axis=1)

File c:\Users\User\miniconda3\envs\vva\lib\site-packages\transformers\models\whisper\tokenization_whisper.py:832, in WhisperTokenizer._decode_asr(self, model_outputs, return_timestamps, return_language, time_precision)
    831 def _decode_asr(self, model_outputs, *, return_timestamps, return_language, time_precision):
--> 832     return _decode_asr(
    833         self,
    834         model_outputs,
    835         return_timestamps=return_timestamps,
    836         return_language=return_language,
    837         time_precision=time_precision,
    838     )

File c:\Users\User\miniconda3\envs\vva\lib\site-packages\transformers\models\whisper\tokenization_whisper.py:1032, in _decode_asr(tokenizer, model_outputs, return_timestamps, return_language, time_precision)
   1030     current_tokens.append(token)
   1031     if return_timestamps == "word":
-> 1032         start_time = round(token_timestamps[i] + time_offset, 2)
   1033         if i + 1 < len(token_timestamps):
   1034             end_time = round(token_timestamps[i + 1] + time_offset, 2)

IndexError: list index out of range
```
I've uploaded the audio that I am trying to process here.
I have opened a discussion on Whisper's Hub page, where I have implemented a fix that seems to work: here.
Colab link here.
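For context, the failing line indexes token_timestamps[i] without first checking that i is in range, so the decoded tokens can outnumber the returned word timestamps. The general shape of a guard (a sketch of the idea only; the actual change in my linked fix may differ) is:

```python
# Sketch, following the variable names visible in the traceback;
# the surrounding loop and append logic are elided.
if return_timestamps == "word" and i < len(token_timestamps):
    start_time = round(token_timestamps[i] + time_offset, 2)
    end_time = (
        round(token_timestamps[i + 1] + time_offset, 2)
        if i + 1 < len(token_timestamps)
        else None  # no timestamp available for the final token
    )
```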
Expected behavior
- Load the 'whisper-large-v3' AutoModelForSpeechSeq2Seq model and move it to the GPU.
- Set up the pipeline with return_timestamps="word", among other settings.
- Run the audio through the pipeline, which returns the transcription together with word-level timestamp chunks (see the example after this list).
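That is, an output shaped roughly like the following (the words and times are illustrative, not actual output from the file):

```python
{
    "text": " Hej og velkommen til interviewet.",
    "chunks": [
        {"text": " Hej", "timestamp": (0.0, 0.42)},
        {"text": " og", "timestamp": (0.42, 0.58)},
        {"text": " velkommen", "timestamp": (0.58, 1.14)},
        # ... one entry per word, each with a (start, end) pair in seconds
    ],
}
```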
cc @sanchit-gandhi @kamilakesbi
Still an issue; waiting on PR approval.
Hey @maxkvbn, super sorry for the late reply here. I've been trying to reproduce the issue locally, but in every case I've tried, the current code on main works as expected. For example, here's a reproducer that uses long-form generation with a single audio sample and word-level timestamps:
```python
from transformers import pipeline, AutoProcessor, WhisperForConditionalGeneration
from transformers.utils import is_accelerate_available
from datasets import load_dataset

# Load the processor and model explicitly and hand them to the pipeline.
processor = AutoProcessor.from_pretrained("openai/whisper-tiny.en")
model = WhisperForConditionalGeneration.from_pretrained(
    "openai/whisper-tiny.en", low_cpu_mem_usage=is_accelerate_available()
)
pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
)

dataset = load_dataset("distil-whisper/librispeech_long", "clean", split="validation")
sample = dataset[0]["audio"]

text = pipe(sample, batch_size=1, return_timestamps="word")
```
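For completeness, a variant that forces the chunked inference path (the stride branch that appears in the traceback above) would pass chunk_length_s at call time; note this is an assumed way to trigger that path, not something I've confirmed reproduces the error:

```python
# Force chunked long-form inference so postprocess takes the stride branch.
text_chunked = pipe(sample, batch_size=1, chunk_length_s=30, return_timestamps="word")
```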
Could you please share a reproducer for the issue that I can run end-to-end, e.g. by sharing the audio file you're using, or by updating the code example above so that it triggers the error?
I've had a quick glance over your PR and the changes look sensible, so once I can confirm it's a fix we need, I'm confident we can merge it quickly.
Hey @sanchit-gandhi.
Did you try with the uploaded audio from https://github.com/huggingface/transformers/issues/31683#issue-2379903876?
cc @ylacombe
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.