Whisper word-level timestamp extraction fails with beam search
System Info
- transformers version: 4.48.2
- Platform: macOS-15.3-arm64-arm-64bit
- Python version: 3.12.1
- Huggingface_hub version: 0.28.1
- Safetensors version: 0.4.3
- Accelerate version: 1.3.0
- Accelerate config: not found
- PyTorch version (GPU?): 2.6.0 (False)
- Tensorflow version (GPU?): not installed (NA)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Using distributed or parallel set-up in script?: No
Who can help?
@ylacombe, @eustlb
Information
- [x] The official example scripts
- [ ] My own modified scripts
Tasks
- [ ] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
- [x] My own task or dataset (give details below)
Reproduction
Steps to reproduce the behavior:
- Download the sample: https://drive.google.com/file/d/19xqGiGc1fse532d6t6u5OGI_oNbTfdJe/view?usp=sharing
- Run the sample script:
import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline

device = 'cuda' if torch.cuda.is_available() else 'mps' if torch.backends.mps.is_available() else 'cpu'
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

model_id = 'openai/whisper-small'
model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True
)
model.to(device)

processor = AutoProcessor.from_pretrained(model_id)

generate_kwargs = {
    'num_beams': 2,
}

pipe = pipeline(
    'automatic-speech-recognition',
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    torch_dtype=torch_dtype,
    device=device,
    return_timestamps='word',
)

result = pipe('efbef66e35e6456ba37461d9c5f12fcd.mp3', generate_kwargs=generate_kwargs)
for chunk in result['chunks']:
    print(f"{chunk['timestamp']} {chunk['text']}")
- Find the following error in `_extract_token_timestamps(...)`:
Traceback (most recent call last):
File ".....py", line 40, in <module>
result = pipe(str(file), generate_kwargs=generate_kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "....../lib/python3.12/site-packages/transformers/pipelines/automatic_speech_recognition.py", line 283, in __call__
return super().__call__(inputs, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "...../lib/python3.12/site-packages/transformers/pipelines/base.py", line 1354, in __call__
return next(
^^^^^
File "...../lib/python3.12/site-packages/transformers/pipelines/pt_utils.py", line 124, in __next__
item = next(self.iterator)
^^^^^^^^^^^^^^^^^^^
File "...../lib/python3.12/site-packages/transformers/pipelines/pt_utils.py", line 269, in __next__
processed = self.infer(next(self.iterator), **self.params)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "...../lib/python3.12/site-packages/transformers/pipelines/base.py", line 1269, in forward
model_outputs = self._forward(model_inputs, **forward_params)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "...../lib/python3.12/site-packages/transformers/pipelines/automatic_speech_recognition.py", line 521, in _forward
tokens = self.model.generate(
^^^^^^^^^^^^^^^^^^^^
File "....../lib/python3.12/site-packages/transformers/models/whisper/generation_whisper.py", line 774, in generate
) = self.generate_with_fallback(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "....../lib/python3.12/site-packages/transformers/models/whisper/generation_whisper.py", line 965, in generate_with_fallback
seek_sequences, seek_outputs = self._postprocess_outputs(
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "...../lib/python3.12/site-packages/transformers/models/whisper/generation_whisper.py", line 1067, in _postprocess_outputs
seek_outputs["token_timestamps"] = self._extract_token_timestamps(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "....../lib/python3.12/site-packages/transformers/models/whisper/generation_whisper.py", line 268, in _extract_token_timestamps
torch.index_select(weights[:, :, i, :], dim=0, index=beam_indices[:, i])
~~~~~~~^^^^^^^^^^^^
IndexError: index 447 is out of bounds for dimension 2 with size 447
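For what it's worth, the failing indexing can be reproduced in isolation with dummy tensors. The shapes below are assumptions chosen to match the error message, not the actual internals:

import torch

# Hypothetical shapes: the generated sequence is one step longer than the
# cached cross-attention weights used for timestamp extraction.
num_beams, num_heads, weight_len, num_frames = 2, 8, 447, 1500
seq_len = weight_len + 1

weights = torch.randn(num_beams, num_heads, weight_len, num_frames)
beam_indices = torch.zeros(num_beams, seq_len, dtype=torch.long)

for i in range(seq_len):
    # Mirrors the failing call in _extract_token_timestamps:
    # weights[:, :, i, :] raises IndexError once i reaches 447.
    torch.index_select(weights[:, :, i, :], dim=0, index=beam_indices[:, i])

If that reading is right, the loop over decoded positions runs one step past the time dimension of the attention weights when beam search is used.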
The sentence-level timestamps produce the following result (return_timestamps=True):
(0.0, 6.0) Viele Menschen haben auch wirklich richtig geahnt Angst, wenn sie sich trennen möchten.
(6.0, 12.0) Von ihrem toxischen Partner, weil sie sich bedroht fühlen. Teilweise sogar lebensbedroht fühlen.
(12.0, 17.0) Darum ist es sehr wichtig, dort auch als Außenstehende behutzamt vorzugehen.
(17.0, 0.0)
(3.32, 7.92) und nicht einfach denken, ja, es ist schon voll leise, es ist ja mega klar, gange doch einfach, weil nein, hier hat es viel Angst, viel Druck,
(7.92, 10.32) wo man nicht einfach so auf die Seite legen kann.
Here, the segment starting at 17.0 s has an empty string as its text.
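For completeness, the sentence-level run only differs from the script above in the timestamp flag:

pipe = pipeline(
    'automatic-speech-recognition',
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    torch_dtype=torch_dtype,
    device=device,
    return_timestamps=True,  # sentence-level instead of 'word'
)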
Furthermore, without beam search (`generate_kwargs={}`) and with word-level timestamps, I get the following hallucination:
(0.0, 0.62) Viele
(0.62, 1.14) Menschen
(1.14, 1.68) haben
(1.68, 2.1) auch
(2.1, 2.78) richtig
(2.78, 3.1) richtig
(3.1, 3.58) geahnt
(3.58, 4.4) Angst,
(4.4, 4.46) wenn
(4.46, 4.66) sie
(4.66, 4.86) sich
(4.86, 5.98) trennen
(5.98, 6.3) möchten.
(6.3, 6.36) Von
(6.36, 6.56) ihrem
(6.56, 7.14) toxischen
(7.14, 7.78) Partner,
(7.78, 7.84) weil
(7.84, 7.96) sie
(7.96, 8.2) sich
(8.2, 8.78) bedroht
(8.78, 9.3) fühlen,
(9.3, 9.6) teilweise
(9.6, 10.02) sogar
(10.02, 11.2) lebensbedroht
(11.2, 12.3) fühlen.
(12.3, 12.98) Darum
(12.98, 13.1) ist
(13.1, 13.24) es
(13.24, 13.46) sehr
(13.46, 14.1) wichtig,
(14.1, 14.24) dort
(14.24, 14.46) auch
(14.46, 14.6) als
(14.6, 15.58) Außenstehende
(15.58, 16.74) behutzeim
(16.74, 17.38) vorzugehen
(17.38, 17.54) und
(17.54, 17.7) nicht
(17.7, 17.98) einfach
(17.98, 18.44) denken,
(18.44, 18.6) ja,
(18.6, 18.64) ja,
...
(22.48, 22.48) ja,
(22.48, 22.48) ja,
(22.48, 22.56) ja,
(22.56, 24.18) ja,
(24.18, None) ja,
In another setup, where I used a batch size and loaded the samples with librosa beforehand (a sketch follows the traceback below), I also saw the following error in the same function (`_extract_token_timestamps(...)`), but I don't know whether the root cause is related:
File "..../lib/python3.12/site-packages/transformers/pipelines/automatic_speech_recognition.py", line 283, in __call__
return super().__call__(inputs, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "...../lib/python3.12/site-packages/transformers/pipelines/base.py", line 1343, in __call__
outputs = list(final_iterator)
^^^^^^^^^^^^^^^^^^^^
File "....../lib/python3.12/site-packages/transformers/pipelines/pt_utils.py", line 124, in __next__
item = next(self.iterator)
^^^^^^^^^^^^^^^^^^^
File "....../lib/python3.12/site-packages/transformers/pipelines/pt_utils.py", line 269, in __next__
processed = self.infer(next(self.iterator), **self.params)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "...../lib/python3.12/site-packages/transformers/pipelines/base.py", line 1269, in forward
model_outputs = self._forward(model_inputs, **forward_params)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "....../lib/python3.12/site-packages/transformers/pipelines/automatic_speech_recognition.py", line 521, in _forward
tokens = self.model.generate(
^^^^^^^^^^^^^^^^^^^^
File "...../lib/python3.12/site-packages/transformers/models/whisper/generation_whisper.py", line 774, in generate
) = self.generate_with_fallback(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "....../lib/python3.12/site-packages/transformers/models/whisper/generation_whisper.py", line 965, in generate_with_fallback
seek_sequences, seek_outputs = self._postprocess_outputs(
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "...../lib/python3.12/site-packages/transformers/models/whisper/generation_whisper.py", line 1067, in _postprocess_outputs
seek_outputs["token_timestamps"] = self._extract_token_timestamps(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "....../lib/python3.12/site-packages/transformers/models/whisper/generation_whisper.py", line 315, in _extract_token_timestamps
matrix = weights[batch_idx, ..., : num_frames[batch_idx] // 2]
~~~~~~~~~~^^^^^^^^^^^
IndexError: index 0 is out of bounds for dimension 0 with size 0
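For reference, that setup looked roughly like the sketch below; the file names and batch size are placeholders, only the librosa loading and the batching are as described:

import librosa

# Placeholder file list; samples are decoded to 16 kHz mono arrays,
# matching the Whisper feature extractor's sampling rate.
files = ['sample1.mp3', 'sample2.mp3']
samples = [librosa.load(f, sr=16000)[0] for f in files]

result = pipe(samples, batch_size=2, generate_kwargs=generate_kwargs)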
Some things I noticed:
- Without beam search, it works (the example above uses a beam width of 2; see the kwargs diff below).
- It must be related to hallucinated repetitions: I got the same issue with some fine-tuned models that hallucinate, and the example above also hallucinates without beam search.

Maybe someone else can reproduce this with another sample.
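To be explicit, the only difference between the failing and working runs is the beam setting:

generate_kwargs = {'num_beams': 2}  # beam search: crashes with return_timestamps='word'
generate_kwargs = {}                # greedy decoding: works, but hallucinates as shown above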
Expected behavior
Inference with word-level timestamps should not fail if something goes wrong during beam search. I would expect, e.g., None timestamps, as we get for sentence-level timestamps.
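In the meantime, a possible user-side stopgap (a sketch, not the behavior I would expect from the library) is to catch the failure and retry with sentence-level timestamps, which do not go through `_extract_token_timestamps`:

try:
    result = pipe('efbef66e35e6456ba37461d9c5f12fcd.mp3', generate_kwargs=generate_kwargs)
except IndexError:
    # Fallback: segment-level timestamps avoid the failing extraction path.
    result = pipe(
        'efbef66e35e6456ba37461d9c5f12fcd.mp3',
        return_timestamps=True,
        generate_kwargs=generate_kwargs,
    )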
Not sure if this is @eustlb or @gante, pinging both (with apologies!)
I am having the same issue (IndexError: index 447 is out of bounds for dimension 2 with size 447). Have you found any solution to this?
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.
Any idea on this one @eustlb @gante ?
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.
Not stale.
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.
Not stale. @eustlb @gante
👋 Thank you for opening the issue and keeping it live!
I've confirmed that the following self-contained script works fine in e.g. v4.47.0, but not on main. Investigating cause and fix :)
EDIT: actually, the test snippet shared above has been failing for much longer (I've tried all minor versions back to v4.43, and they are all broken 👀)
import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline
from datasets import load_dataset

device = 'cuda' if torch.cuda.is_available() else 'mps' if torch.backends.mps.is_available() else 'cpu'
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

model_id = 'openai/whisper-small'
model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True
)
model.to(device)

processor = AutoProcessor.from_pretrained(model_id)

generate_kwargs = {
    'num_beams': 2,
}

pipe = pipeline(
    'automatic-speech-recognition',
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    torch_dtype=torch_dtype,
    device=device,
    return_timestamps='word',
)

def _load_datasamples(num_samples):
    ds = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
    # automatic decoding with librispeech
    speech_samples = ds.sort("id").select(range(num_samples))[:num_samples]["audio"]
    return [x["array"] for x in speech_samples]

result = pipe(_load_datasamples(1), generate_kwargs=generate_kwargs)
for chunk in result[0]['chunks']:
    print(f"{chunk['timestamp']} {chunk['text']}")
@dintifla @maxkvbn @Maria-Ponte #38259 fixes the shape issues. Note that word timestamps still have a few issues (e.g. see #36632), and therefore the output is not yet perfect :)
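(For anyone wanting to try this before the next release: once #38259 is merged, the fix can be picked up by installing transformers from source, e.g. `pip install git+https://github.com/huggingface/transformers`.)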
With #38259, the test script above outputs
(0.0, 0.62) Viele
(0.62, 1.14) Menschen
(1.14, 1.68) haben
(1.68, 2.1) auch
(2.1, 2.72) wirklich
(2.72, 3.1) richtig
(3.1, 3.56) geahnt
(3.56, 4.4) Angst,
(4.4, 4.46) wenn
(4.46, 4.66) sie
(4.66, 4.86) sich
(4.86, 5.98) trennen
(5.98, 6.34) möchten,
(6.34, 6.34) von
(6.34, 6.56) ihrem
(6.56, 7.14) toxischen
(7.14, 7.74) Partner,
(7.74, 7.84) weil
(7.84, 7.96) sie
(7.96, 8.2) sich
(8.2, 8.78) bedroht
(8.78, 9.3) fühlen,
(9.3, 9.6) teilweise
(9.6, 10.02) sogar
(10.02, 11.2) lebensbedroht
(11.2, 12.3) fühlen.
(12.3, 12.98) Darum
(12.98, 13.1) ist
(13.1, 13.24) es
(13.24, 13.46) sehr
(13.46, 14.1) wichtig,
(14.1, 14.24) dort
(14.24, 14.46) auch
(14.46, 14.6) als
(14.6, 15.56) Außenstehende
(15.56, 16.66) behutzamt
(16.66, 17.36) vorzugehen
(...)
(21.64, 21.64) ja,
(21.64, 21.64) ja,
(21.64, 21.64) ja,
(21.64, 22.48) ja,
(22.48, 22.48) ja,
(22.48, 22.48) ja,
(22.48, 22.48) ja,
(22.48, 22.48) ja,
(22.48, 22.56) ja,
(22.56, 24.18) ja,
(24.18, None) ja,
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.
As @gante states, the issue does not seem fully resolved.
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.
As @gante states, the issue does not seem fully resolved.
Hey guys,
I tested with the script given by the bug filer:
#####################################################
import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline

device = 'cuda' if torch.cuda.is_available() else 'mps' if torch.backends.mps.is_available() else 'cpu'
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

model_id = 'openai/whisper-small'
model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True
)
model.to(device)

processor = AutoProcessor.from_pretrained(model_id)

generate_kwargs = {'num_beams': 2}

pipe = pipeline(
    'automatic-speech-recognition',
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    torch_dtype=torch_dtype,
    device=device,
    return_timestamps='word',
)

result = pipe('C:/Users/Lenovo/Transformer/efbef66e35e6456ba37461d9c5f12fcd.mp3', generate_kwargs=generate_kwargs)
for chunk in result['chunks']:
    print(f"{chunk['timestamp']} {chunk['text']}")
#########################################################################
With the release version, i.e. 4.53.3, I could reproduce the issue discussed above:
pip show transformers
Name: transformers
Version: 4.53.3
Summary: State-of-the-art Machine Learning for JAX, PyTorch and TensorFlow
Home-page: https://github.com/huggingface/transformers
Author: The Hugging Face team (past and future) with the help of all our contributors (https://github.com/huggingface/transformers/graphs/contributors)
Author-email: [email protected]
License: Apache 2.0 License
Location: C:\Users\Lenovo\myenv\Lib\site-packages
Requires: filelock, huggingface-hub, numpy, packaging, pyyaml, regex, requests, safetensors, tokenizers, tqdm
Required-by: trl
(myenv) PS C:\Users\Lenovo> python .\Transformer\bug_36093.py
Device set to use cpu
C:\Users\Lenovo\myenv\Lib\site-packages\transformers\models\whisper\generation_whisper.py:604: FutureWarning: The input name inputs is deprecated. Please make sure to use input_features instead.
warnings.warn(
Using custom forced_decoder_ids from the (generation) config. This is deprecated in favor of the task and language flags/config options.
Transcription using a multilingual Whisper will default to language detection followed by transcription instead of translation to English. This might be a breaking change for your use case. If you want to instead always translate your audio to English, make sure to pass language='en'. See https://github.com/huggingface/transformers/pull/28687 for more details.
Whisper did not predict an ending timestamp, which can happen if audio is cut off in the middle of a word. Also make sure WhisperTimeStampLogitsProcessor was used during generation.
(0.0, 0.62) Viele
(0.62, 1.14) Menschen
(1.14, 1.68) haben
(1.68, 2.1) auch
(2.1, 2.72) wirklich
(2.72, 3.1) richtig
(3.1, 3.56) geahnt
(3.56, 4.4) Angst,
(4.4, 4.46) wenn
(4.46, 4.66) sie
(4.66, 4.86) sich
(4.86, 5.98) trennen
(5.98, 6.34) möchten,
(6.34, 6.34) von
(6.34, 6.56) ihrem
(6.56, 7.14) toxischen
(7.14, 7.74) Partner,
(7.74, 7.84) weil
(7.84, 7.96) sie
(7.96, 8.2) sich
(8.2, 8.78) bedroht
(8.78, 9.3) fühlen,
(9.3, 9.6) teilweise
(9.6, 10.02) sogar
(10.02, 11.2) lebensbedroht
(11.2, 12.3) fühlen.
(12.3, 12.98) Darum
(12.98, 13.1) ist
(13.1, 13.24) es
(13.24, 13.46) sehr
(13.46, 14.1) wichtig,
(14.1, 14.24) dort
(14.24, 14.46) auch
(14.46, 14.6) als
(14.6, 15.56) Außenstehende
(15.56, 16.66) behutzamt
(16.66, 17.36) vorzugehen
(17.36, 17.54) und
(17.54, 17.7) nicht
(17.7, 17.96) einfach
(17.96, 18.42) denken,
(18.42, 18.6) ja,
(18.6, 18.64) ja,
(18.64, 18.84) ja,
(18.84, 18.86) ja,
(18.86, 18.98) ja,
(18.98, 19.06) ja,
(19.06, 19.62) ja,
(19.62, 19.62) ja,
(19.62, 19.76) ja,
(19.76, 19.78) ja,
(19.78, 20.16) ja,
(20.16, 20.16) ja,
(20.16, 20.2) ja,
(20.2, 20.2) ja,
(20.2, 20.28) ja,
(20.28, 20.5) ja,
(20.5, 20.5) ja,
(20.5, 20.5) ja,
(20.5, 21.04) ja,
(21.04, 21.1) ja,
(21.1, 21.1) ja,
(21.1, 21.1) ja,
(21.1, 21.1) ja,
(21.1, 21.1) ja,
(21.1, 21.1) ja,
(21.1, 21.1) ja,
(21.1, 21.1) ja,
(21.1, 21.1) ja,
(21.1, 21.1) ja,
(21.1, 21.1) ja,
(21.1, 21.1) ja,
(21.1, 21.1) ja,
(21.1, 21.1) ja,
(21.1, 21.1) ja,
(21.1, 21.1) ja,
(21.1, 21.1) ja,
(21.1, 21.1) ja,
(21.1, 21.1) ja,
(21.1, 21.1) ja,
(21.1, 21.1) ja,
(21.1, 21.1) ja,
(21.1, 21.1) ja,
(21.1, 21.1) ja,
(21.1, 21.14) ja,
(21.14, 21.14) ja,
(21.14, 21.14) ja,
(21.14, 21.14) ja,
(21.14, 21.14) ja,
(21.14, 21.14) ja,
(21.14, 21.14) ja,
(21.14, 21.14) ja,
(21.14, 21.16) ja,
(21.16, 21.16) ja,
(21.16, 21.16) ja,
(21.16, 21.16) ja,
(21.16, 21.16) ja,
(21.16, 21.16) ja,
(21.16, 21.16) ja,
(21.16, 21.16) ja,
(21.16, 21.16) ja,
(21.16, 21.16) ja,
(21.16, 21.16) ja,
(21.16, 21.16) ja,
(21.16, 21.16) ja,
(21.16, 21.16) ja,
(21.16, 21.16) ja,
(21.16, 21.16) ja,
(21.16, 21.16) ja,
(21.16, 21.16) ja,
(21.16, 21.16) ja,
(21.16, 21.16) ja,
(21.16, 21.16) ja,
(21.16, 21.16) ja,
(21.16, 21.16) ja,
(21.16, 21.16) ja,
(21.16, 21.16) ja,
(21.16, 21.16) ja,
(21.16, 21.18) ja,
(21.18, 21.18) ja,
(21.18, 21.18) ja,
(21.18, 21.18) ja,
(21.18, 21.18) ja,
(21.18, 21.18) ja,
(21.18, 21.18) ja,
(21.18, 21.18) ja,
(21.18, 21.18) ja,
(21.18, 21.18) ja,
(21.18, 21.18) ja,
(21.18, 21.38) ja,
(21.38, 21.38) ja,
(21.38, 21.38) ja,
(21.38, 21.38) ja,
(21.38, 21.38) ja,
(21.38, 21.38) ja,
(21.38, 21.38) ja,
(21.38, 21.38) ja,
(21.38, 21.38) ja,
(21.38, 21.38) ja,
(21.38, 21.38) ja,
(21.38, 21.38) ja,
(21.38, 21.38) ja,
(21.38, 21.38) ja,
(21.38, 21.38) ja,
(21.38, 21.38) ja,
(21.38, 21.38) ja,
(21.38, 21.38) ja,
(21.38, 21.38) ja,
(21.38, 21.38) ja,
(21.38, 21.38) ja,
(21.38, 21.38) ja,
(21.38, 21.38) ja,
(21.38, 21.38) ja,
(21.38, 21.38) ja,
(21.38, 21.38) ja,
(21.38, 21.38) ja,
(21.38, 21.38) ja,
(21.38, 21.38) ja,
(21.38, 21.38) ja,
(21.38, 21.38) ja,
(21.38, 21.38) ja,
(21.38, 21.38) ja,
(21.38, 21.38) ja,
(21.38, 21.38) ja,
(21.38, 21.38) ja,
(21.38, 21.38) ja,
(21.38, 21.38) ja,
(21.38, 21.38) ja,
(21.38, 21.38) ja,
(21.38, 21.38) ja,
(21.38, 21.38) ja,
(21.38, 21.38) ja,
(21.38, 21.38) ja,
(21.38, 21.38) ja,
(21.38, 21.38) ja,
(21.38, 21.38) ja,
(21.38, 21.38) ja,
(21.38, 21.38) ja,
(21.38, 21.38) ja,
(21.38, 21.38) ja,
(21.38, 21.38) ja,
(21.38, 21.38) ja,
(21.38, 21.38) ja,
(21.38, 21.38) ja,
(21.38, 21.38) ja,
(21.38, 21.38) ja,
(21.38, 21.38) ja,
(21.38, 21.44) ja,
(21.44, 21.48) ja,
(21.48, 21.48) ja,
(21.48, 21.48) ja,
(21.48, 21.48) ja,
(21.48, 21.48) ja,
(21.48, 21.48) ja,
(21.48, 21.48) ja,
(21.48, 21.48) ja,
(21.48, 21.48) ja,
(21.48, 21.48) ja,
(21.48, 21.48) ja,
(21.48, 21.48) ja,
(21.48, 21.48) ja,
(21.48, 21.48) ja,
(21.48, 21.48) ja,
(21.48, 21.48) ja,
(21.48, 21.48) ja,
(21.48, 21.48) ja,
(21.48, 21.48) ja,
(21.48, 21.48) ja,
(21.48, 21.64) ja,
(21.64, 21.64) ja,
(21.64, 21.64) ja,
(21.64, 21.64) ja,
(21.64, 21.64) ja,
(21.64, 21.92) ja,
(21.92, 21.98) ja,
(21.98, 21.98) ja,
(21.98, 21.98) ja,
(21.98, 21.98) ja,
(21.98, 21.98) ja,
(21.98, 21.98) ja,
(21.98, 22.48) ja,
(22.48, 22.48) ja,
(22.48, 22.48) ja,
(22.48, 22.48) ja,
(22.48, 22.48) ja,
(22.48, 22.56) ja,
(22.56, 24.18) ja,
(24.18, None) ja,
##############################################################################################
However, with the latest dev version, i.e. 4.54.0.dev0, the issue does not reproduce:
pip show transformers
Name: transformers
Version: 4.54.0.dev0
Summary: State-of-the-art Machine Learning for JAX, PyTorch and TensorFlow
Home-page: https://github.com/huggingface/transformers
Author: The Hugging Face team (past and future) with the help of all our contributors (https://github.com/huggingface/transformers/graphs/contributors)
Author-email: [email protected]
License: Apache 2.0 License
Location: C:\Users\Lenovo\myenv\Lib\site-packages
Editable project location: C:\Users\Lenovo\transformers
Requires: filelock, huggingface-hub, numpy, packaging, pyyaml, regex, requests, safetensors, tokenizers, tqdm
Required-by: trl
(myenv) PS C:\Users\Lenovo> python .\Transformer\bug_36093.py
Device set to use cpu
return_token_timestamps is deprecated for WhisperFeatureExtractor and will be removed in Transformers v5. Use return_attention_mask instead, as the number of frames can be inferred from it.
Using custom forced_decoder_ids from the (generation) config. This is deprecated in favor of the task and language flags/config options.
Transcription using a multilingual Whisper will default to language detection followed by transcription instead of translation to English. This might be a breaking change for your use case. If you want to instead always translate your audio to English, make sure to pass language='en'. See https://github.com/huggingface/transformers/pull/28687 for more details.
(0.0, 0.62) Viele
(0.62, 1.08) Menschen
(1.08, 1.66) haben
(1.66, 2.1) auch
(2.1, 2.72) wirklich
(2.72, 3.1) richtig
(3.1, 3.56) geahnt
(3.56, 4.38) Angst,
(4.38, 4.46) wenn
(4.46, 4.66) sie
(4.66, 4.86) sich
(4.86, 5.98) trennen
(5.98, 6.26) möchten.
(6.26, 6.36) Von
(6.36, 6.58) ihrem
(6.58, 7.14) toxischen
(7.14, 7.78) Partner,
(7.78, 7.84) weil
(7.84, 7.96) sie
(7.96, 8.2) sich
(8.2, 8.78) bedroht
(8.78, 9.28) fühlen.
(9.28, 9.7) Teilweise
(9.7, 10.02) sogar
(10.02, 11.2) lebensbedroht
(11.2, 12.0) fühlen.
(12.48, 12.98) Darum
(12.98, 13.1) ist
(13.1, 13.24) es
(13.24, 13.46) sehr
(13.46, 14.1) wichtig,
(14.1, 14.26) dort
(14.26, 14.46) auch
(14.46, 14.6) als
(14.6, 15.56) Außenstehende
(15.56, 16.66) behutzamt
(16.66, 17.44) vorzugehen.
(17.0, 17.52) und
(17.52, 17.7) nicht
(17.7, 17.98) einfach
(17.98, 18.46) denken,
(18.46, 18.58) ja,
(18.58, 18.66) es
(18.66, 18.78) ist
(18.78, 18.8) schon
(18.8, 19.02) voll
(19.02, 19.42) leise,
(19.42, 19.48) es
(19.48, 19.54) ist
(19.54, 19.62) ja
(19.62, 19.82) mega
(19.82, 20.42) klar,
(20.48, 20.54) gange
(20.54, 20.74) doch
(20.74, 21.68) einfach,
(21.68, 21.88) weil
(21.88, 22.68) nein,
(22.68, 22.74) hier
(22.74, 22.92) hat
(22.92, 23.02) es
(23.02, 23.28) viel
(23.28, 24.18) Angst,
(24.18, 24.5) viel
(24.5, 25.12) Druck,
(25.24, 25.26) wo
(25.26, 25.4) man
(25.4, 25.58) nicht
(25.58, 26.0) einfach
(26.0, 26.2) so
(26.2, 26.4) auf
(26.4, 26.48) die
(26.48, 26.64) Seite
(26.64, 26.9) legen
(26.9, 27.38) kann.
################################################################################
Has it already been fixed? @gante @dintifla
@rohitthewanderer Thanks for testing. I could not reproduce it either with the given sample, and I noticed there is no more hallucination. Maybe the issue still persists with other samples that do hallucinate/repeat output?
Not sure about other samples; do you have any sample in mind with which it might be reproduced?
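Not a verified reproducer, but one idea (purely an assumption on my part): the repetitions seem to show up on audio that is cut off mid-word or ends in silence, so padding a clip with trailing silence might provoke them again. Reusing the pipeline from the scripts above:

import numpy as np
import librosa

# Hypothetical: append 10 s of silence to try to provoke repetition hallucinations.
audio, sr = librosa.load('sample.mp3', sr=16000)
padded = np.concatenate([audio, np.zeros(10 * sr, dtype=audio.dtype)])

result = pipe(padded, generate_kwargs=generate_kwargs)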
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.