Whisper word-level timestamp extraction fails with beam search
System Info
- transformers version: 4.48.2
- Platform: macOS-15.3-arm64-arm-64bit
- Python version: 3.12.1
- Huggingface_hub version: 0.28.1
- Safetensors version: 0.4.3
- Accelerate version: 1.3.0
- Accelerate config: not found
- PyTorch version (GPU?): 2.6.0 (False)
- Tensorflow version (GPU?): not installed (NA)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Using distributed or parallel set-up in script?: No
Who can help?
@ylacombe, @eustlb
Information
- [x] The official example scripts
- [ ] My own modified scripts
Tasks
- [ ] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
- [x] My own task or dataset (give details below)
Reproduction
Steps to reproduce the behavior:
- Download the sample: https://drive.google.com/file/d/19xqGiGc1fse532d6t6u5OGI_oNbTfdJe/view?usp=sharing
- Run the sample script:
import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline

device = 'cuda' if torch.cuda.is_available() else 'mps' if torch.backends.mps.is_available() else 'cpu'
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

model_id = 'openai/whisper-small'
model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True
)
model.to(device)

processor = AutoProcessor.from_pretrained(model_id)

generate_kwargs = {
    'num_beams': 2,
}

pipe = pipeline(
    'automatic-speech-recognition',
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    torch_dtype=torch_dtype,
    device=device,
    return_timestamps='word',
)

result = pipe('efbef66e35e6456ba37461d9c5f12fcd.mp3', generate_kwargs=generate_kwargs)
for chunk in result['chunks']:
    print(f"{chunk['timestamp']} {chunk['text']}")
- Find the following error in `_extract_token_timestamps(...)`:
Traceback (most recent call last):
File ".....py", line 40, in <module>
result = pipe(str(file), generate_kwargs=generate_kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "....../lib/python3.12/site-packages/transformers/pipelines/automatic_speech_recognition.py", line 283, in __call__
return super().__call__(inputs, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "...../lib/python3.12/site-packages/transformers/pipelines/base.py", line 1354, in __call__
return next(
^^^^^
File "...../lib/python3.12/site-packages/transformers/pipelines/pt_utils.py", line 124, in __next__
item = next(self.iterator)
^^^^^^^^^^^^^^^^^^^
File "...../lib/python3.12/site-packages/transformers/pipelines/pt_utils.py", line 269, in __next__
processed = self.infer(next(self.iterator), **self.params)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "...../lib/python3.12/site-packages/transformers/pipelines/base.py", line 1269, in forward
model_outputs = self._forward(model_inputs, **forward_params)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "...../lib/python3.12/site-packages/transformers/pipelines/automatic_speech_recognition.py", line 521, in _forward
tokens = self.model.generate(
^^^^^^^^^^^^^^^^^^^^
File "....../lib/python3.12/site-packages/transformers/models/whisper/generation_whisper.py", line 774, in generate
) = self.generate_with_fallback(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "....../lib/python3.12/site-packages/transformers/models/whisper/generation_whisper.py", line 965, in generate_with_fallback
seek_sequences, seek_outputs = self._postprocess_outputs(
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "...../lib/python3.12/site-packages/transformers/models/whisper/generation_whisper.py", line 1067, in _postprocess_outputs
seek_outputs["token_timestamps"] = self._extract_token_timestamps(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "....../lib/python3.12/site-packages/transformers/models/whisper/generation_whisper.py", line 268, in _extract_token_timestamps
torch.index_select(weights[:, :, i, :], dim=0, index=beam_indices[:, i])
~~~~~~~^^^^^^^^^^^^
IndexError: index 447 is out of bounds for dimension 2 with size 447
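For what it's worth, the failing indexing can be reproduced in isolation with dummy tensors. The shapes below are assumptions chosen to match the error message, not the actual internals:

import torch

# Hypothetical shapes: the generated sequence is one step longer than the
# cached cross-attention weights used for timestamp extraction.
num_beams, num_heads, weight_len, num_frames = 2, 8, 447, 1500
seq_len = weight_len + 1

weights = torch.randn(num_beams, num_heads, weight_len, num_frames)
beam_indices = torch.zeros(num_beams, seq_len, dtype=torch.long)

for i in range(seq_len):
    # Mirrors the failing call in _extract_token_timestamps:
    # weights[:, :, i, :] raises IndexError once i reaches 447.
    torch.index_select(weights[:, :, i, :], dim=0, index=beam_indices[:, i])

If that reading is right, the loop over decoded positions runs one step past the time dimension of the attention weights when beam search is used.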
The sentence-level timestamps produce the following result (return_timestamps=True):
(0.0, 6.0) Viele Menschen haben auch wirklich richtig geahnt Angst, wenn sie sich trennen möchten.
(6.0, 12.0) Von ihrem toxischen Partner, weil sie sich bedroht fühlen. Teilweise sogar lebensbedroht fühlen.
(12.0, 17.0) Darum ist es sehr wichtig, dort auch als Außenstehende behutzamt vorzugehen.
(17.0, 0.0)
(3.32, 7.92) und nicht einfach denken, ja, es ist schon voll leise, es ist ja mega klar, gange doch einfach, weil nein, hier hat es viel Angst, viel Druck,
(7.92, 10.32) wo man nicht einfach so auf die Seite legen kann.
Here, the segment starting at 17.0 s has an empty string as its text.
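For completeness, the sentence-level run only differs from the script above in the timestamp flag:

pipe = pipeline(
    'automatic-speech-recognition',
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    torch_dtype=torch_dtype,
    device=device,
    return_timestamps=True,  # sentence-level instead of 'word'
)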
Furthermore, without beam search (`generate_kwargs={}`) and with word-level timestamps, I get the following hallucination:
(0.0, 0.62) Viele
(0.62, 1.14) Menschen
(1.14, 1.68) haben
(1.68, 2.1) auch
(2.1, 2.78) richtig
(2.78, 3.1) richtig
(3.1, 3.58) geahnt
(3.58, 4.4) Angst,
(4.4, 4.46) wenn
(4.46, 4.66) sie
(4.66, 4.86) sich
(4.86, 5.98) trennen
(5.98, 6.3) möchten.
(6.3, 6.36) Von
(6.36, 6.56) ihrem
(6.56, 7.14) toxischen
(7.14, 7.78) Partner,
(7.78, 7.84) weil
(7.84, 7.96) sie
(7.96, 8.2) sich
(8.2, 8.78) bedroht
(8.78, 9.3) fühlen,
(9.3, 9.6) teilweise
(9.6, 10.02) sogar
(10.02, 11.2) lebensbedroht
(11.2, 12.3) fühlen.
(12.3, 12.98) Darum
(12.98, 13.1) ist
(13.1, 13.24) es
(13.24, 13.46) sehr
(13.46, 14.1) wichtig,
(14.1, 14.24) dort
(14.24, 14.46) auch
(14.46, 14.6) als
(14.6, 15.58) Außenstehende
(15.58, 16.74) behutzeim
(16.74, 17.38) vorzugehen
(17.38, 17.54) und
(17.54, 17.7) nicht
(17.7, 17.98) einfach
(17.98, 18.44) denken,
(18.44, 18.6) ja,
(18.6, 18.64) ja,
...
(22.48, 22.48) ja,
(22.48, 22.48) ja,
(22.48, 22.56) ja,
(22.56, 24.18) ja,
(24.18, None) ja,
In another setup, where I used a batch size and loaded the samples with librosa beforehand (a sketch follows the traceback below), I also saw the following error in the same function (`_extract_token_timestamps(...)`), but I don't know whether the root cause is related:
File "..../lib/python3.12/site-packages/transformers/pipelines/automatic_speech_recognition.py", line 283, in __call__
return super().__call__(inputs, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "...../lib/python3.12/site-packages/transformers/pipelines/base.py", line 1343, in __call__
outputs = list(final_iterator)
^^^^^^^^^^^^^^^^^^^^
File "....../lib/python3.12/site-packages/transformers/pipelines/pt_utils.py", line 124, in __next__
item = next(self.iterator)
^^^^^^^^^^^^^^^^^^^
File "....../lib/python3.12/site-packages/transformers/pipelines/pt_utils.py", line 269, in __next__
processed = self.infer(next(self.iterator), **self.params)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "...../lib/python3.12/site-packages/transformers/pipelines/base.py", line 1269, in forward
model_outputs = self._forward(model_inputs, **forward_params)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "....../lib/python3.12/site-packages/transformers/pipelines/automatic_speech_recognition.py", line 521, in _forward
tokens = self.model.generate(
^^^^^^^^^^^^^^^^^^^^
File "...../lib/python3.12/site-packages/transformers/models/whisper/generation_whisper.py", line 774, in generate
) = self.generate_with_fallback(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "....../lib/python3.12/site-packages/transformers/models/whisper/generation_whisper.py", line 965, in generate_with_fallback
seek_sequences, seek_outputs = self._postprocess_outputs(
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "...../lib/python3.12/site-packages/transformers/models/whisper/generation_whisper.py", line 1067, in _postprocess_outputs
seek_outputs["token_timestamps"] = self._extract_token_timestamps(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "....../lib/python3.12/site-packages/transformers/models/whisper/generation_whisper.py", line 315, in _extract_token_timestamps
matrix = weights[batch_idx, ..., : num_frames[batch_idx] // 2]
~~~~~~~~~~^^^^^^^^^^^
IndexError: index 0 is out of bounds for dimension 0 with size 0
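For reference, that setup looked roughly like the sketch below; the file names and batch size are placeholders, only the librosa loading and the batching are as described:

import librosa

# Placeholder file list; samples are decoded to 16 kHz mono arrays,
# matching the Whisper feature extractor's sampling rate.
files = ['sample1.mp3', 'sample2.mp3']
samples = [librosa.load(f, sr=16000)[0] for f in files]

result = pipe(samples, batch_size=2, generate_kwargs=generate_kwargs)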
Some things I noticed:
- Without beam search, it works (the example above uses a beam width of 2; see the kwargs diff below).
- It must be related to hallucinated repetitions: I got the same issue with some fine-tuned models that hallucinate, and the example above also hallucinates without beam search.

Maybe someone else can reproduce this with another sample.
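To be explicit, the only difference between the failing and working runs is the beam setting:

generate_kwargs = {'num_beams': 2}  # beam search: crashes with return_timestamps='word'
generate_kwargs = {}                # greedy decoding: works, but hallucinates as shown above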
Expected behavior
Inference with word-level timestamps should not fail if something goes wrong during beam search. I would expect, e.g., None timestamps, as we get for sentence-level timestamps.
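In the meantime, a possible user-side stopgap (a sketch, not the behavior I would expect from the library) is to catch the failure and retry with sentence-level timestamps, which do not go through `_extract_token_timestamps`:

try:
    result = pipe('efbef66e35e6456ba37461d9c5f12fcd.mp3', generate_kwargs=generate_kwargs)
except IndexError:
    # Fallback: segment-level timestamps avoid the failing extraction path.
    result = pipe(
        'efbef66e35e6456ba37461d9c5f12fcd.mp3',
        return_timestamps=True,
        generate_kwargs=generate_kwargs,
    )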
Not sure if this is @eustlb or @gante, pinging both (with apologies!)
I am having the same issue (IndexError: index 447 is out of bounds for dimension 2 with size 447). Have you found any solution to this?
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.
Any idea on this one @eustlb @gante ?
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.
Not stale.
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.
Not stale. @eustlb @gante
👋 Thank you for opening the issue and keeping it live!
I've confirmed that the following self-contained script works fine in e.g. v4.47.0, but not on main. Investigating cause and fix :)
EDIT: actually, the test snippet shared above has been failing for much longer (I've tried all minor versions back to v4.43, and they are all broken 👀)
import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline
from datasets import load_dataset

device = 'cuda' if torch.cuda.is_available() else 'mps' if torch.backends.mps.is_available() else 'cpu'
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

model_id = 'openai/whisper-small'
model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True
)
model.to(device)

processor = AutoProcessor.from_pretrained(model_id)

generate_kwargs = {
    'num_beams': 2,
}

pipe = pipeline(
    'automatic-speech-recognition',
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    torch_dtype=torch_dtype,
    device=device,
    return_timestamps='word',
)

def _load_datasamples(num_samples):
    ds = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
    # automatic decoding with librispeech
    speech_samples = ds.sort("id").select(range(num_samples))[:num_samples]["audio"]
    return [x["array"] for x in speech_samples]

result = pipe(_load_datasamples(1), generate_kwargs=generate_kwargs)
for chunk in result[0]['chunks']:
    print(f"{chunk['timestamp']} {chunk['text']}")
@dintifla @maxkvbn @Maria-Ponte #38259 fixes the shape issues. Note that word timestamps still have a few issues (e.g. see #36632), and therefore the output is not yet perfect :)
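(For anyone wanting to try this before the next release: once #38259 is merged, the fix can be picked up by installing transformers from source, e.g. `pip install git+https://github.com/huggingface/transformers`.)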
With #38259, the test script above outputs
(0.0, 0.62) Viele
(0.62, 1.14) Menschen
(1.14, 1.68) haben
(1.68, 2.1) auch
(2.1, 2.72) wirklich
(2.72, 3.1) richtig
(3.1, 3.56) geahnt
(3.56, 4.4) Angst,
(4.4, 4.46) wenn
(4.46, 4.66) sie
(4.66, 4.86) sich
(4.86, 5.98) trennen
(5.98, 6.34) möchten,
(6.34, 6.34) von
(6.34, 6.56) ihrem
(6.56, 7.14) toxischen
(7.14, 7.74) Partner,
(7.74, 7.84) weil
(7.84, 7.96) sie
(7.96, 8.2) sich
(8.2, 8.78) bedroht
(8.78, 9.3) fühlen,
(9.3, 9.6) teilweise
(9.6, 10.02) sogar
(10.02, 11.2) lebensbedroht
(11.2, 12.3) fühlen.
(12.3, 12.98) Darum
(12.98, 13.1) ist
(13.1, 13.24) es
(13.24, 13.46) sehr
(13.46, 14.1) wichtig,
(14.1, 14.24) dort
(14.24, 14.46) auch
(14.46, 14.6) als
(14.6, 15.56) Außenstehende
(15.56, 16.66) behutzamt
(16.66, 17.36) vorzugehen
(...)
(21.64, 21.64) ja,
(21.64, 21.64) ja,
(21.64, 21.64) ja,
(21.64, 22.48) ja,
(22.48, 22.48) ja,
(22.48, 22.48) ja,
(22.48, 22.48) ja,
(22.48, 22.48) ja,
(22.48, 22.56) ja,
(22.56, 24.18) ja,
(24.18, None) ja,
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.
As @gante states, the issue does not seem fully resolved.
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.
As @gante states, the issue does not seem fully resolved.
Hey guys,
I tested with the script given by the bug filer:
#####################################################
import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline

device = 'cuda' if torch.cuda.is_available() else 'mps' if torch.backends.mps.is_available() else 'cpu'
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

model_id = 'openai/whisper-small'
model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True
)
model.to(device)

processor = AutoProcessor.from_pretrained(model_id)

generate_kwargs = {'num_beams': 2}

pipe = pipeline(
    'automatic-speech-recognition',
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    torch_dtype=torch_dtype,
    device=device,
    return_timestamps='word',
)

result = pipe('C:/Users/Lenovo/Transformer/efbef66e35e6456ba37461d9c5f12fcd.mp3', generate_kwargs=generate_kwargs)
for chunk in result['chunks']:
    print(f"{chunk['timestamp']} {chunk['text']}")
#########################################################################
With the release version, i.e. 4.53.3, I could reproduce the issue discussed above:
pip show transformers
Name: transformers
Version: 4.53.3
Summary: State-of-the-art Machine Learning for JAX, PyTorch and TensorFlow
Home-page: https://github.com/huggingface/transformers
Author: The Hugging Face team (past and future) with the help of all our contributors (https://github.com/huggingface/transformers/graphs/contributors)
Author-email: [email protected]
License: Apache 2.0 License
Location: C:\Users\Lenovo\myenv\Lib\site-packages
Requires: filelock, huggingface-hub, numpy, packaging, pyyaml, regex, requests, safetensors, tokenizers, tqdm
Required-by: trl
(myenv) PS C:\Users\Lenovo> python .\Transformer\bug_36093.py
Device set to use cpu
C:\Users\Lenovo\myenv\Lib\site-packages\transformers\models\whisper\generation_whisper.py:604: FutureWarning: The input name inputs is deprecated. Please make sure to use input_features instead.
warnings.warn(
Using custom forced_decoder_ids from the (generation) config. This is deprecated in favor of the task and language flags/config options.
Transcription using a multilingual Whisper will default to language detection followed by transcription instead of translation to English. This might be a breaking change for your use case. If you want to instead always translate your audio to English, make sure to pass language='en'. See https://github.com/huggingface/transformers/pull/28687 for more details.
Whisper did not predict an ending timestamp, which can happen if audio is cut off in the middle of a word. Also make sure WhisperTimeStampLogitsProcessor was used during generation.
(0.0, 0.62) Viele
(0.62, 1.14) Menschen
(1.14, 1.68) haben
(1.68, 2.1) auch
(2.1, 2.72) wirklich
(2.72, 3.1) richtig
(3.1, 3.56) geahnt
(3.56, 4.4) Angst,
(4.4, 4.46) wenn
(4.46, 4.66) sie
(4.66, 4.86) sich
(4.86, 5.98) trennen
(5.98, 6.34) möchten,
(6.34, 6.34) von
(6.34, 6.56) ihrem
(6.56, 7.14) toxischen
(7.14, 7.74) Partner,
(7.74, 7.84) weil
(7.84, 7.96) sie
(7.96, 8.2) sich
(8.2, 8.78) bedroht
(8.78, 9.3) fühlen,
(9.3, 9.6) teilweise
(9.6, 10.02) sogar
(10.02, 11.2) lebensbedroht
(11.2, 12.3) fühlen.
(12.3, 12.98) Darum
(12.98, 13.1) ist
(13.1, 13.24) es
(13.24, 13.46) sehr
(13.46, 14.1) wichtig,
(14.1, 14.24) dort
(14.24, 14.46) auch
(14.46, 14.6) als
(14.6, 15.56) Außenstehende
(15.56, 16.66) behutzamt
(16.66, 17.36) vorzugehen
(17.36, 17.54) und
(17.54, 17.7) nicht
(17.7, 17.96) einfach
(17.96, 18.42) denken,
(18.42, 18.6) ja,
(18.6, 18.64) ja,
(18.64, 18.84) ja,
(18.84, 18.86) ja,
(18.86, 18.98) ja,
(18.98, 19.06) ja,
(19.06, 19.62) ja,
(19.62, 19.62) ja,
(19.62, 19.76) ja,
(19.76, 19.78) ja,
(19.78, 20.16) ja,
(20.16, 20.16) ja,
(20.16, 20.2) ja,
(20.2, 20.2) ja,
(20.2, 20.28) ja,
(20.28, 20.5) ja,
(20.5, 20.5) ja,
(20.5, 20.5) ja,
(20.5, 21.04) ja,
(21.04, 21.1) ja,
(21.1, 21.1) ja,
(21.1, 21.1) ja,
(21.1, 21.1) ja,
(21.1, 21.1) ja,
(21.1, 21.1) ja,
(21.1, 21.1) ja,
(21.1, 21.1) ja,
(21.1, 21.1) ja,
(21.1, 21.1) ja,
(21.1, 21.1) ja,
(21.1, 21.1) ja,
(21.1, 21.1) ja,
(21.1, 21.1) ja,
(21.1, 21.1) ja,
(21.1, 21.1) ja,
(21.1, 21.1) ja,
(21.1, 21.1) ja,
(21.1, 21.1) ja,
(21.1, 21.1) ja,
(21.1, 21.1) ja,
(21.1, 21.1) ja,
(21.1, 21.1) ja,
(21.1, 21.1) ja,
(21.1, 21.14) ja,
(21.14, 21.14) ja,
(21.14, 21.14) ja,
(21.14, 21.14) ja,
(21.14, 21.14) ja,
(21.14, 21.14) ja,
(21.14, 21.14) ja,
(21.14, 21.14) ja,
(21.14, 21.16) ja,
(21.16, 21.16) ja,
(21.16, 21.16) ja,
(21.16, 21.16) ja,
(21.16, 21.16) ja,
(21.16, 21.16) ja,
(21.16, 21.16) ja,
(21.16, 21.16) ja,
(21.16, 21.16) ja,
(21.16, 21.16) ja,
(21.16, 21.16) ja,
(21.16, 21.16) ja,
(21.16, 21.16) ja,
(21.16, 21.16) ja,
(21.16, 21.16) ja,
(21.16, 21.16) ja,
(21.16, 21.16) ja,
(21.16, 21.16) ja,
(21.16, 21.16) ja,
(21.16, 21.16) ja,
(21.16, 21.16) ja,
(21.16, 21.16) ja,
(21.16, 21.16) ja,
(21.16, 21.16) ja,
(21.16, 21.16) ja,
(21.16, 21.16) ja,
(21.16, 21.18) ja,
(21.18, 21.18) ja,
(21.18, 21.18) ja,
(21.18, 21.18) ja,
(21.18, 21.18) ja,
(21.18, 21.18) ja,
(21.18, 21.18) ja,
(21.18, 21.18) ja,
(21.18, 21.18) ja,
(21.18, 21.18) ja,
(21.18, 21.18) ja,
(21.18, 21.38) ja,
(21.38, 21.38) ja,
(21.38, 21.38) ja,
(21.38, 21.38) ja,
(21.38, 21.38) ja,
(21.38, 21.38) ja,
(21.38, 21.38) ja,
(21.38, 21.38) ja,
(21.38, 21.38) ja,
(21.38, 21.38) ja,
(21.38, 21.38) ja,
(21.38, 21.38) ja,
(21.38, 21.38) ja,
(21.38, 21.38) ja,
(21.38, 21.38) ja,
(21.38, 21.38) ja,
(21.38, 21.38) ja,
(21.38, 21.38) ja,
(21.38, 21.38) ja,
(21.38, 21.38) ja,
(21.38, 21.38) ja,
(21.38, 21.38) ja,
(21.38, 21.38) ja,
(21.38, 21.38) ja,
(21.38, 21.38) ja,
(21.38, 21.38) ja,
(21.38, 21.38) ja,
(21.38, 21.38) ja,
(21.38, 21.38) ja,
(21.38, 21.38) ja,
(21.38, 21.38) ja,
(21.38, 21.38) ja,
(21.38, 21.38) ja,
(21.38, 21.38) ja,
(21.38, 21.38) ja,
(21.38, 21.38) ja,
(21.38, 21.38) ja,
(21.38, 21.38) ja,
(21.38, 21.38) ja,
(21.38, 21.38) ja,
(21.38, 21.38) ja,
(21.38, 21.38) ja,
(21.38, 21.38) ja,
(21.38, 21.38) ja,
(21.38, 21.38) ja,
(21.38, 21.38) ja,
(21.38, 21.38) ja,
(21.38, 21.38) ja,
(21.38, 21.38) ja,
(21.38, 21.38) ja,
(21.38, 21.38) ja,
(21.38, 21.38) ja,
(21.38, 21.38) ja,
(21.38, 21.38) ja,
(21.38, 21.38) ja,
(21.38, 21.38) ja,
(21.38, 21.38) ja,
(21.38, 21.38) ja,
(21.38, 21.44) ja,
(21.44, 21.48) ja,
(21.48, 21.48) ja,
(21.48, 21.48) ja,
(21.48, 21.48) ja,
(21.48, 21.48) ja,
(21.48, 21.48) ja,
(21.48, 21.48) ja,
(21.48, 21.48) ja,
(21.48, 21.48) ja,
(21.48, 21.48) ja,
(21.48, 21.48) ja,
(21.48, 21.48) ja,
(21.48, 21.48) ja,
(21.48, 21.48) ja,
(21.48, 21.48) ja,
(21.48, 21.48) ja,
(21.48, 21.48) ja,
(21.48, 21.48) ja,
(21.48, 21.48) ja,
(21.48, 21.48) ja,
(21.48, 21.64) ja,
(21.64, 21.64) ja,
(21.64, 21.64) ja,
(21.64, 21.64) ja,
(21.64, 21.64) ja,
(21.64, 21.92) ja,
(21.92, 21.98) ja,
(21.98, 21.98) ja,
(21.98, 21.98) ja,
(21.98, 21.98) ja,
(21.98, 21.98) ja,
(21.98, 21.98) ja,
(21.98, 22.48) ja,
(22.48, 22.48) ja,
(22.48, 22.48) ja,
(22.48, 22.48) ja,
(22.48, 22.48) ja,
(22.48, 22.56) ja,
(22.56, 24.18) ja,
(24.18, None) ja,
##############################################################################################
However, with the latest dev version, i.e. 4.54.0.dev0, the issue does not reproduce:
pip show transformers
Name: transformers
Version: 4.54.0.dev0
Summary: State-of-the-art Machine Learning for JAX, PyTorch and TensorFlow
Home-page: https://github.com/huggingface/transformers
Author: The Hugging Face team (past and future) with the help of all our contributors (https://github.com/huggingface/transformers/graphs/contributors)
Author-email: [email protected]
License: Apache 2.0 License
Location: C:\Users\Lenovo\myenv\Lib\site-packages
Editable project location: C:\Users\Lenovo\transformers
Requires: filelock, huggingface-hub, numpy, packaging, pyyaml, regex, requests, safetensors, tokenizers, tqdm
Required-by: trl
(myenv) PS C:\Users\Lenovo> python .\Transformer\bug_36093.py
Device set to use cpu
return_token_timestamps is deprecated for WhisperFeatureExtractor and will be removed in Transformers v5. Use return_attention_mask instead, as the number of frames can be inferred from it.
Using custom forced_decoder_ids from the (generation) config. This is deprecated in favor of the task and language flags/config options.
Transcription using a multilingual Whisper will default to language detection followed by transcription instead of translation to English. This might be a breaking change for your use case. If you want to instead always translate your audio to English, make sure to pass language='en'. See https://github.com/huggingface/transformers/pull/28687 for more details.
(0.0, 0.62) Viele
(0.62, 1.08) Menschen
(1.08, 1.66) haben
(1.66, 2.1) auch
(2.1, 2.72) wirklich
(2.72, 3.1) richtig
(3.1, 3.56) geahnt
(3.56, 4.38) Angst,
(4.38, 4.46) wenn
(4.46, 4.66) sie
(4.66, 4.86) sich
(4.86, 5.98) trennen
(5.98, 6.26) möchten.
(6.26, 6.36) Von
(6.36, 6.58) ihrem
(6.58, 7.14) toxischen
(7.14, 7.78) Partner,
(7.78, 7.84) weil
(7.84, 7.96) sie
(7.96, 8.2) sich
(8.2, 8.78) bedroht
(8.78, 9.28) fühlen.
(9.28, 9.7) Teilweise
(9.7, 10.02) sogar
(10.02, 11.2) lebensbedroht
(11.2, 12.0) fühlen.
(12.48, 12.98) Darum
(12.98, 13.1) ist
(13.1, 13.24) es
(13.24, 13.46) sehr
(13.46, 14.1) wichtig,
(14.1, 14.26) dort
(14.26, 14.46) auch
(14.46, 14.6) als
(14.6, 15.56) Außenstehende
(15.56, 16.66) behutzamt
(16.66, 17.44) vorzugehen.
(17.0, 17.52) und
(17.52, 17.7) nicht
(17.7, 17.98) einfach
(17.98, 18.46) denken,
(18.46, 18.58) ja,
(18.58, 18.66) es
(18.66, 18.78) ist
(18.78, 18.8) schon
(18.8, 19.02) voll
(19.02, 19.42) leise,
(19.42, 19.48) es
(19.48, 19.54) ist
(19.54, 19.62) ja
(19.62, 19.82) mega
(19.82, 20.42) klar,
(20.48, 20.54) gange
(20.54, 20.74) doch
(20.74, 21.68) einfach,
(21.68, 21.88) weil
(21.88, 22.68) nein,
(22.68, 22.74) hier
(22.74, 22.92) hat
(22.92, 23.02) es
(23.02, 23.28) viel
(23.28, 24.18) Angst,
(24.18, 24.5) viel
(24.5, 25.12) Druck,
(25.24, 25.26) wo
(25.26, 25.4) man
(25.4, 25.58) nicht
(25.58, 26.0) einfach
(26.0, 26.2) so
(26.2, 26.4) auf
(26.4, 26.48) die
(26.48, 26.64) Seite
(26.64, 26.9) legen
(26.9, 27.38) kann.
################################################################################
Has it already been fixed? @gante @dintifla
@rohitthewanderer Thanks for testing. I could not reproduce it either with the given sample, and I noticed there is no more hallucination. Maybe the issue still persists with other samples that do hallucinate/repeat output?
Not sure about other samples; do you have any sample in mind with which it might be reproduced?
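Not a verified reproducer, but one idea (purely an assumption on my part): the repetitions seem to show up on audio that is cut off mid-word or ends in silence, so padding a clip with trailing silence might provoke them again. Reusing the pipeline from the scripts above:

import numpy as np
import librosa

# Hypothetical: append 10 s of silence to try to provoke repetition hallucinations.
audio, sr = librosa.load('sample.mp3', sr=16000)
padded = np.concatenate([audio, np.zeros(10 * sr, dtype=audio.dtype)])

result = pipe(padded, generate_kwargs=generate_kwargs)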
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.