IndexError: index -1 is out of bounds for dimension 1 with size 0
System Info
PC: M2
transformers==4.31.0.dev0
Reference: https://github.com/openai/whisper/discussions/1478
I hit the following error:
in <module>:9

     6 prompt_ids = processor.get_prompt_ids(prompt)
     7
     8 forced_decoder_ids = processor.get_decoder_prompt_ids(language="zh", task="transcribe")
  ❱  9 predicted_ids = model.generate(input_features, prompt_ids=prompt_ids, forced_decoder_ids=forced_decoder_ids,
    10                                max_new_tokens=3000)
    11 transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)
    12 print("耗时:", time.time() - start_time, transcription)

/Users/diaojunxian/anaconda3/envs/3.9/lib/python3.9/site-packages/transformers/models/whisper/modeling_whisper.py:1664 in generate

  1661         if generation_config.return_timestamps:
  1662             logits_processor = [WhisperTimeStampLogitsProcessor(generation_config)]
  1663
❱ 1664         return super().generate(
  1665             inputs,
  1666             generation_config,
  1667             logits_processor,

/Users/diaojunxian/anaconda3/envs/3.9/lib/python3.9/site-packages/torch/utils/_contextlib.py:115 in decorate_context

   112     @functools.wraps(func)
   113     def decorate_context(*args, **kwargs):
   114         with ctx_factory():
❱  115             return func(*args, **kwargs)
   116
   117     return decorate_context
   118

/Users/diaojunxian/anaconda3/envs/3.9/lib/python3.9/site-packages/transformers/generation/utils.py:1522 in generate

  1519                 )
  1520
  1521             # 11. run greedy search
❱ 1522             return self.greedy_search(
  1523                 input_ids,
  1524                 logits_processor=logits_processor,
  1525                 stopping_criteria=stopping_criteria,

/Users/diaojunxian/anaconda3/envs/3.9/lib/python3.9/site-packages/transformers/generation/utils.py:2349 in greedy_search

  2346             if synced_gpus and this_peer_finished:
  2347                 continue  # don't waste resources running the code we don't need
  2348
❱ 2349             next_token_logits = outputs.logits[:, -1, :]
  2350
  2351             # pre-process distribution
  2352             next_tokens_scores = logits_processor(input_ids, next_token_logits)
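The failure in the last frame can be reproduced in isolation: once the sequence dimension of the logits tensor is 0, indexing position -1 raises exactly this `IndexError`. A minimal sketch (the vocabulary size is an arbitrary example value):

```python
import torch

# Logits with batch size 1, zero sequence positions, and an example vocab size:
# this is effectively the shape the model returns once generation has no room
# left for new tokens.
logits = torch.empty(1, 0, 51865)

try:
    next_token_logits = logits[:, -1, :]  # what greedy_search does
except IndexError as e:
    print(e)  # -> index -1 is out of bounds for dimension 1 with size 0
```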
Both of the following snippets trigger the error.
from transformers import WhisperForConditionalGeneration, WhisperProcessor
import librosa
import soundfile
import torchaudio
base_model = "/Users/ddd/Documents/github/whisper-large-v2"
processor = WhisperProcessor.from_pretrained(base_model,
                                             language="zh",
                                             task="transcribe",
                                             local_files_only=True)
forced_decoder_ids = processor.get_decoder_prompt_ids(language="zh", task="transcribe")
# load the model
model = WhisperForConditionalGeneration.from_pretrained(base_model,
                                                        device_map="auto",
                                                        local_files_only=True).half()
model.eval()
audio_file = "/Users/ddd/Documents/gitlab/llm-train/yuyin/simple.m4a"
src_signal, sample_rate = librosa.load(audio_file, sr=16000)
start = 23196064
end = 23364576
src_signal_demo = src_signal[start:end]
input_features = processor(src_signal_demo, sampling_rate=sample_rate, return_tensors="pt").input_features.half().to("mps")
prompt = '以下是普通话的句子'  # "The following are Mandarin sentences"
prompt_ids = processor.get_prompt_ids(prompt)
forced_decoder_ids = processor.get_decoder_prompt_ids(language="zh", task="transcribe")
predicted_ids = model.generate(input_features, prompt_ids=prompt_ids, forced_decoder_ids=forced_decoder_ids,
                               max_new_tokens=3000)
transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)
from transformers import pipeline
pipe = pipeline(
    task="automatic-speech-recognition",
    model="openai/whisper-large-v2",
    device="mps",
    chunk_length_s=30,  # if not set, generation is capped at `max_new_tokens`
    generate_kwargs={"num_beams": 5},  # matches the openai-whisper default
)
audio_file = "/Users/ddd/Documents/gitlab/llm-train/yuyin/simple.m4a"
src_signal, sample_rate = librosa.load(audio_file, sr=16000)
start = 23196064
end = 23364576
src_signal_demo = src_signal[start:end]
prompt = '以下是普通话的句子'  # "The following are Mandarin sentences"
prompt_ids = pipe.tokenizer.get_prompt_ids(prompt, return_tensors="pt")
result = pipe(src_signal_demo, generate_kwargs={"language": "zh", "task": "transcribe", "prompt_ids": prompt_ids})
print(result["text"])
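For context, the sample indices used in both snippets select roughly a 10.5-second slice (at the 16 kHz rate `librosa.load` resamples to), so the clip length itself is well within a single 30-second Whisper window:

```python
SR = 16000  # sampling rate passed to librosa.load in the snippets above
start, end = 23196064, 23364576

duration_s = (end - start) / SR
print(f"slice length: {duration_s:.2f} s")                    # slice length: 10.53 s
print(f"slice position: {start / SR:.1f}-{end / SR:.1f} s")   # position within the file
```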
Who can help?
No response
Information
- [ ] The official example scripts
- [x] My own modified scripts
Tasks
- [ ] An officially supported task in the `examples` folder (such as GLUE/SQuAD, ...)
- [x] My own task or dataset (give details below)
Reproduction
- load the audio
- slice the audio
- add the prompt
- transcribe the sliced audio, which raises the error
Expected behavior
The audio should be transcribed to text.
cc @gante @sanchit-gandhi
Hey @diaojunxian 👋
Your reproducer contains private data, which means we can't easily reproduce on our end -- would you be able to share the audio file with us OR rewrite the reproducer from public data?
At a first glance, because of the thrown exception (IndexError: index -1 is out of bounds for dimension 1 with size 0 in next_token_logits = outputs.logits[:, -1, :]), I'd bet something went wrong at preprocessing time :D bad model input shapes -> bad model output shapes
> Hey @diaojunxian 👋
>
> Your reproducer contains private data, which means we can't easily reproduce on our end -- would you be able to share the audio file with us OR rewrite the reproducer from public data?
>
> At a first glance, because of the thrown exception (`IndexError: index -1 is out of bounds for dimension 1 with size 0` in `next_token_logits = outputs.logits[:, -1, :]`), I'd bet something went wrong at preprocessing time :D bad model input shapes -> bad model output shapes
I can send it to you privately, but it cannot be published on the internet, so only you would be able to verify the bug personally. Would that work?
@diaojunxian yeah, that would be helpful. You can send it to the email attached to my GH account ([email protected])
You are using an unmodified openai/whisper-large-v2, correct?
> start = 23196064
> end = 23364576
Yes, an unmodified whisper-large-v2, and I have sent the audio to your Gmail.
Hey @diaojunxian 👋
In both snippets, the problem is the same: as soon as the model tries to generate beyond its maximum length, the output sequence dimension becomes 0, causing the exception.
I've found the issue and will open a PR to fix it. The second example you provided works perfectly after the fix. The first one will probably still fail because of max_new_tokens=3000 (Whisper's maximum length is 448 tokens, and generation defaults to that maximum, so you probably shouldn't set max_new_tokens at all :) )
After the PR linked above gets merged, you can install from main and it should work :)
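For anyone hitting the same wall: the 448-token cap covers everything in the decoder's sequence, so the budget for the transcription itself shrinks by the prompt tokens and the forced decoder IDs. A back-of-the-envelope sketch (448 is Whisper's decoder cap, `max_target_positions` in the model config; the token counts below are made-up example values):

```python
MAX_TARGET_POSITIONS = 448  # Whisper's decoder length cap

def transcription_budget(n_prompt_tokens: int, n_forced_ids: int) -> int:
    """Tokens left for the transcription text itself once the prompt and
    the forced decoder IDs are accounted for."""
    return MAX_TARGET_POSITIONS - n_prompt_tokens - n_forced_ids

# e.g. a 10-token prompt plus a few forced IDs (start/language/task tokens)
print(transcription_budget(10, 3))  # 435
```

Requesting `max_new_tokens=3000` blows far past this cap, which is why letting `generate()` fall back to its default length is the right call here.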