Wrong calculation of the step size for overlapping inference in the Distil-Whisper model
System Info
- transformers version: 4.39.0.dev0
- Platform: Linux-5.15.0-48-generic-x86_64-with-glibc2.35
- Python version: 3.10.12
- Huggingface_hub version: 0.20.2
- Safetensors version: 0.4.1
- Accelerate version: 0.27.2
- Accelerate config: not found
- PyTorch version (GPU?): 2.1.0a0+32f93b1 (True)
- Tensorflow version (GPU?): not installed (NA)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Using GPU in script?: yes
- Using distributed or parallel set-up in script?: no
Who can help?
@sanchit-gandhi @Narsil
Information
- [X] The official example scripts
- [ ] My own modified scripts
Tasks
- [X] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
- [ ] My own task or dataset (give details below)
Reproduction
import torch
import IPython.display as ipd
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline
from datasets import load_dataset
device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32
model_id = "distil-whisper/distil-large-v2"
model = AutoModelForSpeechSeq2Seq.from_pretrained(
model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True
)
model.to(device)
processor = AutoProcessor.from_pretrained(model_id)
pipe = pipeline(
"automatic-speech-recognition",
model=model,
tokenizer=processor.tokenizer,
feature_extractor=processor.feature_extractor,
max_new_tokens=128,
chunk_length_s=15,
batch_size=16,
torch_dtype=torch_dtype,
device=device,
)
dataset = load_dataset("distil-whisper/librispeech_long", "clean", split="validation")
sample = dataset[0]["audio"]
result = pipe(sample)
print(result["text"])
Expected behavior
In this line of the code https://github.com/huggingface/transformers/blob/0290ec19c901adc0f1230ebdccad11c40af026f5/src/transformers/pipelines/automatic_speech_recognition.py#L62, the step for the overlapping inference is calculated incorrectly: it produces a total overlap of 2 * (stride_left + stride_right) per chunk, whereas the intended total overlap is stride_left + stride_right, with stride_left for the overlap with the left chunk and stride_right for the overlap with the right chunk.
Consider, for example, chunk_len = 15, stride_left = 3, stride_right = 3, so step = 15 - 3 - 3 = 9. Chunk 0: start=0, end=15; chunk 1: start=9, end=24; chunk 2: start=18, end=33.
For chunk 1, the overlap with chunk 0 is 6 and the overlap with chunk 2 is 6, a total overlap of 12; the intended overlap, however, was 3 with the left chunk and 3 with the right chunk, a total of 6.
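The arithmetic above can be checked with a small standalone sketch (this is not the pipeline code itself; `chunk_windows` is a hypothetical helper that mimics how chunk start positions advance by `step`):

```python
# Sketch of the chunking arithmetic discussed above: with
# step = chunk_len - stride_left - stride_right, each pair of
# successive chunks overlaps by stride_left + stride_right samples.

def chunk_windows(total_len, chunk_len, stride_left, stride_right):
    """Return (start, end) windows advancing by the pipeline's step."""
    step = chunk_len - stride_left - stride_right
    return [(start, min(start + chunk_len, total_len))
            for start in range(0, total_len, step)]

windows = chunk_windows(total_len=40, chunk_len=15, stride_left=3, stride_right=3)
print(windows)  # [(0, 15), (9, 24), (18, 33), (27, 40), (36, 40)]

# Overlap between chunk 0 and chunk 1: end of chunk 0 minus start of chunk 1.
overlap = windows[0][1] - windows[1][0]
print(overlap)  # 6 == stride_left + stride_right, not 3
```

So with stride_left = stride_right = 3, each neighbouring pair of chunks shares 6 samples, which is what the issue reports.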
cc @kamilakesbi
Hi @systemdevart,
Thank you for this question!
Here, stride_left denotes the overlap between the current chunk and the chunk to its left, measured after accounting for the fact that the stride_right samples are no longer part of the left chunk.
This image from this blog shows it visually:
In this image, both stride_left and stride_right are set to 3, resulting in a total overlap of 6 between successive chunks.
If you want an overlap of 3 between successive chunks, you could, for example, set stride_left to 3 and stride_right to 0.
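Under that convention, the suggestion above can be sketched with the same step arithmetic (a standalone illustration, not the pipeline code):

```python
# With stride_left = 3 and stride_right = 0, the step becomes
# 15 - 3 - 0 = 12, so successive chunks overlap by exactly 3 samples.
chunk_len, stride_left, stride_right = 15, 3, 0
step = chunk_len - stride_left - stride_right

starts = list(range(0, 40, step))
print(starts)  # [0, 12, 24, 36]

# Overlap between chunk 0 ([0, 15)) and chunk 1 ([12, 27)).
overlap = (starts[0] + chunk_len) - starts[1]
print(overlap)  # 3
```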
cc @sanchit-gandhi