
Wrong calculation of the step size for overlapping inference in the Distil-Whisper model

Open systemdevart opened this issue 1 year ago • 2 comments

System Info

  • transformers version: 4.39.0.dev0
  • Platform: Linux-5.15.0-48-generic-x86_64-with-glibc2.35
  • Python version: 3.10.12
  • Huggingface_hub version: 0.20.2
  • Safetensors version: 0.4.1
  • Accelerate version: 0.27.2
  • Accelerate config: not found
  • PyTorch version (GPU?): 2.1.0a0+32f93b1 (True)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using GPU in script?: yes
  • Using distributed or parallel set-up in script?: no

Who can help?

@sanchit-gandhi @Narsil

Information

  • [X] The official example scripts
  • [ ] My own modified scripts

Tasks

  • [X] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • [ ] My own task or dataset (give details below)

Reproduction

import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline
from datasets import load_dataset

# Select GPU if available; use fp16 on GPU, fp32 on CPU
device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

model_id = "distil-whisper/distil-large-v2"

model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True
)
model.to(device)

processor = AutoProcessor.from_pretrained(model_id)

# Chunked long-form inference: 15 s chunks with overlapping strides
pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    max_new_tokens=128,
    chunk_length_s=15,
    batch_size=16,
    torch_dtype=torch_dtype,
    device=device,
)

# Load a long-form audio sample and run chunked transcription
dataset = load_dataset("distil-whisper/librispeech_long", "clean", split="validation")
sample = dataset[0]["audio"]
result = pipe(sample)
print(result["text"])

Expected behavior

In this line of the code https://github.com/huggingface/transformers/blob/0290ec19c901adc0f1230ebdccad11c40af026f5/src/transformers/pipelines/automatic_speech_recognition.py#L62, the step size for overlapping inference is calculated incorrectly: it produces a total overlap of 2 * (stride_left + stride_right) per chunk, whereas the intended overlap should be stride_left + stride_right, with stride_left overlapping the left chunk and stride_right overlapping the right chunk.

Consider, for example, chunk_len = 15, stride_left = 3, and stride_right = 3, giving step = 15 - 3 - 3 = 9:

  • chunk 0: start=0, end=15
  • chunk 1: start=9, end=24
  • chunk 2: start=18, end=33

For chunk 1, the overlap with chunk 0 is 6 and the overlap with chunk 2 is 6, for a total overlap of 12; but the intended overlap was 3 with the left chunk and 3 with the right chunk, for a total of 6.
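
To make the arithmetic concrete, here is a minimal sketch of the window computation described above (variable names mirror the pipeline code; the three-chunk horizon is chosen just for illustration):

chunk_len, stride_left, stride_right = 15, 3, 3
step = chunk_len - stride_left - stride_right  # 15 - 3 - 3 = 9

# First three chunk windows produced by this step size
windows = [(i * step, i * step + chunk_len) for i in range(3)]
print(windows)  # [(0, 15), (9, 24), (18, 33)]

# Overlap between successive windows
for (start0, end0), (start1, end1) in zip(windows, windows[1:]):
    print(end0 - start1)  # 6 each time, i.e. stride_left + stride_right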

systemdevart avatar Mar 11 '24 19:03 systemdevart

cc @kamilakesbi

amyeroberts avatar May 07 '24 08:05 amyeroberts

Hi @systemdevart,

Thank you for this question!

Here, stride_left denotes the overlap between the current chunk and the left chunk once you account for the fact that the last stride_right samples of the left chunk have already been discarded.

This image from this blog post shows it visually:

[Screenshot: successive chunks whose left and right strides overlap the neighbouring chunks]

In this image, both stride_left and stride_right are set to 3, resulting in a total overlap of 6 between successive chunks.

If you want an overlap of 3 between successive chunks, you could, for example, set stride_left to 3 and stride_right to 0.
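
For reference, a minimal sketch of how asymmetric strides can be requested through the pipeline's stride_length_s argument (a single value or a [left, right] pair, in seconds); the construction mirrors the reproduction script above:

from transformers import pipeline

# stride_length_s=[3, 0]: 3 s overlap with the left chunk, none with the right,
# giving a total overlap of 3 s between successive chunks
pipe = pipeline(
    "automatic-speech-recognition",
    model="distil-whisper/distil-large-v2",
    chunk_length_s=15,
    stride_length_s=[3, 0],
)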

cc @sanchit-gandhi

kamilakesbi avatar May 10 '24 08:05 kamilakesbi