
Wrong calculation of the step size for overlapping inference in the Distil-Whisper model

Open systemdevart opened this issue 1 year ago • 2 comments

System Info

  • transformers version: 4.39.0.dev0
  • Platform: Linux-5.15.0-48-generic-x86_64-with-glibc2.35
  • Python version: 3.10.12
  • Huggingface_hub version: 0.20.2
  • Safetensors version: 0.4.1
  • Accelerate version: 0.27.2
  • Accelerate config: not found
  • PyTorch version (GPU?): 2.1.0a0+32f93b1 (True)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using GPU in script?: yes
  • Using distributed or parallel set-up in script?: no

Who can help?

@sanchit-gandhi @Narsil

Information

  • [X] The official example scripts
  • [ ] My own modified scripts

Tasks

  • [X] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • [ ] My own task or dataset (give details below)

Reproduction

import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline
from datasets import load_dataset

# Select GPU if available; use fp16 on GPU, fp32 on CPU
device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

model_id = "distil-whisper/distil-large-v2"

model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True
)
model.to(device)

processor = AutoProcessor.from_pretrained(model_id)

# Chunked long-form inference: 15 s chunks with overlapping strides
pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    max_new_tokens=128,
    chunk_length_s=15,
    batch_size=16,
    torch_dtype=torch_dtype,
    device=device,
)

# Load a long-form audio sample and run chunked transcription
dataset = load_dataset("distil-whisper/librispeech_long", "clean", split="validation")
sample = dataset[0]["audio"]
result = pipe(sample)
print(result["text"])

Expected behavior

In this line of the code https://github.com/huggingface/transformers/blob/0290ec19c901adc0f1230ebdccad11c40af026f5/src/transformers/pipelines/automatic_speech_recognition.py#L62, the step size for overlapping inference is calculated incorrectly: it produces a total overlap of 2 * (stride_left + stride_right) per chunk, whereas the intended overlap should be stride_left + stride_right, with stride_left overlapping the left chunk and stride_right overlapping the right chunk.

Consider, for example, chunk_len = 15, stride_left = 3, and stride_right = 3, giving step = 15 - 3 - 3 = 9:

  • chunk 0: start=0, end=15
  • chunk 1: start=9, end=24
  • chunk 2: start=18, end=33

For chunk 1, the overlap with chunk 0 is 6 and the overlap with chunk 2 is 6, for a total overlap of 12; but the intended overlap was 3 with the left chunk and 3 with the right chunk, for a total of 6.
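
To make the arithmetic concrete, here is a minimal sketch of the window computation described above (variable names mirror the pipeline code; the three-chunk horizon is chosen just for illustration):

chunk_len, stride_left, stride_right = 15, 3, 3
step = chunk_len - stride_left - stride_right  # 15 - 3 - 3 = 9

# First three chunk windows produced by this step size
windows = [(i * step, i * step + chunk_len) for i in range(3)]
print(windows)  # [(0, 15), (9, 24), (18, 33)]

# Overlap between successive windows
for (start0, end0), (start1, end1) in zip(windows, windows[1:]):
    print(end0 - start1)  # 6 each time, i.e. stride_left + stride_right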

systemdevart avatar Mar 11 '24 19:03 systemdevart

cc @kamilakesbi

amyeroberts avatar May 07 '24 08:05 amyeroberts

Hi @systemdevart,

Thank you for this question!

Here, stride_left denotes the overlap between the current chunk and the left chunk once you account for the fact that the last stride_right samples of the left chunk have already been discarded.

This image from this blog post shows it visually:

[Screenshot: successive chunks whose left and right strides overlap the neighbouring chunks]

In this image, both stride_left and stride_right are set to 3, resulting in a total overlap of 6 between successive chunks.

If you want an overlap of 3 between successive chunks, you could, for example, set stride_left to 3 and stride_right to 0.
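
For reference, a minimal sketch of how asymmetric strides can be requested through the pipeline's stride_length_s argument (a single value or a [left, right] pair, in seconds); the construction mirrors the reproduction script above:

from transformers import pipeline

# stride_length_s=[3, 0]: 3 s overlap with the left chunk, none with the right,
# giving a total overlap of 3 s between successive chunks
pipe = pipeline(
    "automatic-speech-recognition",
    model="distil-whisper/distil-large-v2",
    chunk_length_s=15,
    stride_length_s=[3, 0],
)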

cc @sanchit-gandhi

kamilakesbi avatar May 10 '24 08:05 kamilakesbi