Whisper.cpp consumes unusually large amounts of system memory when transcribing very long WAV files
Bug Description
I have a pcm_s16le-encoded 16 kHz WAV file (about 3 GiB in size), extracted from a video that's approximately 10 hours long (https://www.twitch.tv/videos/2201134431), which I would like to pass to whisper.cpp for transcription using the tiny model. However, attempts to transcribe this file cause the machine to run out of memory, and the process is either killed or the machine requires a restart.
Machine Specifications
CPU: Intel Core i3-4130T
Memory: 8 GB + 8 GB swap
OS Version: Arch Linux 6.9.9
Whisper.cpp Version: 1.6.2-1
Replication Steps:
- Obtain a very long audio file (for reference, I'm using the audio from this video, which I've converted to a WAV file). To avoid corruption, a 64-bit header was used for the WAV file.
- Attempt to output an SRT/VTT transcription from the file, using the tiny model.
- whisper.cpp eats up all the RAM and swap, and is either killed or effectively crashes the machine.
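The "64-bit header" mentioned above is presumably the RF64 extension that ffmpeg writes for WAV output exceeding 4 GiB (the report doesn't name the exact variant, so that's an assumption). A minimal sketch for checking which header a given file actually carries:

```python
# Check whether a WAV file uses the classic RIFF header or a 64-bit
# variant (RF64 / Sony Wave64), which whisper.cpp's WAV reader may
# reject. Mapping "64-bit header" to RF64 is my assumption.

def wav_header_kind(path):
    with open(path, "rb") as f:
        magic = f.read(4)
    if magic == b"RIFF":
        return "riff"     # classic WAV; 32-bit sizes, ~4 GiB limit
    if magic == b"RF64":
        return "rf64"     # EBU RF64; 64-bit sizes
    if magic == b"riff":
        return "wave64"   # Sony Wave64 starts with a lowercase GUID
    return "unknown"
```

Running this over the extracted file would confirm whether whisper.cpp is being handed a header format it does not parse.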
Alternatives/Workarounds Tried
One workaround that seems to work is to break the file up into smaller segments and process the segments individually. However, this requires something more complicated than the relatively simple one or two lines otherwise needed. Splitting by time risks interrupting someone mid-sentence, and splitting by silence requires more complex scripting, assuming a large enough gap to split on even exists. Splitting also causes problems when using whisper.cpp to create subtitles for a video, as the segments' timings would all need to be adjusted when recombined into a single file that can be applied to the original video.
A variant of the above is to split the video file instead of just the extracted audio, which would resolve the timing issue, but that would introduce additional overhead and coding complexity, as the segments would need to have the subtitles added, then be recombined in the same sequence as the original video.
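Recombining per-segment transcripts mostly comes down to shifting every timestamp by the segment's start offset. A rough sketch of that adjustment, assuming plain SRT input (cue renumbering and inter-segment gap handling are omitted):

```python
import re

# Shift all "HH:MM:SS,mmm" timestamps in an SRT fragment by a fixed
# offset in milliseconds, so per-segment transcripts can be
# concatenated back into one file matching the original video.

TS = re.compile(r"(\d{2}):(\d{2}):(\d{2}),(\d{3})")

def shift_srt(text, offset_ms):
    def repl(m):
        h, mi, s, ms = (int(g) for g in m.groups())
        total = ((h * 60 + mi) * 60 + s) * 1000 + ms + offset_ms
        h, rem = divmod(total, 3_600_000)
        mi, rem = divmod(rem, 60_000)
        s, ms = divmod(rem, 1000)
        return f"{h:02}:{mi:02}:{s:02},{ms:03}"
    return TS.sub(repl, text)
```

For example, `shift_srt(segment_text, 3_600_000)` shifts a segment that started one hour into the original recording, so the concatenated output lines up with the unsplit video.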
Throwing my experience in as well: I have a similar issue, even when using a GPU (GTX 1070) for transcription. The GPU is properly utilized once the transcription actually starts, but for long-running audio an immense amount of system memory is consumed in the pre-transcription phase. On the GPU, only about 4.8 GB of VRAM is used.
The video from which the audio is being transcribed is 24 hours long, but was recorded in 160p, so it's only about 2.5 GB in size. I kept running out of RAM and finally got the transcription to succeed after quite a bit of trial and error, ending up with a 41 GB swap file in addition to the 8 GB of system memory. Watching in htop, I can see both memory and swap usage balloon to consume just about all of it.
This much memory usage seems excessive, but then again I have no clue what it's doing under the hood so maybe it's normal.
Model: medium (it doesn't seem to matter much which model I use)
About 30 GB of SSD space is used whenever I transcribe something using the large-v2 model, and the only way to get that space back is to restart my Mac. If there's any way to reclaim that space without restarting, that would be great.
whisper.cpp version 1.7.1
whisper_init_from_file_with_params_no_state: loading model from 'models/ggml-large-v2.bin'
whisper_init_with_params_no_state: use gpu = 1
whisper_init_with_params_no_state: flash attn = 0
whisper_init_with_params_no_state: gpu_device = 0
whisper_init_with_params_no_state: dtw = 0
whisper_model_load: loading model
whisper_model_load: n_vocab = 51865
whisper_model_load: n_audio_ctx = 1500
whisper_model_load: n_audio_state = 1280
whisper_model_load: n_audio_head = 20
whisper_model_load: n_audio_layer = 32
whisper_model_load: n_text_ctx = 448
whisper_model_load: n_text_state = 1280
whisper_model_load: n_text_head = 20
whisper_model_load: n_text_layer = 32
whisper_model_load: n_mels = 80
whisper_model_load: ftype = 1
whisper_model_load: qntvr = 0
whisper_model_load: type = 5 (large)
whisper_model_load: adding 1608 extra tokens
whisper_model_load: n_langs = 99
whisper_model_load: Metal total size = 3093.99 MB
whisper_model_load: model size = 3093.99 MB
whisper_backend_init_gpu: using Metal backend
ggml_metal_init: allocating
ggml_metal_init: found device: Apple M1
ggml_metal_init: picking default device: Apple M1
ggml_metal_init: using embedded metal library
ggml_metal_init: GPU name: Apple M1
ggml_metal_init: GPU family: MTLGPUFamilyApple7 (1007)
ggml_metal_init: GPU family: MTLGPUFamilyCommon3 (3003)
ggml_metal_init: GPU family: MTLGPUFamilyMetal3 (5001)
ggml_metal_init: simdgroup reduction support = true
ggml_metal_init: simdgroup matrix mul. support = true
ggml_metal_init: hasUnifiedMemory = true
ggml_metal_init: recommendedMaxWorkingSetSize = 11453.25 MB
whisper_backend_init: using BLAS backend
whisper_mel_init: n_len = 6000, n_len_org = 6000, n_mel = 80
whisper_init_state: kv self size = 251.66 MB
whisper_init_state: kv cross size = 251.66 MB
whisper_init_state: kv pad size = 7.86 MB
whisper_init_state: loading Core ML model from 'models/ggml-large-v2-encoder.mlmodelc'
whisper_init_state: first run on a device may take a while ...
whisper_init_state: Core ML model loaded
whisper_init_state: compute buffer (conv) = 10.21 MB
whisper_init_state: compute buffer (cross) = 16.93 MB
whisper_init_state: compute buffer (decode) = 215.82 MB
system_info: n_threads = 4 / 8 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | METAL = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | SSSE3 = 0 | VSX = 0 | CUDA = 0 | COREML = 1 | OPENVINO = 0
Any update on this issue?
From my testing with a manually compiled build from GitHub as of 2025-08-27, the high RAM utilisation issue appears to have since been fixed. 64-bit header WAVE files still do not work, but files that stay under the 4 GiB limit now seem to work fine.
Anything larger still requires the audio to be split, as whisper.cpp otherwise fails to detect the file properly, even with headers that make it valid, and terminates.
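For planning splits, the 4 GiB RIFF limit translates into a maximum duration that depends only on the sample format. A quick calculation, assuming the 16 kHz mono 16-bit PCM format used earlier in this thread:

```python
# Maximum audio duration that fits in a classic RIFF WAV (size fields
# are 32-bit, so roughly 4 GiB), for a given uncompressed PCM format.

def max_wav_seconds(sample_rate=16000, channels=1, bytes_per_sample=2):
    limit = 2**32 - 1  # 32-bit RIFF size field
    bytes_per_second = sample_rate * channels * bytes_per_sample
    return limit // bytes_per_second

hours = max_wav_seconds() / 3600  # roughly 37 hours at 16 kHz mono s16le
```

So even a 24-hour recording fits under the RIFF limit at 16 kHz mono 16-bit, while higher sample rates or stereo shrink the ceiling quickly (48 kHz stereo drops it to about 6 hours).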