whisper.cpp Every run with CoreML "first run on a device may take a while ..."

I'm on macOS 14.4.1, and testing out whisper.cpp.

I have compiled binaries with both CoreML and just Metal.

But every single time I run the CoreML version it spends an age loading the coreml model version, and doesn't seem to cache for the next run

whisper_init_state: loading Core ML model from 'models/ggml-small.en-encoder.mlmodelc'
whisper_init_state: first run on a device may take a while ...

Runs 1-3 with command

./main -m models/ggml-small.en.bin -f samples/jfk.wav

whisper_print_timings:    total time = 25367.44 ms
whisper_print_timings:    total time = 25146.03 ms
whisper_print_timings:    total time = 25412.08 ms

Not sure what I'm doing wrong, but I was expecting it to cache whatever it was doing so subsequent runs were quicker.

May 06 '24 13:05 tarasis

I'm playing with whisper.cpp today and was impressed by the result so I wanted to compare the large model and the large one using CoreML. I'm on MacOS 14.4.1 (M1 Max) and every time I run my command it says first run, maybe I did something wrong so I'll write the whole process I followed:

1 - ./models/generate-coreml-model.sh large-v3 To generate the coreml model 2 - make large-v3 WHISPER_COREML=1 make -j To build whisper with the coreml support 3 - ./main -m models/ggml-large-v3.bin -f ./samples/myaudio.wav The command used to run the model on my audio

I tried the same command 5 times in a row and every time I had the first time message appear:

whisper_init_state: loading Core ML model from 'models/ggml-large-v3-encoder.mlmodelc'
whisper_init_state: first run on a device may take a while ...

Here are the results:

1 --> whisper_print_timings:    total time = 124125.69 ms
2 --> whisper_print_timings:    total time = 123887.73 ms
3 --> whisper_print_timings:    total time = 123487.95 ms
4 --> whisper_print_timings:    total time = 123119.12 ms
5 --> whisper_print_timings:    total time = 123199.10 ms

Example of full timings for one of the run:
whisper_print_timings:     load time =  1048.37 ms
whisper_print_timings:     fallbacks =   0 p /   0 h
whisper_print_timings:      mel time =     9.26 ms
whisper_print_timings:   sample time =    52.25 ms /   180 runs (    0.29 ms per run)
whisper_print_timings:   encode time =  1486.21 ms /     1 runs ( 1486.21 ms per run)
whisper_print_timings:   decode time =     0.00 ms /     1 runs (    0.00 ms per run)
whisper_print_timings:   batchd time =  1431.58 ms /   178 runs (    8.04 ms per run)
whisper_print_timings:   prompt time =     0.00 ms /     1 runs (    0.00 ms per run)
whisper_print_timings:    total time = 123199.10 ms

Maybe this is the correct behaviour and thus using the coreML version is not worth it on small samples or I'm simply doing something wrong.

May 08 '24 10:05 FrostiizGit

I'm on macOS 14.2.1 and I've run into the same problem.

May 11 '24 04:05 ceberhardt

I'm on 14.4.1 and don't observe this issue - first time is slow, but next runs load fast

May 13 '24 12:05 ggerganov

The Metal backend was incorrectly initialized 2 times, which I support might have caused this problem. Should be fixed on latest master after #2152 was just merged

May 15 '24 06:05 ggerganov

M1 pro. os14.4.1. latest master pull and make with coreml:

base.en 1st run whisper_print_timings: total time = 9904.61 ms -> 2nd run whisper_print_timings: total time = 462.63 ms. 👍
small.en 1,2,3 run no improvement ~ whisper_print_timings: total time = 24493.98 ms 🤷

--edit tried on macos 14.5 too.

May 16 '24 00:05 furyhawk

I pulled the master branch again on the same machine (OS upgrade 14.5) and tried to use CoreML.

The problem is still here for the large-v3 model below are the results (I used same commands and sample as my previous post).

Large v3:
1 -> whisper_print_timings:    total time = 119960.76 ms
2 -> whisper_print_timings:    total time = 119801.12 ms
3 -> whisper_print_timings:    total time = 118888.53 ms
4 -> whisper_print_timings:    total time = 119638.27 ms
5 -> whisper_print_timings:    total time = 119626.38 ms

However when reading the discussion I got curious and decided to test the base model (not base.en model), and it works fine ! Here are the results:

Base model:
1 -> whisper_print_timings:    total time =  9990.66 ms
2 -> whisper_print_timings:    total time =   689.29 ms
3 -> whisper_print_timings:    total time =   691.73 ms
4 -> whisper_print_timings:    total time =   689.75 ms
5 -> whisper_print_timings:    total time =   689.24 ms

Jul 07 '24 10:07 FrostiizGit

Not sure - maybe the OS somehow determines automatically when to cache or not to cache based on the size of the model

Jul 08 '24 10:07 ggerganov