Cannot compile for BLAS CPU support via Intel MKL
When trying to compile with Intel MKL, there is a CMake error.
Pop!_OS 22.04, oneMKL 2024.2
The output:
/whisper.cpp_testing/build$ cmake -DWHISPER_MKL=ON ..
-- OpenMP found
-- Warning: ccache not found - consider installing it for faster compilation or disable this warning with GGML_CCACHE=OFF
-- CMAKE_SYSTEM_PROCESSOR: x86_64
-- x86 detected
-- Configuring done
CMake Error at src/CMakeLists.txt:104 (add_library):
  Target "whisper" links to target "MKL::MKL" but the target was not found.
  Perhaps a find_package() call is missing for an IMPORTED target, or an
  ALIAS target is missing?
-- Generating done
CMake Generate step failed.  Build files cannot be regenerated correctly.
I have the same problem when trying to compile on Windows 10.
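For what it's worth, the MKL::MKL target only exists after a successful find_package(MKL CONFIG) call, and that in turn needs the oneAPI environment loaded so CMake can locate MKLConfig.cmake. A minimal sketch of a clean configure under that assumption (default oneAPI install prefix assumed; adjust the path for your system):

```shell
# Load the oneAPI environment so CMake can find MKLConfig.cmake
# (default install prefix assumed; adjust for your system)
source /opt/intel/oneapi/setvars.sh

# Configure from a clean build directory so no stale cache is reused
cd whisper.cpp && rm -rf build && mkdir build && cd build
cmake -DWHISPER_MKL=ON ..
cmake --build . --config Release
```

If setvars.sh was not sourced in the shell that runs cmake, the IMPORTED target is never created and the exact error above appears.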
After changing src/CMakeLists.txt at line 140 to the following:
# if (WHISPER_MKL)
# target_link_libraries(whisper PRIVATE MKL::MKL)
# endif()
if (WHISPER_MKL)
    find_package(MKL CONFIG REQUIRED PATHS $ENV{MKLROOT})
    message(STATUS "Imported oneMKL targets: ${MKL_IMPORTED_TARGETS}")
    set(WHISPER_EXTRA_FLAGS ${WHISPER_EXTRA_FLAGS} -DGGML_USE_OPENBLAS)
    set(WHISPER_EXTRA_FLAGS ${WHISPER_EXTRA_FLAGS} -DGGML_BLAS_USE_MKL)
    target_link_libraries(whisper PRIVATE MKL::MKL)
endif()
and with the MKLROOT and ONEAPI_ROOT environment variables set, I was able to build, but I'm not sure whether MKL support actually works. I don't see any improvement, but maybe that's due to my CPU, or maybe this configuration isn't enough.
@lukaskwkw I tested your code. Unfortunately, unless it is MKL's fault for not bringing any improvement, this code may simply not be working, in my opinion (but why? I don't see any logical issue with it).
With a normal make build I need ~30 s to transcribe the sample file, and your modified MKL build takes about the same, but with OpenBLAS I get a significantly better ~20 s result.
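One thing worth checking: main prints a system_info line whose BLAS field shows whether ggml was actually compiled with a BLAS backend. In the OpenBLAS logs it reads BLAS = 1, while in the modified-MKL and plain-make logs it reads BLAS = 0, which may mean the MKL build linked the library but never enabled the BLAS code path. A quick way to check without reading the whole log (model and sample paths are just examples):

```shell
# Print only the BLAS flag from whisper.cpp's system_info line;
# "BLAS = 1" means ggml was compiled with a BLAS backend enabled
./main -m /mnt/Deb-Data/ggml-medium.bin -f samples/jfk.wav 2>&1 | grep -o 'BLAS = [01]'
```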
Here's the log (OpenBLAS):
$./main -m /mnt/Deb-Data/ggml-medium.bin -f /mnt/Deb-Data/whisper.cpp/samples/jfk.wav
whisper_init_from_file_with_params_no_state: loading model from '/mnt/Deb-Data/ggml-medium.bin'
whisper_init_with_params_no_state: use gpu = 1
whisper_init_with_params_no_state: flash attn = 0
whisper_init_with_params_no_state: gpu_device = 0
whisper_init_with_params_no_state: dtw = 0
whisper_model_load: loading model
whisper_model_load: n_vocab = 51865
whisper_model_load: n_audio_ctx = 1500
whisper_model_load: n_audio_state = 1024
whisper_model_load: n_audio_head = 16
whisper_model_load: n_audio_layer = 24
whisper_model_load: n_text_ctx = 448
whisper_model_load: n_text_state = 1024
whisper_model_load: n_text_head = 16
whisper_model_load: n_text_layer = 24
whisper_model_load: n_mels = 80
whisper_model_load: ftype = 1
whisper_model_load: qntvr = 0
whisper_model_load: type = 4 (medium)
whisper_model_load: adding 1608 extra tokens
whisper_model_load: n_langs = 99
whisper_model_load: CPU total size = 1533.14 MB
whisper_model_load: model size = 1533.14 MB
whisper_backend_init: using BLAS backend
whisper_init_state: kv self size = 150.99 MB
whisper_init_state: kv cross size = 150.99 MB
whisper_init_state: kv pad size = 6.29 MB
whisper_init_state: compute buffer (conv) = 28.55 MB
whisper_init_state: compute buffer (encode) = 594.09 MB
whisper_init_state: compute buffer (cross) = 7.72 MB
whisper_init_state: compute buffer (decode) = 141.96 MB
system_info: n_threads = 4 / 12 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | METAL = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | CUDA = 0 | COREML = 0 | OPENVINO = 0 | CANN = 0
main: processing '/mnt/Deb-Data/whisper.cpp/samples/jfk.wav' (176000 samples, 11.0 sec), 4 threads, 1 processors, 5 beams + best of 5, lang = en, task = transcribe, timestamps = 1 ...
[00:00:00.000 --> 00:00:11.000] And so my fellow Americans, ask not what your country can do for you, ask what you can do for your country.
whisper_print_timings: load time = 617.11 ms
whisper_print_timings: fallbacks = 0 p / 0 h
whisper_print_timings: mel time = 15.35 ms
whisper_print_timings: sample time = 71.95 ms / 140 runs ( 0.51 ms per run)
whisper_print_timings: encode time = 14713.41 ms / 1 runs (14713.41 ms per run)
whisper_print_timings: decode time = 129.40 ms / 2 runs ( 64.70 ms per run)
whisper_print_timings: batchd time = 3077.36 ms / 136 runs ( 22.63 ms per run)
whisper_print_timings: prompt time = 0.00 ms / 1 runs ( 0.00 ms per run)
whisper_print_timings: total time = 18766.82 ms
$ ./main -m /mnt/Deb-Data/ggml-medium.bin -f /mnt/Deb-Data/whisper.cpp/samples/jfk.wav
whisper_init_from_file_with_params_no_state: loading model from '/mnt/Deb-Data/ggml-medium.bin'
whisper_init_with_params_no_state: use gpu = 1
whisper_init_with_params_no_state: flash attn = 0
whisper_init_with_params_no_state: gpu_device = 0
whisper_init_with_params_no_state: dtw = 0
whisper_model_load: loading model
whisper_model_load: n_vocab = 51865
whisper_model_load: n_audio_ctx = 1500
whisper_model_load: n_audio_state = 1024
whisper_model_load: n_audio_head = 16
whisper_model_load: n_audio_layer = 24
whisper_model_load: n_text_ctx = 448
whisper_model_load: n_text_state = 1024
whisper_model_load: n_text_head = 16
whisper_model_load: n_text_layer = 24
whisper_model_load: n_mels = 80
whisper_model_load: ftype = 1
whisper_model_load: qntvr = 0
whisper_model_load: type = 4 (medium)
whisper_model_load: adding 1608 extra tokens
whisper_model_load: n_langs = 99
whisper_model_load: CPU total size = 1533.14 MB
whisper_model_load: model size = 1533.14 MB
whisper_backend_init: using BLAS backend
whisper_init_state: kv self size = 150.99 MB
whisper_init_state: kv cross size = 150.99 MB
whisper_init_state: kv pad size = 6.29 MB
whisper_init_state: compute buffer (conv) = 28.55 MB
whisper_init_state: compute buffer (encode) = 594.09 MB
whisper_init_state: compute buffer (cross) = 7.72 MB
whisper_init_state: compute buffer (decode) = 141.96 MB
system_info: n_threads = 4 / 12 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | METAL = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | CUDA = 0 | COREML = 0 | OPENVINO = 0 | CANN = 0
main: processing '/mnt/Deb-Data/whisper.cpp/samples/jfk.wav' (176000 samples, 11.0 sec), 4 threads, 1 processors, 5 beams + best of 5, lang = en, task = transcribe, timestamps = 1 ...
[00:00:00.000 --> 00:00:11.000] And so my fellow Americans, ask not what your country can do for you, ask what you can do for your country.
whisper_print_timings: load time = 519.99 ms
whisper_print_timings: fallbacks = 0 p / 0 h
whisper_print_timings: mel time = 21.61 ms
whisper_print_timings: sample time = 79.31 ms / 140 runs ( 0.57 ms per run)
whisper_print_timings: encode time = 16832.68 ms / 1 runs (16832.68 ms per run)
whisper_print_timings: decode time = 133.07 ms / 2 runs ( 66.53 ms per run)
whisper_print_timings: batchd time = 3968.43 ms / 136 runs ( 29.18 ms per run)
whisper_print_timings: prompt time = 0.00 ms / 1 runs ( 0.00 ms per run)
whisper_print_timings: total time = 21690.02 ms
(modified MKL)
$ ./main -m /mnt/Deb-Data/ggml-medium.bin -f /mnt/Deb-Data/whisper.cpp/samples/jfk.wav
whisper_init_from_file_with_params_no_state: loading model from '/mnt/Deb-Data/ggml-medium.bin'
whisper_init_with_params_no_state: use gpu = 1
whisper_init_with_params_no_state: flash attn = 0
whisper_init_with_params_no_state: gpu_device = 0
whisper_init_with_params_no_state: dtw = 0
whisper_model_load: loading model
whisper_model_load: n_vocab = 51865
whisper_model_load: n_audio_ctx = 1500
whisper_model_load: n_audio_state = 1024
whisper_model_load: n_audio_head = 16
whisper_model_load: n_audio_layer = 24
whisper_model_load: n_text_ctx = 448
whisper_model_load: n_text_state = 1024
whisper_model_load: n_text_head = 16
whisper_model_load: n_text_layer = 24
whisper_model_load: n_mels = 80
whisper_model_load: ftype = 1
whisper_model_load: qntvr = 0
whisper_model_load: type = 4 (medium)
whisper_model_load: adding 1608 extra tokens
whisper_model_load: n_langs = 99
whisper_model_load: CPU total size = 1533.14 MB
whisper_model_load: model size = 1533.14 MB
whisper_init_state: kv self size = 150.99 MB
whisper_init_state: kv cross size = 150.99 MB
whisper_init_state: kv pad size = 6.29 MB
whisper_init_state: compute buffer (conv) = 28.55 MB
whisper_init_state: compute buffer (encode) = 594.09 MB
whisper_init_state: compute buffer (cross) = 7.72 MB
whisper_init_state: compute buffer (decode) = 141.96 MB
system_info: n_threads = 4 / 12 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | METAL = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | CUDA = 0 | COREML = 0 | OPENVINO = 0 | CANN = 0
main: processing '/mnt/Deb-Data/whisper.cpp/samples/jfk.wav' (176000 samples, 11.0 sec), 4 threads, 1 processors, 5 beams + best of 5, lang = en, task = transcribe, timestamps = 1 ...
[00:00:00.000 --> 00:00:11.000] And so my fellow Americans, ask not what your country can do for you, ask what you can do for your country.
whisper_print_timings: load time = 596.47 ms
whisper_print_timings: fallbacks = 0 p / 0 h
whisper_print_timings: mel time = 18.59 ms
whisper_print_timings: sample time = 96.32 ms / 140 runs ( 0.69 ms per run)
whisper_print_timings: encode time = 20833.32 ms / 1 runs (20833.32 ms per run)
whisper_print_timings: decode time = 145.09 ms / 2 runs ( 72.55 ms per run)
whisper_print_timings: batchd time = 4328.37 ms / 136 runs ( 31.83 ms per run)
whisper_print_timings: prompt time = 0.00 ms / 1 runs ( 0.00 ms per run)
whisper_print_timings: total time = 26133.06 ms
$ ./main -m /mnt/Deb-Data/ggml-medium.bin -f /mnt/Deb-Data/whisper.cpp/samples/jfk.wav
whisper_init_from_file_with_params_no_state: loading model from '/mnt/Deb-Data/ggml-medium.bin'
whisper_init_with_params_no_state: use gpu = 1
whisper_init_with_params_no_state: flash attn = 0
whisper_init_with_params_no_state: gpu_device = 0
whisper_init_with_params_no_state: dtw = 0
whisper_model_load: loading model
whisper_model_load: n_vocab = 51865
whisper_model_load: n_audio_ctx = 1500
whisper_model_load: n_audio_state = 1024
whisper_model_load: n_audio_head = 16
whisper_model_load: n_audio_layer = 24
whisper_model_load: n_text_ctx = 448
whisper_model_load: n_text_state = 1024
whisper_model_load: n_text_head = 16
whisper_model_load: n_text_layer = 24
whisper_model_load: n_mels = 80
whisper_model_load: ftype = 1
whisper_model_load: qntvr = 0
whisper_model_load: type = 4 (medium)
whisper_model_load: adding 1608 extra tokens
whisper_model_load: n_langs = 99
whisper_model_load: CPU total size = 1533.14 MB
whisper_model_load: model size = 1533.14 MB
whisper_init_state: kv self size = 150.99 MB
whisper_init_state: kv cross size = 150.99 MB
whisper_init_state: kv pad size = 6.29 MB
whisper_init_state: compute buffer (conv) = 28.55 MB
whisper_init_state: compute buffer (encode) = 594.09 MB
whisper_init_state: compute buffer (cross) = 7.72 MB
whisper_init_state: compute buffer (decode) = 141.96 MB
system_info: n_threads = 4 / 12 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | METAL = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | CUDA = 0 | COREML = 0 | OPENVINO = 0 | CANN = 0
main: processing '/mnt/Deb-Data/whisper.cpp/samples/jfk.wav' (176000 samples, 11.0 sec), 4 threads, 1 processors, 5 beams + best of 5, lang = en, task = transcribe, timestamps = 1 ...
[00:00:00.000 --> 00:00:11.000] And so my fellow Americans, ask not what your country can do for you, ask what you can do for your country.
whisper_print_timings: load time = 585.16 ms
whisper_print_timings: fallbacks = 0 p / 0 h
whisper_print_timings: mel time = 22.14 ms
whisper_print_timings: sample time = 83.46 ms / 140 runs ( 0.60 ms per run)
whisper_print_timings: encode time = 25622.69 ms / 1 runs (25622.69 ms per run)
whisper_print_timings: decode time = 134.98 ms / 2 runs ( 67.49 ms per run)
whisper_print_timings: batchd time = 3890.54 ms / 136 runs ( 28.61 ms per run)
whisper_print_timings: prompt time = 0.00 ms / 1 runs ( 0.00 ms per run)
whisper_print_timings: total time = 30474.16 ms
(original make)
$./main -m /mnt/Deb-Data/ggml-medium.bin -f samples/jfk.wav
whisper_init_from_file_with_params_no_state: loading model from '/mnt/Deb-Data/ggml-medium.bin'
whisper_init_with_params_no_state: use gpu = 1
whisper_init_with_params_no_state: flash attn = 0
whisper_init_with_params_no_state: gpu_device = 0
whisper_init_with_params_no_state: dtw = 0
whisper_model_load: loading model
whisper_model_load: n_vocab = 51865
whisper_model_load: n_audio_ctx = 1500
whisper_model_load: n_audio_state = 1024
whisper_model_load: n_audio_head = 16
whisper_model_load: n_audio_layer = 24
whisper_model_load: n_text_ctx = 448
whisper_model_load: n_text_state = 1024
whisper_model_load: n_text_head = 16
whisper_model_load: n_text_layer = 24
whisper_model_load: n_mels = 80
whisper_model_load: ftype = 1
whisper_model_load: qntvr = 0
whisper_model_load: type = 4 (medium)
whisper_model_load: adding 1608 extra tokens
whisper_model_load: n_langs = 99
whisper_model_load: CPU total size = 1533.14 MB
whisper_model_load: model size = 1533.14 MB
whisper_init_state: kv self size = 150.99 MB
whisper_init_state: kv cross size = 150.99 MB
whisper_init_state: kv pad size = 6.29 MB
whisper_init_state: compute buffer (conv) = 28.55 MB
whisper_init_state: compute buffer (encode) = 594.09 MB
whisper_init_state: compute buffer (cross) = 7.72 MB
whisper_init_state: compute buffer (decode) = 141.96 MB
system_info: n_threads = 4 / 12 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | METAL = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | CUDA = 0 | COREML = 0 | OPENVINO = 0 | CANN = 0
main: processing 'samples/jfk.wav' (176000 samples, 11.0 sec), 4 threads, 1 processors, 5 beams + best of 5, lang = en, task = transcribe, timestamps = 1 ...
[00:00:00.000 --> 00:00:11.000] And so my fellow Americans, ask not what your country can do for you, ask what you can do for your country.
whisper_print_timings: load time = 1585.19 ms
whisper_print_timings: fallbacks = 0 p / 0 h
whisper_print_timings: mel time = 22.25 ms
whisper_print_timings: sample time = 92.06 ms / 140 runs ( 0.66 ms per run)
whisper_print_timings: encode time = 20037.21 ms / 1 runs (20037.21 ms per run)
whisper_print_timings: decode time = 143.14 ms / 2 runs ( 71.57 ms per run)
whisper_print_timings: batchd time = 3736.67 ms / 136 runs ( 27.48 ms per run)
whisper_print_timings: prompt time = 0.00 ms / 1 runs ( 0.00 ms per run)
whisper_print_timings: total time = 25786.16 ms
$ ./main -m /mnt/Deb-Data/ggml-medium.bin -f samples/jfk.wav
whisper_init_from_file_with_params_no_state: loading model from '/mnt/Deb-Data/ggml-medium.bin'
whisper_init_with_params_no_state: use gpu = 1
whisper_init_with_params_no_state: flash attn = 0
whisper_init_with_params_no_state: gpu_device = 0
whisper_init_with_params_no_state: dtw = 0
whisper_model_load: loading model
whisper_model_load: n_vocab = 51865
whisper_model_load: n_audio_ctx = 1500
whisper_model_load: n_audio_state = 1024
whisper_model_load: n_audio_head = 16
whisper_model_load: n_audio_layer = 24
whisper_model_load: n_text_ctx = 448
whisper_model_load: n_text_state = 1024
whisper_model_load: n_text_head = 16
whisper_model_load: n_text_layer = 24
whisper_model_load: n_mels = 80
whisper_model_load: ftype = 1
whisper_model_load: qntvr = 0
whisper_model_load: type = 4 (medium)
whisper_model_load: adding 1608 extra tokens
whisper_model_load: n_langs = 99
whisper_model_load: CPU total size = 1533.14 MB
whisper_model_load: model size = 1533.14 MB
whisper_init_state: kv self size = 150.99 MB
whisper_init_state: kv cross size = 150.99 MB
whisper_init_state: kv pad size = 6.29 MB
whisper_init_state: compute buffer (conv) = 28.55 MB
whisper_init_state: compute buffer (encode) = 594.09 MB
whisper_init_state: compute buffer (cross) = 7.72 MB
whisper_init_state: compute buffer (decode) = 141.96 MB
system_info: n_threads = 4 / 12 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | METAL = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | CUDA = 0 | COREML = 0 | OPENVINO = 0 | CANN = 0
main: processing 'samples/jfk.wav' (176000 samples, 11.0 sec), 4 threads, 1 processors, 5 beams + best of 5, lang = en, task = transcribe, timestamps = 1 ...
[00:00:00.000 --> 00:00:11.000] And so my fellow Americans, ask not what your country can do for you, ask what you can do for your country.
whisper_print_timings: load time = 835.28 ms
whisper_print_timings: fallbacks = 0 p / 0 h
whisper_print_timings: mel time = 18.46 ms
whisper_print_timings: sample time = 83.06 ms / 140 runs ( 0.59 ms per run)
whisper_print_timings: encode time = 23736.47 ms / 1 runs (23736.47 ms per run)
whisper_print_timings: decode time = 125.63 ms / 2 runs ( 62.82 ms per run)
whisper_print_timings: batchd time = 3801.03 ms / 136 runs ( 27.95 ms per run)
whisper_print_timings: prompt time = 0.00 ms / 1 runs ( 0.00 ms per run)
whisper_print_timings: total time = 28728.32 ms
By the way, I think I should also post this (it appears when running cmake -DWHISPER_MKL=ON ..):
MKL_VERSION: 2024.2.0
-- MKL_ROOT: /opt/intel/oneapi/mkl/2024.2
-- MKL_ARCH: None, set to ` intel64` by default
-- MKL_LINK: None, set to ` dynamic` by default
-- MKL_INTERFACE_FULL: None, set to ` intel_ilp64` by default
-- MKL_THREADING: None, set to ` intel_thread` by default
-- MKL_MPI: None, set to ` intelmpi` by default
-- Found /opt/intel/oneapi/mkl/2024.2/lib/libmkl_scalapack_ilp64.so
-- Found /opt/intel/oneapi/mkl/2024.2/lib/libmkl_cdft_core.so
-- Found /opt/intel/oneapi/mkl/2024.2/lib/libmkl_intel_ilp64.so
-- Found /opt/intel/oneapi/mkl/2024.2/lib/libmkl_intel_thread.so
-- Found /opt/intel/oneapi/mkl/2024.2/lib/libmkl_core.so
-- Found /opt/intel/oneapi/mkl/2024.2/lib/libmkl_blacs_intelmpi_ilp64.so
-- Found /opt/intel/oneapi/compiler/2024.2/lib/libiomp5.so
-- Imported oneMKL targets: MKL::mkl_scalapack_ilp64;MKL::mkl_cdft_core;MKL::mkl_intel_ilp64;MKL::mkl_intel_thread;MKL::mkl_core;MKL::mkl_blacs_intelmpi_ilp64;MKL::MKL
With the above modifications I can build, but I don't see any performance difference either.
I tried building with the SYCL, OpenBLAS, and Intel MKL methods, and measured with a quantized model and a short wav sample. Maybe I should try a longer sample file?
CPU: Intel Core Ultra 165U
Hi everybody.
~~Well, @Just-Explode, do you have some "trick" for this? It doesn't work for me either.~~
Hum...
About performance: I believe the defaults are not good enough, so it's worth finding the best threads/processors combination.
I tried an OpenBLAS build; my host has 8 threads / 4 cores.
I used a large-v3-turbo model, with the language forced (chosen) and only SRT output generated.
For a well-recorded audio (a presentation without noise) of 2 min 17 s, the run times were:
- 5 min 57 s using 4 threads, 1 processor (the default);
- 3 min 07 s using 1 thread, 4 processors;
- 3 min 36 s using 2 threads, 2 processors;
- 3 min 01 s using 2 threads, 4 processors.
In the same way, for an audio of 9 min 14 s:
- 23 min 53 s;
- 11 min 09 s;
- 15 min 25 s;
- 10 min 30 s.
I believe the total number of sub-processes is always threads × processors in each case above.
The second point is that whisper.cpp splits the audio into chunks for processing, by exactly
the number of processors you specify at run time.
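As a concrete illustration, the threads/processors combination is set with main's -t and -p flags; an invocation matching the best case above might look like this (model path, file name, and output option are examples):

```shell
# 2 threads per processor, 4 parallel processors,
# forced language, SRT output only
./main -m models/ggml-large-v3-turbo.bin -f talk.wav -t 2 -p 4 -l en -osrt
```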
My conclusions are:
- ~~There is no good scaling for short audio.~~ Processing time is roughly 2.5× real time for the 1st option and about 1.0× for the 2nd;
- Using threads alone is far from ideal. Remember that threads (usually 2 per core) bring only a 30% to 50% performance gain. In the last case (full use of threads and cores), a lot of power was consumed for only a small performance gain;
- The 3rd ~~case~~ option is just a balance between power and performance, which is what I expected.
That seems to be why the default is not good, even for a big host, i.e. one with a lot of cores.
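To search for the best combination on a given host, a small benchmark loop over the two flags can help. This is only a sketch (model and sample paths are placeholders, and GNU time is assumed for the -f format flag):

```shell
# Time each threads/processors combination on the same input
for t in 1 2 4; do
  for p in 1 2 4; do
    printf 'threads=%s processors=%s: ' "$t" "$p"
    /usr/bin/time -f '%e s' ./main -m models/ggml-large-v3-turbo.bin \
        -f samples/talk.wav -t "$t" -p "$p" -osrt > /dev/null
  done
done
```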