
Misc. bug: All llama executables exit immediately without console output

Open Ikaron opened this issue 1 year ago • 12 comments

Name and Version

Multiple. SYCL build as well as CPU-only build from git rev-parse HEAD eb5c3dc64bd967f2e23c87d9dec195f45468de60. Also the prebuilt Windows SYCL binary 3bcd40b3c593d14261fb2abfabad3c0fb5b9e318 (tag b4040). NOTE: the prebuilt binary fb76ec31a9914b7761c1727303ab30380fd4f05c (tag b3038) WORKS!

Operating systems

Windows

Which llama.cpp modules do you know to be affected?

llama-cli, llama-server

Problem description & steps to reproduce

I originally thought this was a problem with the SYCL builds, but I also compiled CPU-only with the same result. Note that "main.exe" from the old SYCL prebuilt works as expected.

Not sure if relevant, but I don't recognise the OpenMP installs that were found.

Steps to reproduce:

  1. Install or build llama.cpp (CPU/SYCL). Build via either:

     ./examples/sycl/win-build-sycl.bat

     or:

     cmake -B build -G "Ninja" -DGGML_SYCL=ON -DCMAKE_C_COMPILER=cl -DCMAKE_CXX_COMPILER=icx -DCMAKE_BUILD_TYPE=Release -DGGML_SYCL_F16=ON
     cmake --build build --config Release -j

     (Also, build warning:) icx: warning: unknown argument ignored in clang-cl: '-machine:x64' [-Wunknown-argument]

  2. Run any llama executable (llama-cli, llama-ls-sycl-device, llama-server, etc.):

     • Use the "Intel oneAPI command prompt for Intel 64 for Visual Studio 2022" command prompt, then run any llama exe, OR
     • Execute "C:\Program Files (x86)\Intel\oneAPI\setvars.bat" intel64, then run any llama exe, OR
     • Run examples\sycl\win-run-llama2.bat

Build config (CPU):

B:\LLM\llama-src\llama.cpp>cmake -B build
-- Building for: Visual Studio 17 2022
-- Selecting Windows SDK version 10.0.22621.0 to target Windows 10.0.22631.
-- The C compiler identification is MSVC 19.41.34120.0
-- The CXX compiler identification is MSVC 19.41.34120.0
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Check for working C compiler: C:/Program Files/Microsoft Visual Studio/2022/Community/VC/Tools/MSVC/14.41.34120/bin/Hostx64/x64/cl.exe - skipped
-- Detecting C compile features
-- Detecting C compile features - done
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Check for working CXX compiler: C:/Program Files/Microsoft Visual Studio/2022/Community/VC/Tools/MSVC/14.41.34120/bin/Hostx64/x64/cl.exe - skipped
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Found Git: C:/Program Files/Git/cmd/git.exe (found version "2.47.1.windows.1")
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD - Failed
-- Looking for pthread_create in pthreads
-- Looking for pthread_create in pthreads - not found
-- Looking for pthread_create in pthread
-- Looking for pthread_create in pthread - not found
-- Found Threads: TRUE
-- Warning: ccache not found - consider installing it for faster compilation or disable this warning with GGML_CCACHE=OFF
-- CMAKE_SYSTEM_PROCESSOR: AMD64
-- CMAKE_GENERATOR_PLATFORM:
-- Including CPU backend
-- Found OpenMP_C: -openmp (found version "2.0")
-- Found OpenMP_CXX: -openmp (found version "2.0")
-- Found OpenMP: TRUE (found version "2.0")
-- x86 detected
-- Performing Test HAS_AVX_1
-- Performing Test HAS_AVX_1 - Success
-- Performing Test HAS_AVX2_1
-- Performing Test HAS_AVX2_1 - Success
-- Performing Test HAS_FMA_1
-- Performing Test HAS_FMA_1 - Success
-- Performing Test HAS_AVX512_1
-- Performing Test HAS_AVX512_1 - Failed
-- Performing Test HAS_AVX512_2
-- Performing Test HAS_AVX512_2 - Failed
-- Adding CPU backend variant ggml-cpu: /arch:AVX2 GGML_AVX2;GGML_FMA;GGML_F16C
-- Configuring done (12.9s)
-- Generating done (0.9s)
-- Build files have been written to: B:/LLM/llama-src/llama.cpp/build

Build config (SYCL):

B:\LLM\llama-src\llama.cpp>cmake -B build -G "Ninja" -DGGML_SYCL=ON -DCMAKE_C_COMPILER=cl -DCMAKE_CXX_COMPILER=icx  -DCMAKE_BUILD_TYPE=Release -DGGML_SYCL_F16=ON
-- The C compiler identification is MSVC 19.41.34120.0
-- The CXX compiler identification is IntelLLVM 2025.0.4 with MSVC-like command-line
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Check for working C compiler: C:/Program Files/Microsoft Visual Studio/2022/Community/VC/Tools/MSVC/14.41.34120/bin/Hostx64/x64/cl.exe - skipped
-- Detecting C compile features
-- Detecting C compile features - done
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Check for working CXX compiler: C:/Program Files (x86)/Intel/oneAPI/compiler/latest/bin/icx.exe - skipped
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Found Git: C:/Program Files/Git/cmd/git.exe (found version "2.47.1.windows.1")
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD - Failed
-- Looking for pthread_create in pthreads
-- Looking for pthread_create in pthreads - not found
-- Looking for pthread_create in pthread
-- Looking for pthread_create in pthread - not found
-- Found Threads: TRUE
-- Warning: ccache not found - consider installing it for faster compilation or disable this warning with GGML_CCACHE=OFF
-- CMAKE_SYSTEM_PROCESSOR: AMD64
-- CMAKE_GENERATOR_PLATFORM:
-- Including CPU backend
-- Found OpenMP_C: -openmp (found version "2.0")
-- Found OpenMP_CXX: -Qiopenmp (found version "5.1")
-- Found OpenMP: TRUE (found version "2.0")
-- x86 detected
-- Performing Test HAS_AVX_1
-- Performing Test HAS_AVX_1 - Success
-- Performing Test HAS_AVX2_1
-- Performing Test HAS_AVX2_1 - Success
-- Performing Test HAS_FMA_1
-- Performing Test HAS_FMA_1 - Success
-- Performing Test HAS_AVX512_1
-- Performing Test HAS_AVX512_1 - Failed
-- Performing Test HAS_AVX512_2
-- Performing Test HAS_AVX512_2 - Failed
-- Adding CPU backend variant ggml-cpu: /arch:AVX2 GGML_AVX2;GGML_FMA;GGML_F16C
-- Performing Test SUPPORTS_SYCL
-- Performing Test SUPPORTS_SYCL - Success
-- Using oneAPI Release SYCL compiler (icpx).
-- SYCL found
-- DNNL found:1
-- Found IntelSYCL: C:/Program Files (x86)/Intel/oneAPI/compiler/latest/include (found version "202001")
-- MKL_VERSION: 2025.0.1
-- MKL_ROOT: C:/Program Files (x86)/Intel/oneAPI/mkl/latest
-- MKL_ARCH: intel64
-- MKL_SYCL_LINK: None, set to ` dynamic` by default
-- MKL_LINK: None, set to ` dynamic` by default
-- MKL_SYCL_INTERFACE_FULL: None, set to ` intel_ilp64` by default
-- MKL_INTERFACE_FULL: None, set to ` intel_ilp64` by default
-- MKL_SYCL_THREADING: None, set to ` tbb_thread` by default
-- MKL_THREADING: None, set to ` intel_thread` by default
-- MKL_MPI: None, set to ` intelmpi` by default
-- Found C:/Program Files (x86)/Intel/oneAPI/mkl/latest/lib/mkl_scalapack_ilp64_dll.lib
-- Found DLL: C:/Program Files (x86)/Intel/oneAPI/mkl/latest/bin/mkl_scalapack_ilp64.2.dll
-- Found C:/Program Files (x86)/Intel/oneAPI/mkl/latest/lib/mkl_cdft_core_dll.lib
-- Found DLL: C:/Program Files (x86)/Intel/oneAPI/mkl/latest/bin/mkl_cdft_core.2.dll
-- Found C:/Program Files (x86)/Intel/oneAPI/mkl/latest/lib/mkl_intel_ilp64_dll.lib
-- Found C:/Program Files (x86)/Intel/oneAPI/mkl/latest/lib/mkl_intel_thread_dll.lib
-- Found DLL: C:/Program Files (x86)/Intel/oneAPI/mkl/latest/bin/mkl_intel_thread.2.dll
-- Found C:/Program Files (x86)/Intel/oneAPI/mkl/latest/lib/mkl_core_dll.lib
-- Found DLL: C:/Program Files (x86)/Intel/oneAPI/mkl/latest/bin/mkl_core.2.dll
-- Found C:/Program Files (x86)/Intel/oneAPI/mkl/latest/lib/mkl_blacs_ilp64_dll.lib
-- Found DLL: C:/Program Files (x86)/Intel/oneAPI/mkl/latest/bin/mkl_blacs_ilp64.2.dll
-- Found C:/Program Files (x86)/Intel/oneAPI/mkl/latest/lib/mkl_sycl_blas_dll.lib
-- Found DLL: C:/Program Files (x86)/Intel/oneAPI/mkl/latest/bin/mkl_sycl_blas.5.dll
-- Found C:/Program Files (x86)/Intel/oneAPI/mkl/latest/lib/mkl_sycl_lapack_dll.lib
-- Found DLL: C:/Program Files (x86)/Intel/oneAPI/mkl/latest/bin/mkl_sycl_lapack.5.dll
-- Found C:/Program Files (x86)/Intel/oneAPI/mkl/latest/lib/mkl_sycl_dft_dll.lib
-- Found DLL: C:/Program Files (x86)/Intel/oneAPI/mkl/latest/bin/mkl_sycl_dft.5.dll
-- Found C:/Program Files (x86)/Intel/oneAPI/mkl/latest/lib/mkl_sycl_sparse_dll.lib
-- Found DLL: C:/Program Files (x86)/Intel/oneAPI/mkl/latest/bin/mkl_sycl_sparse.5.dll
-- Found C:/Program Files (x86)/Intel/oneAPI/mkl/latest/lib/mkl_sycl_data_fitting_dll.lib
-- Found DLL: C:/Program Files (x86)/Intel/oneAPI/mkl/latest/bin/mkl_sycl_data_fitting.5.dll
-- Found C:/Program Files (x86)/Intel/oneAPI/mkl/latest/lib/mkl_sycl_rng_dll.lib
-- Found DLL: C:/Program Files (x86)/Intel/oneAPI/mkl/latest/bin/mkl_sycl_rng.5.dll
-- Found C:/Program Files (x86)/Intel/oneAPI/mkl/latest/lib/mkl_sycl_stats_dll.lib
-- Found DLL: C:/Program Files (x86)/Intel/oneAPI/mkl/latest/bin/mkl_sycl_stats.5.dll
-- Found C:/Program Files (x86)/Intel/oneAPI/mkl/latest/lib/mkl_sycl_vm_dll.lib
-- Found DLL: C:/Program Files (x86)/Intel/oneAPI/mkl/latest/bin/mkl_sycl_vm.5.dll
-- Found C:/Program Files (x86)/Intel/oneAPI/mkl/latest/lib/mkl_tbb_thread_dll.lib
-- Found DLL: C:/Program Files (x86)/Intel/oneAPI/mkl/latest/bin/mkl_tbb_thread.2.dll
-- Found C:/Program Files (x86)/Intel/oneAPI/compiler/latest/lib/libiomp5md.lib
-- Including SYCL backend
-- Configuring done (10.0s)
-- Generating done (0.2s)
-- Build files have been written to: B:/LLM/llama-src/llama.cpp/build

First Bad Commit

No response

Relevant log output

B:\LLM\llama-src\llama.cpp\build\bin> llama-cli --version

B:\LLM\llama-src\llama.cpp\build\bin>

B:\LLM\llama-src\llama.cpp\build\bin> llama-ls-sycl-device

B:\LLM\llama-src\llama.cpp\build\bin>

etc.

Ikaron · Dec 21 '24 08:12

What happens when you append --log-verbose to llama-server?

Unfortunately, I don't have a windows machine to test.

qnixsynapse · Dec 22 '24 04:12

What happens when you append --log-verbose to llama-server?

Unfortunately, I don't have a windows machine to test.

Sorry for the duplicated issue https://github.com/ggerganov/llama.cpp/issues/10944

--log-verbose has no effect. I don't think the code advances to the point where it can parse command-line parameters; it fails much earlier.

Also, some binaries do work, such as llama-gguf.exe.

waltersamson · Dec 22 '24 14:12

If you are able to share a stack trace, it would be very helpful. The stack trace would allow us to pinpoint the issue.

qnixsynapse · Dec 23 '24 04:12

I had a similar problem a while ago; if I remember correctly, I had to reinstall the Microsoft Visual C++ redistributable. Not sure it's the same here, but you can try.

easyfab · Dec 23 '24 12:12

Can you check if #10960 solves this problem?

slaren · Dec 23 '24 17:12

Can you check if #10960 solves this problem?

Now that I see b4388 contains this commit, I tried llama-b4388-bin-win-avx2-x64.zip, and unfortunately no, it still doesn't work.

I tried running b4388 llama-server.exe with NtTrace, and got something; not sure if it'll help: nttrace.txt

waltersamson · Dec 24 '24 09:12

@Ikaron Maybe you missed activating the oneAPI runtime.

Could you run the following command before executing the tool?

call "c:\Program Files (x86)\Intel\oneAPI\setvars.bat"

NeoZhangJianyu · Dec 27 '24 06:12

I had a similar problem a while ago; if I remember correctly, I had to reinstall the Microsoft Visual C++ redistributable. Not sure it's the same here, but you can try.

I had the same situation too, and reinstalling the MSVC++ redistributable resolved it, even though I had Visual Studio 2022 Community installed.

hajimeg3 · Dec 28 '24 13:12

I updated vcredist 2015-2022 to version 14.42.34433.0 and it works for me now.
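(For anyone checking what they have installed: assuming the registry location Microsoft documents for the 2015-2022 x64 runtime, the version can be read like this; the key path is from Microsoft's docs, not from this thread:)

reg query "HKLM\SOFTWARE\Microsoft\VisualStudio\14.0\VC\Runtimes\x64" /v Version
rem Prints a string such as "Version REG_SZ v14.42.34433.0"; a runtime
rem older than the toolset used to build llama.cpp can make the binaries
rem exit without output when launched from a console.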

waltersamson · Jan 01 '25 00:01

I had the same situation too, and reinstalling the MSVC++ redistributable resolved it, even though I had Visual Studio 2022 Community installed.

I updated vcredist 2015-2022 to version 14.42.34433.0 and it works for me now.

This works like a charm. Thanks a lot

bqhuyy · Jan 03 '25 11:01

Hey, I tried updating the vcredist, but I've also gone through the dump (it is a crash) and found the offending line to be ggml/src/ggml-sycl/dpct/helper.hpp line 953:

if (result.empty()) throw std::runtime_error("can not find preferred GPU platform");

Basically, no SYCL-compatible GPU is found, but this doesn't produce any command-line output, leading to the observed behaviour. Some text like "No GPU found" would be very helpful here. The runtime error that is thrown never ends up showing in the console.
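(A minimal sketch of the kind of diagnostic that would help; only the quoted throw line is from the actual source, the wrapping and message are hypothetical:)

// hypothetical change around ggml/src/ggml-sycl/dpct/helper.hpp:953
// (requires <cstdio>): print to stderr before throwing, because on
// Windows the uncaught exception otherwise kills the process with no
// console output at all.
if (result.empty()) {
    std::fprintf(stderr,
                 "ggml-sycl: no SYCL-compatible GPU platform found; "
                 "check GPU drivers and the oneAPI runtime\n");
    throw std::runtime_error("can not find preferred GPU platform");
}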

Ikaron · Feb 03 '25 10:02

Same problem, I've updated everything I can but nothing helps.

libid0nes · Feb 09 '25 09:02

This issue was closed because it has been inactive for 14 days since being marked as stale.

github-actions[bot] · Mar 26 '25 01:03

llama-cli --list-devices (with or without GGML_NO_BACKTRACE=1):

ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: yes
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce 940MX, compute capability 5.0, VMM: yes
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
0x00007f4cfd5107e3 in __GI___wait4 (pid=181299, stat_loc=0x0, options=0, usage=0x0) at ../sysdeps/unix/sysv/linux/wait4.c:30
warning: 30	../sysdeps/unix/sysv/linux/wait4.c: No such file or directory
#0  0x00007f4cfd5107e3 in __GI___wait4 (pid=181299, stat_loc=0x0, options=0, usage=0x0) at ../sysdeps/unix/sysv/linux/wait4.c:30
30	in ../sysdeps/unix/sysv/linux/wait4.c
#1  0x00007f4d008eb45a in ggml_print_backtrace () from sources/llama.cpp/build/bin/libggml-base.so
#2  0x00007f4d00902076 in ggml_uncaught_exception() () from sources/llama.cpp/build/bin/libggml-base.so
#3  0x00007f4cfd8bb0da in ?? () from /lib/x86_64-linux-gnu/libstdc++.so.6
#4  0x00007f4cfd8a5a55 in std::terminate() () from /lib/x86_64-linux-gnu/libstdc++.so.6
#5  0x00007f4cfd8bb391 in __cxa_throw () from /lib/x86_64-linux-gnu/libstdc++.so.6
#6  0x00007f4cfdc5d57d in dpct::dev_mgr::dev_mgr() () from sources/llama.cpp/build/bin/libggml-sycl.so
#7  0x00007f4cfdc32b60 in ggml_sycl_init() () from sources/llama.cpp/build/bin/libggml-sycl.so
#8  0x00007f4cfdc3558d in ggml_backend_sycl_reg () from sources/llama.cpp/build/bin/libggml-sycl.so
#9  0x00007f4d00b23123 in ggml_backend_registry::ggml_backend_registry() () from sources/llama.cpp/build/bin/libggml.so
#10 0x00007f4d00b21565 in ggml_backend_load_best(char const*, bool, char const*) () from sources/llama.cpp/build/bin/libggml.so
#11 0x00007f4d00b200c1 in ggml_backend_load_all_from_path () from sources/llama.cpp/build/bin/libggml.so
#12 0x0000000000429f71 in common_params_parser_init(common_params&, llama_example, void (*)(int, char**)) ()
#13 0x00000000004276c2 in common_params_parse(int, char**, common_params&, llama_example, void (*)(int, char**)) ()
#14 0x000000000041a8d0 in main ()

[Inferior 1 (process 181288) detached]
terminate called after throwing an instance of 'std::runtime_error'
  what():  can not find preferred GPU platform
Aborted (core dumped)

(compiled using -DGGML_CUDA=1 -DCMAKE_CUDA_ARCHITECTURES=native -DGGML_CUDA_ENABLE_UNIFIED_MEMORY=ON -DGGML_CUDA_FORCE_CUBLAS=ON -DLLAMA_BUILD_SERVER=ON -DGGML_SYCL=ON -DGGML_BACKEND_DL=OFF)

I want to offload as much as possible to a 2GB CUDA Nvidia 940MX (cap 5.0) and secondarily to an HD Graphics 620 (SYCL being apparently the way to go, after pulling 5GB of Intel oneAPI toolchain stuff).

$ bin/sycl-ls
[opencl:cpu][opencl:0] Portable Computing Language, cpu-haswell-Intel(R) Core(TM) i5-7200U CPU @ 2.50GHz OpenCL 3.0 PoCL HSTR: cpu-x86_64-pc-linux-gnu-haswell [5.0+debian]
[opencl:cpu][opencl:1] Intel(R) OpenCL, Intel(R) Core(TM) i5-7200U CPU @ 2.50GHz OpenCL 3.0 (Build 0) [2025.20.6.0.04_224945]
[cuda:gpu][cuda:0] NVIDIA CUDA BACKEND, NVIDIA GeForce 940MX 5.0 [CUDA 12.2]

So:

  • I do have two SYCL devices (one CUDA and one OpenCL)
  • The CUDA detection goes well
  • But the SYCL one fails to detect the card and throws "can not find preferred GPU platform"

I would expect it to find (and use) the HD Graphics 620. Fun fact: even llama-cli --help fails (due to the same code path, ggml_backend_load_all_from_path()).

BTW, I used sycl-ls from https://github.com/intel/llvm/releases/download/nightly-2024-06-03/sycl_linux.tar.gz since ./build/bin/llama-ls-sycl-device aborts the same way llama-cli does.

drzraf · Jul 17 '25 20:07

Hello @drzraf, your issue doesn't seem related to the original issue here since the solution was found. Please create a new issue in the future to make the discussion easier.

sycl-ls doesn't list the iGPU so you are probably missing the level_zero dependency. I would suggest having a look at https://www.intel.com/content/www/us/en/developer/articles/guide/installation-guide-for-oneapi-toolkits.html to install the GPU drivers for your OS. You may also want to have a look at the product page of your GPU: https://www.intel.com/content/www/us/en/support/products/96551/graphics/processor-graphics/intel-hd-graphics-family/intel-hd-graphics-620.html Note this GPU is discontinued, so I am not sure you will be able to get level_zero working.
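(For reference, when level_zero is working, sycl-ls prints an entry along these lines; the device string below is illustrative, not output from this thread:)

[level_zero:gpu][level_zero:0] Intel(R) Level-Zero, Intel(R) HD Graphics 620 1.3 [1.3.xxxxx]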

You seem to want to use 2 GPUs with a different llama backend to offload a single model. Bear in mind this is not a typical use case (at least as far as the SYCL backend is concerned); this could very well create more issues than you expect. I'm not sure how you will be able to offload different parts of the model to your devices in a way that would improve performance. If you manage to get interesting results please share them in a discussion here though! I expect you will get better performance by sticking to a single discrete GPU (in your case the Nvidia one).

Rbiessy · Jul 18 '25 08:07

your issue doesn't seem related to the original issue here since the solution was found. Please create a new issue in the future to make the discussion easier.

ok

sycl-ls doesn't list the iGPU so you are probably missing the level_zero dependency.

Sorry I hadn't sourced Intel envvars.sh

$ sycl-ls
[opencl:gpu][opencl:0] Intel(R) OpenCL, Intel(R) Core(TM) i5-7200U CPU @ 2.50GHz OpenCL 3.0 (Build 0) [2025.20.6.0.04_224945]

(but same error from ggml_sycl_init() -> dpct::dev_mgr::dev_mgr())

You seem to want to use 2 GPUs with a different llama backend to offload a single model.

well formulated.

Bear in mind this is not a typical use case (at least as far as the SYCL backend is concerned); this could very well create more issues than you expect.

I wonder if there is anything in the codebase blocking this.

I'm not sure how you will be able to offload different parts of the model to your devices in a way that would improve performance. If you manage to get interesting results please share them in a discussion here though!

definitely

I expect you will get better performance by sticking to a single discrete GPU (in your case the Nvidia one).

Only 1950 MB fit into this one. And I believe the SYCL-enabled HD Graphics 620 may do a better job than the CPU, even if it's just RAM (at least, it needs to be tried).

/edit1 Using:

-DGGML_CUDA=1 -DCMAKE_CUDA_ARCHITECTURES=native -DGGML_CUDA_ENABLE_UNIFIED_MEMORY=ON -DGGML_CUDA_FORCE_CUBLAS=ON -DLLAMA_BUILD_SERVER=ON -DGGML_SYCL=ON -DCMAKE_C_COMPILER=/opt/intel/oneapi/compiler/2025.2/bin/icx -DCMAKE_CXX_COMPILER=/opt/intel/oneapi/compiler/2025.2/bin/icpx -DGGML_BACKEND_DL=OFF

I get (at runtime): llama.cpp was compiled without support for GPU offload. Setting the split mode has no effect

/edit2 With -DGGML_CPU=OFF, -DGGML_BACKEND_DL=OFF removed, and the cout uncommented in ggml/src/ggml-sycl/dpct/helper.hpp:

https://github.com/ggml-org/llama.cpp/blob/bf9087f59aab940cf312b85a67067ce33d9e365a/ggml/src/ggml-sycl/dpct/helper.hpp#L979-L989

I get back to the original issue from ggml_sycl_init()

ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce 940MX, compute capability 5.0, VMM: yes
**platform [Intel(R) OpenCL] does not contain GPU devices, skipping**    <-----
[throw at ggml_sycl_init()]

I believe sycl-ls is detecting devices more effectively than get_preferred_gpu_platform_name(). Here is the routine location: https://github.com/intel/llvm/blob/e7edd4975383bc40ff757a339e300dbf3cab4460/sycl/tools/sycl-ls/sycl-ls.cpp#L435-L457

(FTR, environment variables)

I extracted the helper.hpp logic to better debug it and it's clearly missing devices:

system has 1 platforms
platform [Intel(R) OpenCL has 1 devices
	 * opencl:cpu
platform [Intel(R) OpenCL] does not contain GPU devices, skipping
terminate called after throwing an instance of 'std::runtime_error'
  what():  can not find preferred GPU platform
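(For reference, a standalone enumerator along these lines reproduces that output format; this is a sketch of such a debug tool, not the exact extraction used above. Build with icpx -fsycl.)

#include <sycl/sycl.hpp>
#include <iostream>

// List every SYCL platform and its devices, roughly what the helper.hpp
// logic iterates over before concluding there is no preferred GPU.
int main() {
    const auto platforms = sycl::platform::get_platforms();
    std::cout << "system has " << platforms.size() << " platforms\n";
    for (const auto &p : platforms) {
        const auto devices = p.get_devices();
        std::cout << "platform [" << p.get_info<sycl::info::platform::name>()
                  << "] has " << devices.size() << " devices\n";
        for (const auto &d : devices)
            std::cout << "\t * " << d.get_info<sycl::info::device::name>()
                      << (d.is_gpu() ? " (gpu)" : "") << "\n";
    }
    return 0;
}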

drzraf · Jul 19 '25 02:07

The intel/llvm issue you created will be more helpful, but it really looks like an issue with level_zero not being installed on your system. Also note there is some logic in llama to avoid using opencl for GPU devices in https://github.com/ggml-org/llama.cpp/blob/b4efd77f8ab407836ca73a5176f041650c5b2411/ggml/src/ggml-sycl/dpct/helper.hpp#L966

The logic in that helper is to ensure users use level zero by default on Intel devices. You may be able to use opencl with your GPU by setting ONEAPI_DEVICE_SELECTOR=opencl:gpu but that's not supported.
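(For example, an unsupported experiment along these lines; the exact command is illustrative:)

$ ONEAPI_DEVICE_SELECTOR=opencl:gpu ./build/bin/llama-ls-sycl-device
# Restricts the oneAPI runtime to OpenCL GPU devices only, sidestepping
# the level-zero-first platform preference in helper.hpp.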

Rbiessy · Jul 21 '25 09:07

After installing intel-opencl-icd (from the Intel PPA), it gets better (I also discovered the SYCL_CONFIG_FILE_NAME environment variable):

ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: yes
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce 940MX, compute capability 5.0, VMM: yes
load_backend: loaded CUDA backend from build/bin/libggml-cuda.so
platform [Intel(R) OpenCL] does not contain GPU devices, skipping
platform [Intel(R) OpenCL Graphics] contains GPU devices, skipping
load_backend: loaded SYCL backend from build/bin/libggml-sycl.so
Available devices:
  CUDA0: NVIDIA GeForce 940MX (2002 MiB, 1974 MiB free)
get_memory_info: [warning] ext_intel_free_memory is not supported (export/set ZES_ENABLE_SYSMAN=1 to support), use total memory as free memory
  SYCL0: Intel(R) HD Graphics 620 (29588 MiB, 29588 MiB free)

I had a problem of

llama_model_load: error loading model: make_cpu_buft_list: no CPU backend found

and -DGGML_CPU=ON failing... unless -DGGML_CPU_ALL_VARIANTS:

CMake Error at ggml/src/ggml-cpu/CMakeLists.txt:364 (message):
  GGML_NATIVE is not compatible with GGML_BACKEND_DL, consider using
  GGML_CPU_ALL_VARIANTS
Call Stack (most recent call first):
  ggml/src/CMakeLists.txt:361 (ggml_add_cpu_backend_variant_impl)

So compiling with -DGGML_CPU_ALL_VARIANTS made it work.
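(Presumably the final working configure looked something like the line below; this is reconstructed from the flags mentioned across the thread, not copied from a shell history:)

cmake -B build -DGGML_CUDA=1 -DCMAKE_CUDA_ARCHITECTURES=native -DGGML_SYCL=ON -DCMAKE_C_COMPILER=icx -DCMAKE_CXX_COMPILER=icpx -DGGML_BACKEND_DL=ON -DGGML_NATIVE=OFF -DGGML_CPU_ALL_VARIANTS=ON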

But loading a model still fails with

/ggml/src/ggml-backend.cpp:750: pre-allocated tensor (blk.2.attn_q.weight) in a buffer (SYCL_Split) that cannot run the operation (NONE)

full log: log.txt

drzraf · Jul 25 '25 20:07

I've not seen a similar error before. What command are you using? From your log I think you are using the model gemma-3n-E2B-it-Q5_K_M.gguf; I'm able to run it fine with the SYCL backend alone using ./bin/llama-cli -m /path/to/gemma-3n-E2B-it-Q5_K_M.gguf -ngl 99 -no-cnv -p "What are the top 10 most beautiful countries?" for instance. Your issue may be due to trying to use multiple backends. I'm afraid I won't be able to help more than that. Also note the quantization type Q5_K has not been optimized for SYCL; I would recommend using Q4_0, Q4_K or potentially Q6_K.

Rbiessy · Jul 28 '25 13:07

Let me first mention what didn't work regarding device selection:

(Obviously, before anything else: . /opt/intel/oneapi/setvars.sh )

With -ngl 30, I was getting:

could not create a primitive
Exception caught at file:/ggml/src/ggml-sycl/ggml-sycl.cpp, line:2628, func:operator()
SYCL error: CHECK_TRY_ERROR(op(ctx, src0, src1, dst, src0_dd_i, src1_ddf_i, src1_ddq_i, dst_dd_i, dev[i].row_low, dev[i].row_high, src1_ncols, src1_padded_col_size, stream)): Exception caught in this line of code.
  in function ggml_sycl_op_mul_mat at /ggml/src/ggml-sycl/ggml-sycl.cpp:2628
/ggml/src/ggml-sycl/../ggml-sycl/common.hpp:126: SYCL error

(I think it was the no-offload at all situation)

And finally I tried --device:

$ build/bin/llama-cli -m /models/gemma-3n-E2B-it-IQ4_XS.gguf -p "write a poem about turtles" -dev SYCL0

And I get two kinds of errors, depending on whether -ngl X -sm row is passed or not.

Without, it's:

load_tensors: offloaded 0/31 layers to GPU
load_tensors: CPU_Mapped model buffer size = 2768,73 MiB
[...]
onednn_verbose,v1,info,graph,backend,0:dnnl_backend
onednn_verbose,v1,primitive,info,template:operation,engine,primitive,implementation,prop_kind,memory_descriptors,attributes,auxiliary,problem_desc,exec_time
onednn_verbose,v1,graph,info,template:operation,engine,partition_id,partition_kind,op_names,data_formats,logical_tensors,fpmath_mode,implementation,backend,exec_time
onednn_verbose,v1,primitive,error,gpu,jit::gemm,Functionality is unimplemented,src/gpu/intel/jit/gemm/gen_gemm_kernel.cpp:950
[...]
terminate called after throwing an instance of 'dnnl::error'
  what():  could not create a primitive

(stacktrace indicates ggml_sycl_mul_mat_batched_sycl > DnnlGemmWrapper::gemm)

details: no-ngl.txt

With -ngl 31 (but also any value > 0) and GGML_SYCL_DEBUG=1 ZES_ENABLE_SYSMAN=1:

$ GGML_SYCL_DEBUG=1 ZES_ENABLE_SYSMAN=1 ./build/bin/llama-cli -m ~/.cache/models/gemma-3n-E2B-it-IQ4_XS.gguf -p "write a poem about turtles" -dev SYCL0 -ngl 1 -sm row

load_tensors: SYCL_Split model buffer size = 2768,94 MiB
load_tensors: CPU_Mapped model buffer size = 288,00 MiB
[...]
/ggml/src/ggml-backend.cpp:750: pre-allocated tensor (blk.0.attn_q.weight) in a buffer (SYCL_Split) that cannot run the operation (NONE)

Backtrace indicates ggml_backend_sched_backend_id_from_cur

  • It fails sooner (it doesn't even load the chat template, contrary to the previous no-ngl attempt)
  • Notice the ZES_ENABLE_SYSMAN=1 warning still appears even though the environment variable was set.

details: ngl.txt

I'm sorry dears, but that's likely the deepest I can dig. If you want me to provide a better stack trace, or if you can suggest specific command lines or provide a patch, I'll happily apply and test though!

(I really wish we could do the best we can with the hardware we have.)

drzraf · Jul 29 '25 03:07

I can confirm that gemma-3n-E2B-it-IQ4_XS.gguf runs fine locally when offloading to a single GPU i.e.

$ cmake -Bbuild -GNinja -DCMAKE_BUILD_TYPE=Release -DGGML_SYCL=ON -DCMAKE_C_COMPILER=icx -DCMAKE_CXX_COMPILER=icpx -DGGML_SYCL_F16=ON
$ cd build && ninja llama-cli
$ ./bin/llama-cli -m ~/models/gemma-3n-E2B-it-IQ4_XS.gguf -p "write a poem about turtles"  -dev SYCL0 -ngl 99

It works fine for me with other values of ngl like 30 or 1. I can confirm that -sm row fails; split modes are not well supported with SYCL. I think your other issues are related to the split mode per layer being used by default. You may want to use -sm none for debugging purposes.
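(For instance, a debugging run along these lines; the flag combination here is illustrative, not a command from the thread:)

$ ./build/bin/llama-cli -m ~/.cache/models/gemma-3n-E2B-it-IQ4_XS.gguf -p "write a poem about turtles" -dev SYCL0 -ngl 99 -sm none
# -sm none keeps the whole model on the single selected device instead
# of splitting it across devices, which avoids the SYCL_Split buffers
# implicated in the errors above.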

Rbiessy · Jul 29 '25 10:07