
Misc. bug: SYCL out of memory error

Open BenPortner opened this issue 1 year ago • 20 comments

Name and Version

ggml_sycl_init: GGML_SYCL_FORCE_MMQ: no
ggml_sycl_init: SYCL_USE_XMX: yes
ggml_sycl_init: found 1 SYCL devices:
version: 4404 (0827b2c1) built with MSVC 19.42.34435.0

Operating systems

Windows

Which llama.cpp modules do you know to be affected?

libllama (core library)

Problem description & steps to reproduce

Problem

I run into memory errors when using the SYCL backend. No error appears when running the same setup with the VULKAN backend (same model, prompt, context length, batch size, etc.). In the example below, the error says that 568 MB could not be allocated. This is strange because I have 16 GB of GPU memory (shared system memory, not dedicated). It seems the error is not specific to llama-cli because it also occurs when I use the Python bindings (llama-cpp-python). The error also occurs in earlier versions (I tried b4311).

Hardware

Dell Latitude 5420, Windows 10 Enterprise
CPU: 11th Gen Intel i7-1185G7 @ 3.00GHz, 4 cores, 8 logical processors, x86_64
RAM: 2x16GB Hynix 3200MHz DDR4 PC4-25600
GPU: Intel Iris Xe iGPU
Storage: Western Digital PC SN530 NVMe WDC 512GB M.2 SSD

Minimal error example

rem create very long prompt
python -c "f = open('prompt.txt', 'w'); prompt = 'bla '*40000; f.write(prompt); f.close();"

rem run llama-cli
llama-cli.exe -m "C:\path\to\Llama-3.2-3B-Instruct-Q4_0.gguf" --file prompt.txt -n 20 -ngl 99 -c 40100 --no-display-prompt

rem complete log attached
alloc: can't allocate 568118476 Bytes of memory on device/GPU
Enqueue process failed.
Exception caught at file:D:\a\llama.cpp\llama.cpp\ggml\src\ggml-sycl\ggml-sycl.cpp, line:3404, func:operator()
SYCL error: CHECK_TRY_ERROR(dpct::gemm_batch( *main_stream, oneapi::mkl::transpose::trans, oneapi::mkl::transpose::nontrans, ne01, ne11, ne10, alpha, (const void **)(ptrs_src.get() + 0 * ne23), dpct::library_data_t::real_half, nb01 / nb00, (const void **)(ptrs_src.get() + 1 * ne23), dpct::library_data_t::real_half, nb11 / nb10, beta, (void **)(ptrs_dst.get() + 0 * ne23), cu_data_type, ne01, ne23, cu_compute_type)): Meet error in this line code!
  in function ggml_sycl_mul_mat_batched_sycl at D:\a\llama.cpp\llama.cpp\ggml\src\ggml-sycl\ggml-sycl.cpp:3404
D:\a\llama.cpp\llama.cpp\ggml\src\ggml-sycl\..\ggml-sycl\common.hpp:111: SYCL error

First Bad Commit

No response

Relevant log output

C:\...\llama.cpp\b4404\sycl>llama-cli.exe -m "C:\path\to\Llama-3.2-3B-Instruct-Q4_0.gguf" --file prompt.txt -n 20 -ngl 99 -c 40100 --no-display-prompt
ggml_sycl_init: GGML_SYCL_FORCE_MMQ:   no
ggml_sycl_init: SYCL_USE_XMX: yes
ggml_sycl_init: found 1 SYCL devices:
build: 4404 (0827b2c1) with MSVC 19.42.34435.0 for
main: llama backend init
main: load the model and apply lora adapter, if any
get_memory_info: [warning] ext_intel_free_memory is not supported (export/set ZES_ENABLE_SYSMAN=1 to support), use total memory as free memory
llama_load_model_from_file: using device SYCL0 (Intel(R) Iris(R) Xe Graphics) - 14658 MiB free
llama_model_loader: loaded meta data with 35 key-value pairs and 255 tensors from C:\path\to\Llama-3.2-3B-Instruct-Q4_0.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = Llama 3.2 3B Instruct
llama_model_loader: - kv   3:                           general.finetune str              = Instruct
llama_model_loader: - kv   4:                           general.basename str              = Llama-3.2
llama_model_loader: - kv   5:                         general.size_label str              = 3B
llama_model_loader: - kv   6:                            general.license str              = llama3.2
llama_model_loader: - kv   7:                               general.tags arr[str,6]       = ["facebook", "meta", "pytorch", "llam...
llama_model_loader: - kv   8:                          general.languages arr[str,8]       = ["en", "de", "fr", "it", "pt", "hi", ...
llama_model_loader: - kv   9:                          llama.block_count u32              = 28
llama_model_loader: - kv  10:                       llama.context_length u32              = 131072
llama_model_loader: - kv  11:                     llama.embedding_length u32              = 3072
llama_model_loader: - kv  12:                  llama.feed_forward_length u32              = 8192
llama_model_loader: - kv  13:                 llama.attention.head_count u32              = 24
llama_model_loader: - kv  14:              llama.attention.head_count_kv u32              = 8
llama_model_loader: - kv  15:                       llama.rope.freq_base f32              = 500000.000000
llama_model_loader: - kv  16:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  17:                 llama.attention.key_length u32              = 128
llama_model_loader: - kv  18:               llama.attention.value_length u32              = 128
llama_model_loader: - kv  19:                          general.file_type u32              = 2
llama_model_loader: - kv  20:                           llama.vocab_size u32              = 128256
llama_model_loader: - kv  21:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv  22:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  23:                         tokenizer.ggml.pre str              = llama-bpe
llama_model_loader: - kv  24:                      tokenizer.ggml.tokens arr[str,128256]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  25:                  tokenizer.ggml.token_type arr[i32,128256]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  26:                      tokenizer.ggml.merges arr[str,280147]  = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv  27:                tokenizer.ggml.bos_token_id u32              = 128000
llama_model_loader: - kv  28:                tokenizer.ggml.eos_token_id u32              = 128009
llama_model_loader: - kv  29:                    tokenizer.chat_template str              = {{- bos_token }}\n{%- if custom_tools ...
llama_model_loader: - kv  30:               general.quantization_version u32              = 2
llama_model_loader: - kv  31:                      quantize.imatrix.file str              = /models_out/Llama-3.2-3B-Instruct-GGU...
llama_model_loader: - kv  32:                   quantize.imatrix.dataset str              = /training_dir/calibration_datav3.txt
llama_model_loader: - kv  33:             quantize.imatrix.entries_count i32              = 196
llama_model_loader: - kv  34:              quantize.imatrix.chunks_count i32              = 125
llama_model_loader: - type  f32:   58 tensors
llama_model_loader: - type q4_0:  193 tensors
llama_model_loader: - type q4_1:    3 tensors
llama_model_loader: - type q6_K:    1 tensors
llm_load_vocab: special tokens cache size = 256
llm_load_vocab: token to piece cache size = 0.7999 MB
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = BPE
llm_load_print_meta: n_vocab          = 128256
llm_load_print_meta: n_merges         = 280147
llm_load_print_meta: vocab_only       = 0
llm_load_print_meta: n_ctx_train      = 131072
llm_load_print_meta: n_embd           = 3072
llm_load_print_meta: n_layer          = 28
llm_load_print_meta: n_head           = 24
llm_load_print_meta: n_head_kv        = 8
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_swa            = 0
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 3
llm_load_print_meta: n_embd_k_gqa     = 1024
llm_load_print_meta: n_embd_v_gqa     = 1024
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 8192
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 0
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 500000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn  = 131072
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: ssm_dt_b_c_rms   = 0
llm_load_print_meta: model type       = 3B
llm_load_print_meta: model ftype      = Q4_0
llm_load_print_meta: model params     = 3.21 B
llm_load_print_meta: model size       = 1.78 GiB (4.77 BPW)
llm_load_print_meta: general.name     = Llama 3.2 3B Instruct
llm_load_print_meta: BOS token        = 128000 '<|begin_of_text|>'
llm_load_print_meta: EOS token        = 128009 '<|eot_id|>'
llm_load_print_meta: EOT token        = 128009 '<|eot_id|>'
llm_load_print_meta: EOM token        = 128008 '<|eom_id|>'
llm_load_print_meta: LF token         = 128 'Ä'
llm_load_print_meta: EOG token        = 128008 '<|eom_id|>'
llm_load_print_meta: EOG token        = 128009 '<|eot_id|>'
llm_load_print_meta: max token length = 256
get_memory_info: [warning] ext_intel_free_memory is not supported (export/set ZES_ENABLE_SYSMAN=1 to support), use total memory as free memory
get_memory_info: [warning] ext_intel_free_memory is not supported (export/set ZES_ENABLE_SYSMAN=1 to support), use total memory as free memory
llm_load_tensors: offloading 28 repeating layers to GPU
llm_load_tensors: offloading output layer to GPU
llm_load_tensors: offloaded 29/29 layers to GPU
llm_load_tensors:        SYCL0 model buffer size =  1825.40 MiB
llm_load_tensors:   CPU_Mapped model buffer size =   308.23 MiB
.........................................................................
llama_new_context_with_model: n_seq_max     = 1
llama_new_context_with_model: n_ctx         = 40128
llama_new_context_with_model: n_ctx_per_seq = 40128
llama_new_context_with_model: n_batch       = 2048
llama_new_context_with_model: n_ubatch      = 512
llama_new_context_with_model: flash_attn    = 0
llama_new_context_with_model: freq_base     = 500000.0
llama_new_context_with_model: freq_scale    = 1
llama_new_context_with_model: n_ctx_per_seq (40128) < n_ctx_train (131072) -- the full capacity of the model will not be utilized
[SYCL] call ggml_check_sycl
ggml_check_sycl: GGML_SYCL_DEBUG: 0
ggml_check_sycl: GGML_SYCL_F16: no
Found 1 SYCL devices:
|  |                   |                                       |       |Max    |        |Max  |Global |                     |
|  |                   |                                       |       |compute|Max work|sub  |mem    |                     |
|ID|        Device Type|                                   Name|Version|units  |group   |group|size   |       Driver version|
|--|-------------------|---------------------------------------|-------|-------|--------|-----|-------|---------------------|
| 0| [level_zero:gpu:0]|                 Intel Iris Xe Graphics|   12.0|     96|     512|   32| 15370M|            1.3.29803|
llama_kv_cache_init: kv_size = 40128, offload = 1, type_k = 'f16', type_v = 'f16', n_layer = 28
llama_kv_cache_init:      SYCL0 KV buffer size =  4389.00 MiB
llama_new_context_with_model: KV self size  = 4389.00 MiB, K (f16): 2194.50 MiB, V (f16): 2194.50 MiB
llama_new_context_with_model:  SYCL_Host  output buffer size =     0.49 MiB
llama_new_context_with_model:      SYCL0 compute buffer size =  1983.38 MiB
llama_new_context_with_model:  SYCL_Host compute buffer size =    84.38 MiB
llama_new_context_with_model: graph nodes  = 902
llama_new_context_with_model: graph splits = 2
common_init_from_params: setting dry_penalty_last_n to ctx_size = 40128
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
main: llama threadpool init, n_threads = 4

system_info: n_threads = 4 (n_threads_batch = 4) / 8 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | LLAMAFILE = 1 | OPENMP = 1 | AARCH64_REPACK = 1 |

sampler seed: 3140513417
sampler params:
        repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
        dry_multiplier = 0.000, dry_base = 1.750, dry_allowed_length = 2, dry_penalty_last_n = 40128
        top_k = 40, top_p = 0.950, min_p = 0.050, xtc_probability = 0.000, xtc_threshold = 0.100, typical_p = 1.000, temp = 0.800
        mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampler chain: logits -> logit-bias -> penalties -> dry -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist
generate: n_ctx = 40128, n_batch = 2048, n_predict = 20, n_keep = 1

alloc: can't allocate 568118476 Bytes of memory on device/GPU
alloc: can't allocate 568118476 Bytes of memory on device/GPU
alloc: can't allocate 568118476 Bytes of memory on device/GPU
alloc: can't allocate 568118476 Bytes of memory on device/GPU
alloc: can't allocate 568118476 Bytes of memory on device/GPU
alloc: can't allocate 568118476 Bytes of memory on device/GPU
alloc: can't allocate 568118476 Bytes of memory on device/GPU
alloc: can't allocate 568118476 Bytes of memory on device/GPU
alloc: can't allocate 568118476 Bytes of memory on device/GPU
alloc: can't allocate 568118476 Bytes of memory on device/GPU
alloc: can't allocate 568118476 Bytes of memory on device/GPU
alloc: can't allocate 568118476 Bytes of memory on device/GPU
alloc: can't allocate 568118476 Bytes of memory on device/GPU
Enqueue process failed.
Exception caught at file:D:\a\llama.cpp\llama.cpp\ggml\src\ggml-sycl\ggml-sycl.cpp, line:3404, func:operator()
SYCL error: CHECK_TRY_ERROR(dpct::gemm_batch( *main_stream, oneapi::mkl::transpose::trans, oneapi::mkl::transpose::nontrans, ne01, ne11, ne10, alpha, (const void **)(ptrs_src.get() + 0 * ne23), dpct::library_data_t::real_half, nb01 / nb00, (const void **)(ptrs_src.get() + 1 * ne23), dpct::library_data_t::real_half, nb11 / nb10, beta, (void **)(ptrs_dst.get() + 0 * ne23), cu_data_type, ne01, ne23, cu_compute_type)): Meet error in this line code!
  in function ggml_sycl_mul_mat_batched_sycl at D:\a\llama.cpp\llama.cpp\ggml\src\ggml-sycl\ggml-sycl.cpp:3404
D:\a\llama.cpp\llama.cpp\ggml\src\ggml-sycl\..\ggml-sycl\common.hpp:111: SYCL error

BenPortner avatar Jan 02 '25 14:01 BenPortner

Hi. Does running with a reduced ctx length, such as -c 8192, work?

qnixsynapse avatar Jan 02 '25 16:01 qnixsynapse

@qnixsynapse No, it does not, because the prompt contains ~40k tokens.

BenPortner avatar Jan 02 '25 17:01 BenPortner

@qnixsynapse No, it does not, because the prompt contains ~40k tokens.

Try with -nkvo

qnixsynapse avatar Jan 03 '25 03:01 qnixsynapse

@qnixsynapse It works with -nkvo, although much slower (by a factor of 3-4) than VULKAN. The main question remains: why does it work with VULKAN but not with SYCL*?

*(unless switching off KV offloading, which makes it very slow)

BenPortner avatar Jan 03 '25 13:01 BenPortner

@BenPortner My guess is that at >40,000 tokens the KV cache size + model size grows beyond the memory that has been reserved for your iGPU (probably more than 6 GB), which causes the OOM. I think you can check the available GPU memory with a program called GPU-Z on Windows and compare the buffer usages (including model sizes) for both the SYCL and Vulkan backends from the logs.

qnixsynapse avatar Jan 03 '25 13:01 qnixsynapse

@qnixsynapse Thanks for the tip about GPU-Z. I will check it out. I still don't understand why the out of memory error occurs, though. Model + KV Buffer amount to ~6.5 GB. The available iGPU memory is ~15 GB as per the log output. Both model and KV buffer should easily fit. Also, both model and KV buffer are the same size when running VULKAN. So why does it work with VULKAN but not with SYCL? It seems that something memory-inefficient happens within the SYCL backend, which causes the error.
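As a cross-check on these numbers: the 4389 MiB KV cache reported in the log follows directly from the metadata printed above (n_layer = 28, n_embd_k_gqa = n_embd_v_gqa = 1024, kv_size = 40128, f16 elements). A minimal sketch, not llama.cpp's actual accounting code:

// Rough KV-cache size estimate from the values in the log above.
// Sketch only; the real accounting happens in llama_kv_cache_init.
#include <cstdio>

int main() {
    const long long n_layer   = 28;     // llama.block_count
    const long long n_ctx     = 40128;  // kv_size reported by llama_kv_cache_init
    const long long n_embd_kv = 1024;   // n_embd_k_gqa == n_embd_v_gqa
    const long long bytes_f16 = 2;      // type_k = type_v = 'f16'

    // K and V tensors, one pair per layer
    const long long kv_bytes = 2 * n_layer * n_ctx * n_embd_kv * bytes_f16;
    std::printf("KV cache: %.2f MiB\n", kv_bytes / (1024.0 * 1024.0));  // prints 4389.00 MiB
    return 0;
}

Together with the 1825 MiB model buffer and the ~1983 MiB compute buffer, this is still well below the ~15 GB the device reports as free.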

BenPortner avatar Jan 03 '25 15:01 BenPortner

It seems that something memory-inefficient happens within the SYCL backend

@BenPortner The kernels are definitely unoptimized in SYCL. I have plans to optimize them this year if Intel is not interested in doing so.

qnixsynapse avatar Jan 05 '25 13:01 qnixsynapse

Hi @qnixsynapse, it would be amazing to see more optimizations in the SYCL backend! It is the fastest backend on Intel iGPUs after all (with small contexts / when using KV buffer offloading). I'm not a C/C++ programmer, but let me know if I can help somehow. Perhaps I can open an issue in one of their repos? For this, we would have to be sure that the issue is not within llama.cpp/ggml, though.

BenPortner avatar Jan 05 '25 17:01 BenPortner

llama3-8-int4 with a big context will take more than 16 GB of memory. I have seen a similar case on an Arc 770. In this case, I suggest adjusting the parameters to strike a balance.

Additionally, @qnixsynapse, I also plan to optimize the SYCL backend on Intel GPUs this year. Please go ahead if you want to. Maybe we need to consider the impact on iGPUs (like the iGPU of 11th/12th Gen Core) too.

Thank you!

NeoZhangJianyu avatar Jan 07 '25 04:01 NeoZhangJianyu

@BenPortner If your issue is resolved, could you close this issue?

Thank you!

NeoZhangJianyu avatar Jan 07 '25 04:01 NeoZhangJianyu

Additionally, @qnixsynapse, I also plan to optimize the SYCL backend on Intel GPUs this year. Please go ahead if you want to. Maybe we need to consider the impact on iGPUs (like the iGPU of 11th/12th Gen Core) too.

Sounds great. I would love to collaborate. Our first priority is to implement flash attention, which can reduce memory usage. The Vulkan backend currently has it with coopmat support.

qnixsynapse avatar Jan 07 '25 05:01 qnixsynapse

@BenPortner If your issue is resolved, could you close this issue?

Although I now better understand why this error occurs, I wouldn't call it resolved. Perhaps it is useful to keep this issue open for further discussion and coordination of the development tasks? I'll leave it up to you, though. You're the devs :)

BenPortner avatar Jan 07 '25 10:01 BenPortner

Perhaps this could be relevant: https://forums.developer.nvidia.com/t/why-is-cl-device-max-mem-alloc-size-never-larger-than-25-of-cl-device-global-mem-size-only-on-nvidia/47745/10

TL;DR: The OpenCL standard somewhat arbitrarily states that CL_DEVICE_MAX_MEM_ALLOC_SIZE can never be larger than 1/4 of the actual GPU memory. "Developers can try to allocate more memory than CL_DEVICE_MAX_MEM_ALLOC_SIZE, but the successful allocation is not guaranteed (this is same for any allocation call). The developers should check for error returned by clCreateBuffer and use the allocation only if the call returns CL_SUCCESS".

I know SYCL is not the same as OpenCL, but since both are defined by the Khronos Group, perhaps the underlying limitation is the same? If yes, it might be worth ignoring this artificial limitation?
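For what it's worth, the per-allocation limit that SYCL reports can be checked directly with the standard SYCL 2020 device info queries; a minimal sketch (compile with a SYCL compiler, e.g. icpx -fsycl):

// Print the global memory size and the per-allocation limit for every GPU
// the SYCL runtime exposes. Sketch only, using standard SYCL 2020 queries.
#include <sycl/sycl.hpp>
#include <iostream>

int main() {
    for (const auto &dev : sycl::device::get_devices(sycl::info::device_type::gpu)) {
        std::cout << dev.get_info<sycl::info::device::name>() << "\n"
                  << "  global_mem_size   : "
                  << dev.get_info<sycl::info::device::global_mem_size>() / (1024 * 1024) << " MiB\n"
                  << "  max_mem_alloc_size: "
                  << dev.get_info<sycl::info::device::max_mem_alloc_size>() / (1024 * 1024) << " MiB\n";
    }
    return 0;
}

If max_mem_alloc_size comes out much smaller than global_mem_size, that would support the theory that a single large allocation is what fails here.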

BenPortner avatar Jan 09 '25 16:01 BenPortner

That is not the problem here. In your case the model weights are successfully loaded into memory. The problem happens when the gemm_batch kernel tries to compute the batched matrix multiplication. It ends up using too much memory, falls 568 MB short, and crashes. Ideally, I would prefer half of its job to be handled by an optimised flash attention kernel.

You can test my theory by passing --no-warmup and a smaller batch size of something like 64 (by also passing -b 64 as a command-line argument).
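For example, adapting the reporter's command from above (same placeholder model path):

llama-cli.exe -m "C:\path\to\Llama-3.2-3B-Instruct-Q4_0.gguf" --file prompt.txt -n 20 -ngl 99 -c 40100 -b 64 --no-warmup --no-display-prompt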

qnixsynapse avatar Jan 09 '25 16:01 qnixsynapse

Hi @qnixsynapse, I kept investigating and it seems that the 4GB allocation limit is a problem for Intel+SYCL after all: https://github.com/intel/llvm/issues/10946. If I understand the issue right, then even if you fix the batched matrix multiplication, I will eventually run into OOM if llama.cpp at any point tries to allocate a >4GB buffer in memory. Unless there are safeguards implemented against this on your side?

BenPortner avatar Jan 13 '25 09:01 BenPortner

@BenPortner Hi, that is what I understand as well: inside SYCL you cannot set the special allocation flags needed in the memory allocation calls to make this work through the SYCL -> OpenCL/Level Zero backends, even if you can set the appropriate compiler flags for those backends. The problem only affects older Intel GPUs, integrated and discrete, based on the original Xe architecture; Ponte Vecchio seems to be the only exception. The issue seems to have been fixed with Battlemage/Lunar Lake Xe2 GPUs.

simonlui avatar Jan 14 '25 06:01 simonlui

I'm still trying to wrap my head around things here. The fact that llama.cpp+VULKAN backend manages to allocate >4GB buffers on my Tiger Lake iGPU just fine makes me think that this is not a hardware limitation.

@simonlui Thanks for chiming in! You mention that the >4GB buffer problem does not occur on newer Intel GPUs. Do you know why? Would they still require the "special allocation flags" to handle buffers >4GB?

@qnixsynapse Does the llama.cpp VULKAN backend somehow split buffers internally into <4GB chunks before allocating them? If not, then it seems to me that the limitation is not imposed by the hardware but by the drivers or APIs.

BenPortner avatar Jan 14 '25 14:01 BenPortner

@BenPortner From what I understand of what Intel engineers have said about this issue, it has to do with having native int64 functionality and being able to do memory operations like addressing quickly, without hitting a performance penalty from translation; this is a restriction inside their compute runtime/drivers. This int64 functionality seems to correspond to FP64 functionality, which was missing from Intel Xe except in HPC with Ponte Vecchio (hence why that part seems unaffected), and I think some iGPUs after Alchemist had FP64 too. They have now re-implemented FP64 in hardware, which sidesteps the issue entirely.

simonlui avatar Jan 14 '25 19:01 simonlui

@BenPortner If your issue is resolved, could you close this issue?

Although I now better understand why this error occurs, I wouldn't call it resolved. Perhaps it is useful to keep this issue open for further discussion and coordination of the development tasks? I'll leave it up to you, though. You're the devs :)

The answer in this issue will help other users with the same problem. If this issue stays open for a long time, it could make other users think the SYCL backend always has out-of-memory issues, and some of them would stop trying it.

So, if we provide a workaround and it works in your case, please close the issue as soon as possible.

For further requirements, like wanting the SYCL backend to reach the same memory usage via flash attention, please create a feature issue to track it. An issue that runs too long (in time and content) doesn't help the discussion.

Thank you!

NeoZhangJianyu avatar Jan 15 '25 03:01 NeoZhangJianyu

Hello @NeoZhangJianyu, I would like to turn this into a feature issue, but unfortunately that is very hard for me as a user. I do not know the llama.cpp code well enough to locate problems. Furthermore, I don't know enough about LLM engines to propose improvements, like the flash attention mechanism you mention. For me, llama.cpp is kind of a black box: I can throw inputs at it and compare the outputs. This is enough to report problems but not enough to create a feature ticket. That being said, I'll be happy if you or any of the involved devs turn this issue into a feature ticket :)

BenPortner avatar Jan 20 '25 08:01 BenPortner

This issue was closed because it has been inactive for 14 days since being marked as stale.

github-actions[bot] avatar Mar 06 '25 01:03 github-actions[bot]

I'm trying to run small models (<5GB, e.g. 7B Q4_K_M) on my 8GB Arc A770. The model gets loaded into shared GPU memory, not the dedicated one. I assume this is due to the 4GB limit; loading only 23 out of the 29 layers into the GPU works. Sounds like it's an architecture issue, and unless it's possible to process the same model on one GPU twice in parallel, there is no workaround?

.\llama-server.exe -m '\DeepSeek-R1-Distill-Qwen-7B-Q4_K_M.gguf' -ngl 99 --ubatch-size 128 --ctx-size 128

[Image attached]

LOnstet avatar Mar 22 '25 14:03 LOnstet

I'm still getting this problem with a B580; compared to vulkan, the processing speed more than triples, but the generation speed drops from 7.3 t/s to 5.0 t/s with -nkvo. Has there been a fix or anything?

AaronBeier avatar Oct 09 '25 13:10 AaronBeier

If there is an integrated GPU (iGPU) in the CPU, the llama.cpp SYCL backend will use both the iGPU and the dGPU to load the LLM and run on them. This supports bigger LLMs by combining dedicated and shared memory, but it is slower than running on the dGPU alone.

Use the following method to select only the dGPU:

export ONEAPI_DEVICE_SELECTOR="level_zero:0"
or
set ONEAPI_DEVICE_SELECTOR="level_zero:0"

Whether it is level_zero:0 or level_zero:1 depends on your system.
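If it is unclear which index maps to which GPU, the sycl-ls tool that ships with the oneAPI toolkit lists all available devices with their backend and index (assuming the oneAPI environment is initialized):

sycl-ls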

NeoZhangJianyu avatar Oct 10 '25 00:10 NeoZhangJianyu

If there is an integrated GPU (iGPU) in the CPU, the llama.cpp SYCL backend will use both the iGPU and the dGPU to load the LLM and run on them. This supports bigger LLMs by combining dedicated and shared memory, but it is slower than running on the dGPU alone.

Use the following method to select only the dGPU:

export ONEAPI_DEVICE_SELECTOR="level_zero:0"
or
set ONEAPI_DEVICE_SELECTOR="level_zero:0"

Whether it is level_zero:0 or level_zero:1 depends on your system.

Adding ONEAPI_DEVICE_SELECTOR="level_zero:0" somehow made it run better, even though my AMD iGPU doesn't even show up in SYCL, but I still get that crash when I go over ~9216 context. My 4070 12GB fits the same 12B model with 12288 context, but my B580 with SYCL only fits 9216; surely a difference of ~3000 tokens is not normal, right? With Vulkan it fits the 12288 context just fine.

When using SYCL, sometimes it crashes with

UR backend failed. UR backend returns:40 (UR_RESULT_ERROR_OUT_OF_RESOURCES)Exception caught at file:llama.cpp/ggml/src/ggml-sycl/ggml-sycl.cpp, line:2918

and sometimes with

alloc: can't allocate 281857228 Bytes of memory on device/GPU
llama.cpp/build/bin/libggml-base.so(+0xe4d8) [0x7f56714b34d8]
llama.cpp/build/bin/libggml-base.so(ggml_print_backtrace+0x210) [0x7f56714b34b0]
llama.cpp/build/bin/libggml-base.so(+0x23ba6) [0x7f56714c8ba6]
/usr/lib/libstdc++.so.6(+0xb1eba) [0x7f56708b1eba]
/usr/lib/libstdc++.so.6(_ZSt10unexpectedv+0x0) [0x7f56708975d9]
/usr/lib/libstdc++.so.6(+0xb2176) [0x7f56708b2176]
llama.cpp/build/bin/libggml-sycl.so(+0x7a791) [0x7f5670c7a791]
llama.cpp/build/bin/libggml-sycl.so(+0x77ece) [0x7f5670c77ece]
llama.cpp/build/bin/libggml-sycl.so(+0x7c5e6) [0x7f5670c7c5e6]
llama.cpp/build/bin/libggml-sycl.so(+0x5e353) [0x7f5670c5e353]
llama.cpp/build/bin/libggml-sycl.so(+0x592ee) [0x7f5670c592ee]
llama.cpp/build/bin/libggml-sycl.so(+0x54c2d) [0x7f5670c54c2d]
llama.cpp/build/bin/libggml-sycl.so(+0x52467) [0x7f5670c52467]
llama.cpp/build/bin/libggml-base.so(ggml_backend_sched_graph_compute_async+0x1069) [0x7f56714d5e59]
llama.cpp/build/bin/libllama.so(_ZN13llama_context13graph_computeEP11ggml_cgraphb+0xa1) [0x7f567128b2c1]
llama.cpp/build/bin/libllama.so(_ZN13llama_context14process_ubatchERK12llama_ubatch14llm_graph_typeP22llama_memory_context_iR11ggml_status+0x365) [0x7f567128af45]
llama.cpp/build/bin/libllama.so(_ZN13llama_context6decodeERK11llama_batch+0x54e) [0x7f567128c78e]
llama.cpp/build/bin/libllama.so(llama_decode+0xb) [0x7f5671290c7b]
llama.cpp/build/bin/llama-server() [0x4d34ac]
llama.cpp/build/bin/llama-server() [0x450c44]
llama.cpp/build/bin/llama-server() [0x41fc7d]
/usr/lib/libc.so.6(+0x27675) [0x7f5670427675]
/usr/lib/libc.so.6(__libc_start_main+0x89) [0x7f5670427729]
llama.cpp/build/bin/llama-server() [0x41c165]
terminate called after throwing an instance of 'dnnl::error'
  what():  could not create a memory object
Aborted                    (core dumped)

I also get

llama_context: layer 0 is assigned to device SYCL0 but the Flash Attention tensor is assigned to device CPU (usually due to missing support)

which I don't know what to do about (I thought FA was a CUDA/NVIDIA-only thing), and

get_memory_info: [warning] ext_intel_free_memory is not supported (export/set ZES_ENABLE_SYSMAN=1 to support), use total memory as free memory

even after adding ZES_ENABLE_SYSMAN=1. I'm on 6719 (aa4711d3) with Intel(R) oneAPI DPC++/C++ Compiler 2025.0.4 (2025.0.4.20241205). Should I open a new issue for this?

AaronBeier avatar Oct 10 '25 13:10 AaronBeier

@AaronBeier It would be good to create a new issue for your case. I need the full log to check it.

Your issue is about memory usage. A bigger context needs more free memory.

  1. If fewer layers are loaded onto the GPU, there is more free memory for the context.
  2. The free dGPU memory depends on how much the driver/system reserves.

"Flash Attention" is not supported by SYCL now. So it's handled by CPU. I will implement all missed supported OPs of SYCL.

Thank you!

NeoZhangJianyu avatar Oct 11 '25 01:10 NeoZhangJianyu