Misc. bug: ggml-backend.cpp:746: pre-allocated tensor (cache_k_l0 (view) (copy of cache_k_l0 (view))) in a buffer (Vulkan0) that cannot run the operation (CPY)
Name and Version
llama-server.exe --version
version: 4764 (7ad0779f)
built with MSVC 19.42.34436.0 for x64
Operating systems
No response
Which llama.cpp modules do you know to be affected?
No response
Command line
llama-server.exe -m %file_path_16b% --no-mmap -fa -ctk q4_0 -c 8192 -np 2 -ngl 50 --temp 0.6 -t 10 -tb 8 -C FF000 --no-perf --host 0.0.0.0 --port 3000
Problem description & steps to reproduce
prompt eval time = 16975.44 ms / 282 tokens (60.20 ms per token, 16.61 tokens per second)
       eval time =  2257.84 ms /  28 tokens (80.64 ms per token, 12.40 tokens per second)
      total time = 19233.28 ms / 310 tokens
srv  log_server_r: request: POST /v1/chat/completions 127.0.0.1 200
srv  update_slots: all slots are idle
srv  params_from_: Chat format: Content-only
slot launch_slot_: id 1 | task 1773 | processing task
slot update_slots: id 1 | task 1773 | new prompt, n_ctx_slot = 4096, n_keep = 0, n_prompt_tokens = 1426
slot update_slots: id 1 | task 1773 | kv cache rm [64, end)
slot update_slots: id 1 | task 1773 | prompt processing progress, n_past = 1426, n_tokens = 1362, progress = 0.955119
slot update_slots: id 1 | task 1773 | prompt done, n_past = 1426, n_tokens = 1362
D:\a\llama.cpp\llama.cpp\ggml\src\ggml-backend.cpp:746: pre-allocated tensor (cache_k_l0 (view) (copy of cache_k_l0 (view))) in a buffer (Vulkan0) that cannot run the operation (CPY)
[process exited with code 3221226505 (0xc0000409)]
First Bad Commit
Please help resolve this error:
pre-allocated tensor (cache_k_l0 (view) (copy of cache_k_l0 (view))) in a buffer (Vulkan0) that cannot run the operation (CPY)
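What this message means: ggml's backend scheduler found a tensor that is already allocated in the Vulkan0 buffer, but no registered backend both supports that buffer type and implements the CPY op for this tensor's layout, so it aborts. Below is a self-contained toy model of that check, using simplified stand-in types rather than ggml's real structs (the real logic lives in ggml-backend.cpp; this is only an illustration of the failure mode):

```cpp
// Toy model of the failing check, NOT ggml's actual code: a tensor is already
// allocated in one backend's buffer, and the scheduler must find a backend
// that both understands that buffer type and implements the requested op.
#include <algorithm>
#include <cstdio>
#include <cstdlib>
#include <string>
#include <vector>

struct Backend {
    std::string name;
    std::vector<std::string> buffer_types;  // buffer types it can address
    std::vector<std::string> ops;           // operations it can execute
};

static bool contains(const std::vector<std::string> & v, const std::string & s) {
    return std::find(v.begin(), v.end(), s) != v.end();
}

// Mirrors the shape of the ggml-backend.cpp error: abort when no backend
// can run `op` on a tensor pre-allocated in `buffer`.
static int backend_for_prealloc(const std::vector<Backend> & backends,
                                const std::string & tensor,
                                const std::string & buffer,
                                const std::string & op) {
    for (size_t i = 0; i < backends.size(); ++i) {
        if (contains(backends[i].buffer_types, buffer) && contains(backends[i].ops, op)) {
            return (int) i;
        }
    }
    fprintf(stderr, "pre-allocated tensor (%s) in a buffer (%s) that cannot run the operation (%s)\n",
            tensor.c_str(), buffer.c_str(), op.c_str());
    abort();
}

int main() {
    // Assumed scenario: the Vulkan backend owns the K-cache buffer but lacks a
    // CPY kernel for copies into a q4_0-quantized cache view, while the CPU
    // backend has CPY but cannot address Vulkan0 device memory.
    std::vector<Backend> backends = {
        { "Vulkan0", { "Vulkan0" }, { "MUL_MAT", "ADD" } },
        { "CPU",     { "CPU" },     { "MUL_MAT", "ADD", "CPY" } },
    };
    backend_for_prealloc(backends, "cache_k_l0 (view)", "Vulkan0", "CPY"); // aborts
}
```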
Relevant log output
Flash attention is only supported by the Vulkan backend on Nvidia GPUs with a driver new enough to support VK_NV_cooperative_matrix2; last I checked, only a beta driver supported it. What does the device info string that the program prints when loading your GPU say?
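You can also check the extension directly with the `vulkaninfo` tool, or with a minimal standalone program. A sketch (an assumption on my part: this is not part of llama.cpp, just the stock Vulkan API) that reports, per device, whether VK_NV_cooperative_matrix2 is exposed:

```cpp
// Minimal standalone check: list each Vulkan physical device and whether it
// exposes the VK_NV_cooperative_matrix2 extension.
// Build with: g++ check_coopmat2.cpp -lvulkan
#include <vulkan/vulkan.h>
#include <cstdio>
#include <cstring>
#include <vector>

int main() {
    VkInstanceCreateInfo info = {};
    info.sType = VK_STRUCTURE_TYPE_INSTANCE_CREATE_INFO;
    VkInstance instance;
    if (vkCreateInstance(&info, nullptr, &instance) != VK_SUCCESS) {
        fprintf(stderr, "failed to create Vulkan instance\n");
        return 1;
    }

    uint32_t n_devices = 0;
    vkEnumeratePhysicalDevices(instance, &n_devices, nullptr);
    std::vector<VkPhysicalDevice> devices(n_devices);
    vkEnumeratePhysicalDevices(instance, &n_devices, devices.data());

    for (VkPhysicalDevice dev : devices) {
        VkPhysicalDeviceProperties props;
        vkGetPhysicalDeviceProperties(dev, &props);

        uint32_t n_ext = 0;
        vkEnumerateDeviceExtensionProperties(dev, nullptr, &n_ext, nullptr);
        std::vector<VkExtensionProperties> exts(n_ext);
        vkEnumerateDeviceExtensionProperties(dev, nullptr, &n_ext, exts.data());

        bool found = false;
        for (const auto & e : exts) {
            if (strcmp(e.extensionName, "VK_NV_cooperative_matrix2") == 0) {
                found = true;
                break;
            }
        }
        printf("%s: VK_NV_cooperative_matrix2 %s\n",
               props.deviceName, found ? "available" : "NOT available");
    }
    vkDestroyInstance(instance, nullptr);
    return 0;
}
```

On a driver that supports the coopmat2 flash-attention path, the line for the GPU should report the extension as available.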
The problem is probably related to K-cache quantization (the -ctk q4_0 flag), but I don't currently know exactly which configurations are supported and which are not.
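If that is the cause, a likely (unconfirmed) workaround is to drop -ctk q4_0 from the command line so the K cache stays in the default f16 format, or to drop -fa, since the crash happens in the CPY into the quantized cache view.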
This issue was closed because it has been inactive for 14 days since being marked as stale.