Gadflyii
PR #16310 added a new "--no-host" parameter that disables the host buffer, which allows the extra buffer types (Repack + AMX) to be used and enables AMX acceleration on CPU layers when a GPU is present.
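A minimal invocation sketch, assuming a llama.cpp build with AMX support and a partially offloaded model; the model path and `-ngl` split are placeholders for illustration:

```bash
# Disable the host buffer so CPU layers use the extra buffer types
# (Repack/AMX) instead of pinned host memory.
# Model path and -ngl value are illustrative only.
./build/bin/llama-cli \
  -m ./models/model-q8_0.gguf \
  -ngl 20 \
  --no-host \
  -p "Hello"
```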
> You can try [ik_llama.cpp](https://github.com/ikawrakow/ik_llama.cpp), which does not use AMX in CPU+GPU hybrid setups unless something has changed in the past few weeks. They are still using the upstream engine from llama.cpp...
> > You can try [ik_llama.cpp](https://github.com/ikawrakow/ik_llama.cpp)
>
> I can't find a way to compile this with CUDA 12.8 on Linux either; the RTX 50 series requires a minimum of CUDA 12.8.

I...
Any plans to implement AMX then? I am a big fan of ik_llama.cpp, but I have not run it in a long time. I could try to do a run of...
I have a fork of llama.cpp that enables AMX in hybrid environments; it gives roughly a 20-30% speedup in the CPU inference portion of offloaded models, if you want to try...
If you are running a hybrid CPU/GPU setup, try my llama.cpp fork; it works with AMX INT8/BF16 (I don't have a 6th gen to try INT4). https://github.com/Gadflyii/llama.cpp Build as usual...
Did you build with all the AMX flags? You need: -DGGML_NATIVE=ON -DGGML_CUDA=ON -DGGML_AMX_TILE=ON -DGGML_AMX_INT8=ON -DGGML_AMX_BF16=ON. "--amx" is only used as a run-time flag as part of the command. Example:...
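A build-and-run sketch based on the flags above; the `--amx` run-time flag is the fork's own switch as described in the comment, and the model path and `-ngl` value are assumptions for illustration:

```bash
# Configure and build the fork with CUDA plus the AMX compile options
# listed in the comment above.
git clone https://github.com/Gadflyii/llama.cpp
cd llama.cpp
cmake -B build \
  -DGGML_NATIVE=ON \
  -DGGML_CUDA=ON \
  -DGGML_AMX_TILE=ON \
  -DGGML_AMX_INT8=ON \
  -DGGML_AMX_BF16=ON
cmake --build build --config Release -j

# Run with the fork's --amx flag to enable AMX on the CPU layers.
# Model path and -ngl split are placeholders.
./build/bin/llama-cli -m ./models/model-q8_0.gguf -ngl 20 --amx -p "Hello"
```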
I enabled discussions; I really appreciate you helping me test it.
What doesn't work correctly? I don't think it is fair to say the project is dead; they made a major commit just last week to add support for qwen-next.
I hope they fix it; this is a really cool project.