c0008

Results 6 comments of c0008

I ran into the same problem after I enabled flash attention. I have two graphics cards installed (2x 16GB), and the VRAM usage reported by Ollama is too high, which...

When I use Qwen3 32B Q5_K_M with Ollama, it limits me to a context length of 14000 before offloading kicks in. At this point I still have 7GB of 32GB...

The overestimation must come from a flawed memory calculation for the KV cache. The more context you use, the more off the numbers become. With a small model and a long context...
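For reference, the expected KV-cache footprint can be estimated directly from the model architecture. A minimal sketch, assuming Qwen3 32B uses 64 layers, 8 GQA KV heads, and a head dimension of 128 (these figures are my assumption, not taken from the comments above), with an f16 cache:

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, ctx_len, bytes_per_elem=2):
    # One K tensor and one V tensor per layer:
    # 2 * layers * kv_heads * head_dim * context_length * dtype size
    return 2 * n_layers * n_kv_heads * head_dim * ctx_len * bytes_per_elem

# Assumed Qwen3 32B values: 64 layers, 8 KV heads (GQA), head_dim 128,
# at the 14000-token context mentioned above, f16 cache entries.
size_gib = kv_cache_bytes(64, 8, 128, 14000) / 2**30
print(f"{size_gib:.2f} GiB")  # ≈ 3.42 GiB
```

If the true KV cache at 14000 tokens is only a few GiB, an estimator that predicts much more than that would explain why offloading starts long before the cards are actually full.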

I got it working now by switching the setting from "Chat Completion API" to "Completion API". It works because the Completion API does not send an empty system prompt. It would...
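To illustrate the difference, here is a sketch of the two request payloads. The exact field names follow the common OpenAI-style API shape, and the model name is only a placeholder; the key point is the empty system message that a chat-style client may inject:

```python
# Chat Completion style: some clients prepend an empty system message,
# which can override the model's built-in default system prompt.
chat_payload = {
    "model": "qwen3:32b",  # placeholder model name
    "messages": [
        {"role": "system", "content": ""},  # empty system prompt sent anyway
        {"role": "user", "content": "Hello"},
    ],
}

# Completion style: a raw prompt string, so no system message is injected
# and the model's own template/default prompt stays in effect.
completion_payload = {
    "model": "qwen3:32b",  # placeholder model name
    "prompt": "Hello",
}
```

Under this assumption, switching to the Completion API avoids the empty system prompt entirely rather than fixing the template itself.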

> I noticed that the default prompt template is just incorrect because they don't have a preset for Qwen. A correct prompt template should come with the gguf or mlx...

Is interleaved thinking supported? It can improve results by a lot. https://aigazine.com/industry/minimax-m2-gets-40-performance-boost-with-interleaved-thinking--ms