c0008

Results 6 comments of c0008

I ran into the same problem after I enabled flash attention. I have two graphics cards installed (2x 16GB), and the VRAM usage reported by Ollama is too high, which...

When I use Qwen3 32B Q5_K_M with Ollama, it limits me to a context length of 14000 before offloading kicks in. At this point I still have 7GB of 32GB...

The overestimation must come from a flawed memory calculation for the KV cache. The more context you use, the more off the numbers become. With a small model and a long context...
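For reference, the expected KV-cache footprint can be estimated directly from the model architecture. A minimal sketch, assuming Qwen3 32B uses 64 layers, 8 GQA KV heads, and a head dimension of 128 (these figures are my assumption, not taken from the comments above), with an f16 cache:

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, ctx_len, bytes_per_elem=2):
    # One K tensor and one V tensor per layer:
    # 2 * layers * kv_heads * head_dim * context_length * dtype size
    return 2 * n_layers * n_kv_heads * head_dim * ctx_len * bytes_per_elem

# Assumed Qwen3 32B values: 64 layers, 8 KV heads (GQA), head_dim 128,
# at the 14000-token context mentioned above, f16 cache entries.
size_gib = kv_cache_bytes(64, 8, 128, 14000) / 2**30
print(f"{size_gib:.2f} GiB")  # ≈ 3.42 GiB
```

If the true KV cache at 14000 tokens is only a few GiB, an estimator that predicts much more than that would explain why offloading starts long before the cards are actually full.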

I got it working now by switching the setting from "Chat Completion API" to "Completion API". It works because the Completion API does not send an empty system prompt. It would...
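To illustrate the difference, here is a sketch of the two request payloads. The exact field names follow the common OpenAI-style API shape, and the model name is only a placeholder; the key point is the empty system message that a chat-style client may inject:

```python
# Chat Completion style: some clients prepend an empty system message,
# which can override the model's built-in default system prompt.
chat_payload = {
    "model": "qwen3:32b",  # placeholder model name
    "messages": [
        {"role": "system", "content": ""},  # empty system prompt sent anyway
        {"role": "user", "content": "Hello"},
    ],
}

# Completion style: a raw prompt string, so no system message is injected
# and the model's own template/default prompt stays in effect.
completion_payload = {
    "model": "qwen3:32b",  # placeholder model name
    "prompt": "Hello",
}
```

Under this assumption, switching to the Completion API avoids the empty system prompt entirely rather than fixing the template itself.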

> I noticed that the default prompt template is just incorrect because they don't have a preset for Qwen. A correct prompt template should come with the gguf or mlx...

Is interleaved thinking supported? It can improve results by a lot. https://aigazine.com/industry/minimax-m2-gets-40-performance-boost-with-interleaved-thinking--ms