leikareipa

Results: 15 comments by leikareipa

I had a play with this on the Nvidia Playground a few days ago, but the results were a bit questionable; it seemed like even they were having trouble setting up their own deployment.

With it now supported in Ollama 0.1.28, I'm seeing similarly questionable generation to what I got on the Nvidia Playground, if anything worse. For example, `$ ./ollama run starcoder2:15b-q4_K_M "Write a JavaScript...

Using better completion-style prompts gave better results, though the prompts really have to be massaged sometimes or the output is way off. The model also never stops when it should,...
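To make the "completion-style" idea concrete: a base code model like starcoder2 continues text rather than following instructions, so the prompt works better framed as code to be finished. A minimal sketch (the helper below is illustrative, not part of any Ollama API):

```python
# Sketch: build a completion-style prompt for a base code model.
# Instead of "Write a JavaScript function that...", give it a comment
# plus the opening of the function and let it continue from there.

def completion_prompt(comment: str, signature: str) -> str:
    """Frame the request as code to be continued, not as an instruction."""
    return f"// {comment}\n{signature}"

prompt = completion_prompt(
    "Returns the sum of two numbers.",
    "function sum(a, b) {",
)
print(prompt)
```

Because the model just continues the text, a stop sequence like `}` usually has to be set by the caller, which fits the observation above that it never stops on its own.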

This is with 12 GB of VRAM total, so all versions max it out; the bigger the model, the more of it runs CPU-side. Should be the same context...
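As a rough illustration of why larger quants spill to the CPU (hypothetical sizes, and not Ollama's actual offload calculation, which works per-layer and also reserves VRAM for the KV cache and compute buffers):

```python
# Rough back-of-envelope: with VRAM fixed at 12 GB, the fraction of the
# model weights that ends up in system RAM grows with model size.
# Sizes below are illustrative placeholders, not measured quant sizes.
VRAM_GB = 12.0

def cpu_fraction(model_size_gb: float, vram_gb: float = VRAM_GB) -> float:
    """Return the fraction of the model that does not fit in VRAM."""
    spill = max(0.0, model_size_gb - vram_gb)
    return spill / model_size_gb

for size in (9.0, 16.0, 32.0):
    print(f"{size:5.1f} GB model -> {cpu_fraction(size):.0%} on CPU")
```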

No flash attention and no KV cache quantization; all settings other than context length should be at their defaults. I ran my five-test bench on Ollama's FP16 version and it got 50%, same...
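For context, a score like that 50% is just the average over per-test results. A minimal sketch of how such a bench score could be computed (the test values below are hypothetical placeholders, not the actual five tests):

```python
# Minimal pass/fail bench scorer. Each entry is a per-test score in
# [0.0, 1.0]; the bench result is the average as a percentage.

def bench_score(per_test: list[float]) -> float:
    """Average per-test score, expressed as a percentage."""
    return 100.0 * sum(per_test) / len(per_test)

# Hypothetical example: two passes, one half-credit, two failures.
print(bench_score([1.0, 1.0, 0.5, 0.0, 0.0]))  # -> 50.0
```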

Thanks for the idea; I'll see if someone beats me to testing it, since it would be even slower. I assume you could just disable GPU compute altogether with `CUDA_VISIBLE_DEVICES=-1`?...
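If that assumption holds, a CPU-only run might look like this (untested command sketch; the variable would need to be set on the process that starts the Ollama server, since that's what allocates the GPU):

```shell
# Untested sketch: hide all CUDA devices from the Ollama server so that
# inference falls back to the CPU. -1 is not a valid device ID, so CUDA
# should see no usable GPUs.
CUDA_VISIBLE_DEVICES=-1 ollama serve &

# Then run the model as usual; generation should now be CPU-only.
ollama run starcoder2:15b-q4_K_M "..."
```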

Did another test with Q4 vs Q8, and now also vs Q8 without GPU. The prompt (backticks escaped for formatting reasons here but not in the original): ``` \`\`\`js //...

Q4, Q8, and FP16 were pulled via Ollama, but the weights on Ollama were updated about a day after release; the Q4 I have is pre-update. I think somebody found Qwen...

If the post-update weights perform worse than the pre-update weights, then wouldn't you say it's a problem all the same? It would be interesting to see what results others are...

Not sure it's useful to get meta about who's to blame; this is 100% usage within Ollama, and the issue seemingly hasn't been reported outside of Ollama, so for now...