leikareipa

Results: 15 comments by leikareipa

I had a play with this on the Nvidia Playground a few days ago, but the results were a bit questionable; it seemed like even they were having trouble setting up their own deployment.

With it now supported in Ollama 0.1.28, I'm seeing similarly questionable generation to what I got on the Nvidia Playground, if anything worse. For example, `$ ./ollama run starcoder2:15b-q4_K_M "Write a JavaScript...

Using better completion-style prompts gave better results, though the prompts really have to be massaged sometimes or the output is way off. The model also never stops when it should,...
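To make the "completion-style" idea concrete: a base code model like starcoder2 continues text rather than following instructions, so the prompt works better framed as code to be finished. A minimal sketch (the helper below is illustrative, not part of any Ollama API):

```python
# Sketch: build a completion-style prompt for a base code model.
# Instead of "Write a JavaScript function that...", give it a comment
# plus the opening of the function and let it continue from there.

def completion_prompt(comment: str, signature: str) -> str:
    """Frame the request as code to be continued, not as an instruction."""
    return f"// {comment}\n{signature}"

prompt = completion_prompt(
    "Returns the sum of two numbers.",
    "function sum(a, b) {",
)
print(prompt)
```

Because the model just continues the text, a stop sequence like `}` usually has to be set by the caller, which fits the observation above that it never stops on its own.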

This is with 12 GB of VRAM total, so all versions max it out; the bigger the model, the more of it runs CPU-side. Should be the same context...
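As a rough illustration of why larger quants spill to the CPU (hypothetical sizes, and not Ollama's actual offload calculation, which works per-layer and also reserves VRAM for the KV cache and compute buffers):

```python
# Rough back-of-envelope: with VRAM fixed at 12 GB, the fraction of the
# model weights that ends up in system RAM grows with model size.
# Sizes below are illustrative placeholders, not measured quant sizes.
VRAM_GB = 12.0

def cpu_fraction(model_size_gb: float, vram_gb: float = VRAM_GB) -> float:
    """Return the fraction of the model that does not fit in VRAM."""
    spill = max(0.0, model_size_gb - vram_gb)
    return spill / model_size_gb

for size in (9.0, 16.0, 32.0):
    print(f"{size:5.1f} GB model -> {cpu_fraction(size):.0%} on CPU")
```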

No flash attention and no KV cache quantization; all settings other than context length should be at their defaults. I ran my five-test bench on Ollama's FP16 version and it got 50%, same...
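For context, a score like that 50% is just the average over per-test results. A minimal sketch of how such a bench score could be computed (the test values below are hypothetical placeholders, not the actual five tests):

```python
# Minimal pass/fail bench scorer. Each entry is a per-test score in
# [0.0, 1.0]; the bench result is the average as a percentage.

def bench_score(per_test: list[float]) -> float:
    """Average per-test score, expressed as a percentage."""
    return 100.0 * sum(per_test) / len(per_test)

# Hypothetical example: two passes, one half-credit, two failures.
print(bench_score([1.0, 1.0, 0.5, 0.0, 0.0]))  # -> 50.0
```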

Thanks for the idea; I'll see if someone beats me to testing it, since it would be even slower. I assume you could just disable GPU compute altogether with `CUDA_VISIBLE_DEVICES=-1`?...
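If that assumption holds, a CPU-only run might look like this (untested command sketch; the variable would need to be set on the process that starts the Ollama server, since that's what allocates the GPU):

```shell
# Untested sketch: hide all CUDA devices from the Ollama server so that
# inference falls back to the CPU. -1 is not a valid device ID, so CUDA
# should see no usable GPUs.
CUDA_VISIBLE_DEVICES=-1 ollama serve &

# Then run the model as usual; generation should now be CPU-only.
ollama run starcoder2:15b-q4_K_M "..."
```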

Did another test with Q4 vs Q8, and now also vs Q8 without GPU. The prompt (backticks escaped for formatting reasons here but not in the original): ``` \`\`\`js //...

Q4, Q8, and FP16 were pulled via Ollama, but the weights on Ollama were updated about a day after release; the Q4 I have is pre-update. I think somebody found Qwen...

If the post-update weights perform worse than the pre-update weights, then wouldn't you say it's a problem all the same? It would be interesting to see what results others are...

Not sure it's useful to get meta about who's to blame; this is 100% usage within Ollama, and the issue seemingly hasn't been reported outside of Ollama, so for now...