Jeximo

Results: 17 comments of Jeximo

> I used Llama cpp from langchain

I see. All I can say for sure is the langchain wrapper **is not** passing the parameter as expected, and your image shows...
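A minimal way to check this, assuming the model is served by llama-cpp-python underneath (the model path and the `stop` value below are only illustrative; the thread doesn't name the exact parameter): pass the same setting to the LangChain wrapper and to `llama_cpp.Llama` directly, and compare the outputs.

```python
# Hedged sketch: compare the LangChain wrapper against llama-cpp-python directly.
# The path and the `stop` value are hypothetical stand-ins for the thread's parameter.
from langchain_community.llms import LlamaCpp
from llama_cpp import Llama

wrapped = LlamaCpp(model_path="./model.gguf", stop=["<|im_end|>"])
direct = Llama(model_path="./model.gguf")

print(wrapped.invoke("Hello"))
print(direct("Hello", stop=["<|im_end|>"], max_tokens=32)["choices"][0]["text"])
```

If the direct call respects the parameter while the wrapped call does not, the wrapper is dropping it.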

> EOS token = 151645 '<|im_end|>'

`<|im_end|>` is the EOS token for `-cml` (ChatML) models like Qwen.
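If in doubt, the loaded GGUF can be asked directly which token it treats as EOS; a minimal sketch with llama-cpp-python (the path is hypothetical):

```python
# Hedged sketch: confirm the EOS id the GGUF metadata reports.
from llama_cpp import Llama

llm = Llama(model_path="./qwen.gguf", vocab_only=True)  # hypothetical path
eos_id = llm.token_eos()
print(eos_id)                    # 151645 for Qwen's ChatML vocab
print(llm.detokenize([eos_id]))  # token text; control tokens may render empty on some builds
```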

> I did use the Qwen model. What can I do?

@ChaoII It worked as intended.

> offloaded 0/41 layers to GPU

Don't forget to add the `-ngl 99` parameter...
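For anyone using llama-cpp-python rather than the CLI, this is a sketch of the equivalent setting (the path is hypothetical):

```python
# Hedged sketch: Python-side equivalent of the CLI's `-ngl 99`.
from llama_cpp import Llama

llm = Llama(
    model_path="./qwen.gguf",  # hypothetical path
    n_gpu_layers=99,           # offload up to all layers; needs a GPU-enabled build
    verbose=True,              # load log should then say "offloaded 41/41 layers to GPU"
)
```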

> i don't understand how they works because sometimes the answer is very wide

Hi. BOS means beginning of sequence, and EOS means end of sequence. Usually they're **special** tokens...
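A small sketch of what that means in practice, assuming llama-cpp-python (the path is hypothetical): the tokenizer prepends BOS, and generation halts when the model emits EOS.

```python
# Hedged sketch: inspect the BOS/EOS ids of a loaded GGUF.
from llama_cpp import Llama

llm = Llama(model_path="./model.gguf", vocab_only=True)  # hypothetical path
tokens = llm.tokenize(b"Hello world", add_bos=True)
print(tokens[0] == llm.token_bos())  # True: BOS was prepended
print(llm.token_eos())               # generation stops when this id is sampled
```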

It appears your model does not list `` or `` as a special token. [There's logic in llama.cpp if the token is not special](https://github.com/ggerganov/llama.cpp/issues/7049#issuecomment-2097843329). If you're able, then maybe try...
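One way to cross-check, assuming the GGUF was converted from a Hugging Face model (the model id below is a hypothetical placeholder): look at which tokens the source tokenizer flags as special, since the converter typically derives the GGUF token attributes from that metadata.

```python
# Hedged sketch: list the tokens the source (Hugging Face) tokenizer marks
# as special. Model id is a hypothetical placeholder.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("some-org/some-model")
print(tok.special_tokens_map)                    # e.g. {'eos_token': '</s>', ...}
print(tok.convert_tokens_to_ids(tok.eos_token))  # the id llama.cpp should treat as special
```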

> overhead what brings the usage for sure above the available VRAM?

> ... model size = 66,86 GiB ... allocating 23721,00 MiB

66.86 + 2.37 (_kv cache_) = 69.23 GiB, so yes.
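Worked out with the values from the log (a sketch; the KV-cache size scales with the chosen context length):

```python
# The comment's arithmetic, in GiB, before compute buffers and other overhead.
model_size = 66.86  # "model size = 66,86 GiB" from the load log
kv_cache = 2.37     # KV cache at the chosen context length
print(model_size + kv_cache)  # 69.23 GiB
```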

> Or do I miss anything else which makes 3x24 GB impossible to manage this model fully?

It may be possible *if you can spare a bit of system space*, ...
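A sketch of that partial-offload setup, assuming llama-cpp-python (the path, layer count, and split are illustrative):

```python
# Hedged sketch: keep a few layers in system RAM and split the offloaded
# layers across three GPUs. All values are illustrative.
from llama_cpp import Llama

llm = Llama(
    model_path="./model.gguf",     # hypothetical path
    n_gpu_layers=38,               # keep a few of the model's layers in system RAM
    tensor_split=[1.0, 1.0, 1.0],  # spread offloaded weights evenly over 3 GPUs
)
```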