Jeximo

Results: 17 comments of Jeximo

> I used Llama cpp from langchain

I see. All I can say for sure is the langchain wrapper **is not** passing the parameter as expected, and your image shows...
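A minimal way to check this, assuming the model is served by llama-cpp-python underneath (the model path and the `stop` value below are only illustrative; the thread doesn't name the exact parameter): pass the same setting to the LangChain wrapper and to `llama_cpp.Llama` directly, and compare the outputs.

```python
# Hedged sketch: compare the LangChain wrapper against llama-cpp-python directly.
# The path and the `stop` value are hypothetical stand-ins for the thread's parameter.
from langchain_community.llms import LlamaCpp
from llama_cpp import Llama

wrapped = LlamaCpp(model_path="./model.gguf", stop=["<|im_end|>"])
direct = Llama(model_path="./model.gguf")

print(wrapped.invoke("Hello"))
print(direct("Hello", stop=["<|im_end|>"], max_tokens=32)["choices"][0]["text"])
```

If the direct call respects the parameter while the wrapped call does not, the wrapper is dropping it.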

> EOS token = 151645 '<|im_end|>'

`<|im_end|>` is the EOS token for `-cml` (ChatML) models like Qwen.
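If in doubt, the loaded GGUF can be asked directly which token it treats as EOS; a minimal sketch with llama-cpp-python (the path is hypothetical):

```python
# Hedged sketch: confirm the EOS id the GGUF metadata reports.
from llama_cpp import Llama

llm = Llama(model_path="./qwen.gguf", vocab_only=True)  # hypothetical path
eos_id = llm.token_eos()
print(eos_id)                    # 151645 for Qwen's ChatML vocab
print(llm.detokenize([eos_id]))  # token text; control tokens may render empty on some builds
```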

> I did use the Qwen model. What can I do?

@ChaoII It worked as intended.

> offloaded 0/41 layers to GPU

Don't forget to add the `-ngl 99` parameter...
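For anyone using llama-cpp-python rather than the CLI, this is a sketch of the equivalent setting (the path is hypothetical):

```python
# Hedged sketch: Python-side equivalent of the CLI's `-ngl 99`.
from llama_cpp import Llama

llm = Llama(
    model_path="./qwen.gguf",  # hypothetical path
    n_gpu_layers=99,           # offload up to all layers; needs a GPU-enabled build
    verbose=True,              # load log should then say "offloaded 41/41 layers to GPU"
)
```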

> i don't understand how they works because sometimes the answer is very wide

Hi. BOS means beginning of sequence, and EOS means end of sequence. Usually they're **special** tokens...
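A small sketch of what that means in practice, assuming llama-cpp-python (the path is hypothetical): the tokenizer prepends BOS, and generation halts when the model emits EOS.

```python
# Hedged sketch: inspect the BOS/EOS ids of a loaded GGUF.
from llama_cpp import Llama

llm = Llama(model_path="./model.gguf", vocab_only=True)  # hypothetical path
tokens = llm.tokenize(b"Hello world", add_bos=True)
print(tokens[0] == llm.token_bos())  # True: BOS was prepended
print(llm.token_eos())               # generation stops when this id is sampled
```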

It appears your model does not list `` or `` as a special token. [There's logic in llama.cpp if the token is not special](https://github.com/ggerganov/llama.cpp/issues/7049#issuecomment-2097843329). If you're able, then maybe try...
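One way to cross-check, assuming the GGUF was converted from a Hugging Face model (the model id below is a hypothetical placeholder): look at which tokens the source tokenizer flags as special, since the converter typically derives the GGUF token attributes from that metadata.

```python
# Hedged sketch: list the tokens the source (Hugging Face) tokenizer marks
# as special. Model id is a hypothetical placeholder.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("some-org/some-model")
print(tok.special_tokens_map)                    # e.g. {'eos_token': '</s>', ...}
print(tok.convert_tokens_to_ids(tok.eos_token))  # the id llama.cpp should treat as special
```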

> overhead what brings the usage for sure above the available VRAM?

> ... model size = 66,86 GiB ... allocating 23721,00 MiB

66.86 + 2.37 (_kv cache_) = 69.23 GiB, so yes.
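Worked out with the values from the log (a sketch; the KV-cache size scales with the chosen context length):

```python
# The comment's arithmetic, in GiB, before compute buffers and other overhead.
model_size = 66.86  # "model size = 66,86 GiB" from the load log
kv_cache = 2.37     # KV cache at the chosen context length
print(model_size + kv_cache)  # 69.23 GiB
```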

> Or do I miss anything else which makes 3x24 GB impossible to manage this model fully?

It may be possible *if you can spare a bit of system space*, ...
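A sketch of that partial-offload setup, assuming llama-cpp-python (the path, layer count, and split are illustrative):

```python
# Hedged sketch: keep a few layers in system RAM and split the offloaded
# layers across three GPUs. All values are illustrative.
from llama_cpp import Llama

llm = Llama(
    model_path="./model.gguf",     # hypothetical path
    n_gpu_layers=38,               # keep a few of the model's layers in system RAM
    tensor_split=[1.0, 1.0, 1.0],  # spread offloaded weights evenly over 3 GPUs
)
```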