steampunque

Results: 25 comments by steampunque

> > I am guessing that RPC mode currently does not support mixed CPU and GPU offload, i.e. GPU offload only, so if your model doesn't fit in the memory...

> > I am guessing that RPC mode currently does not support mixed CPU and GPU offload > > The problem is that we don't report available memory on CPU...

> > When the rpc servers crash, they cannot be restarted without doing a hard restart of the RPC subsystem (restart rpcbind, etc.) > > I believe I fixed this...

FYI update: with the latest release (b2910), the aborted response with 3 RPC servers and Mixtral in full offload is no longer happening (perhaps the KV-related patch was the issue). However, I did more investigation...

> After further testing yesterday and today, I was able to confirm my hypothesis that a model can randomly generate two different sequences of tokens for the same text. @vnicolici...

> > [@vnicolici](https://github.com/vnicolici) This is a known issue. I discussed the prompt cache non determinism issue here : > > [@steampunque](https://github.com/steampunque) That's not what I meant. When I said "the...

Yes, I agree general tokenizers are non-invertible by nature. It's possible to imagine a tokenizer with only single-character tokens, which would be fully one-to-one (i.e. invertible). I am...
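As a minimal sketch of that single-character case (illustrative names only, not the llama.cpp tokenizer API): with one token per character, detokenizing and re-tokenizing always reproduces the exact same token sequence, which is precisely the invertibility that merge-based tokenizers lack.

```python
# Hypothetical single-character tokenizer: trivially invertible,
# because each character maps to exactly one token id and back.

def char_tokenize(text, vocab):
    return [vocab[c] for c in text]

def char_detokenize(ids, inv_vocab):
    return "".join(inv_vocab[i] for i in ids)

vocab = {c: i for i, c in enumerate("abcdef ")}
inv = {i: c for c, i in vocab.items()}

ids = char_tokenize("fade cab", vocab)
assert char_detokenize(ids, inv) == "fade cab"
# The round trip is exact at the token level too: re-tokenizing the
# detokenized text gives back the identical id sequence.
assert char_tokenize(char_detokenize(ids, inv), vocab) == ids
```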

The problem is that the correct tokenization of the prompt isn't known until generation is finished, due to the context issue. Imagine a model generating A B C D E F in...
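The mismatch can be sketched with a toy greedy longest-match tokenizer (illustrative vocab, not the real llama.cpp tokenizer): tokens the model emitted one at a time can merge into a different token when the concatenated text is re-tokenized.

```python
# Hypothetical merge-based tokenizer: "AB" is preferred over "A" + "B",
# so re-tokenizing generated text need not reproduce the generated tokens.

VOCAB = ["AB", "A", "B", "C"]  # longest/preferred matches first

def tokenize(text):
    out = []
    i = 0
    while i < len(text):
        for tok in VOCAB:  # greedy: take the first (longest) match
            if text.startswith(tok, i):
                out.append(tok)
                i += len(tok)
                break
        else:
            raise ValueError(f"untokenizable at position {i}")
    return out

# Suppose the model generated the tokens "A" then "B", one per step.
generated = ["A", "B"]
text = "".join(generated)

# Re-tokenizing the same text yields the merged token instead,
# so the cached token sequence and the re-tokenized prompt diverge.
print(tokenize(text))  # -> ['AB'], not ['A', 'B']
```

This is why the tokenization depends on context that only exists once generation has finished: until you see the full text, you cannot know which emitted tokens will merge on re-tokenization.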

> > I believe the only way to robustly handle this issue is re-processing the input prompt. It's only a 10% hit on processing since it gets done with prompt...

> OK, in that case I fail to see how what you propose will fix this issue. The issue is that prompt reprocessing causes performance issues, as described earlier. I...