Samuel Oliveira Alves

Results: 6 comments by Samuel Oliveira Alves

The latest commits successfully integrate MTP into the `llama_decode` architecture while maintaining the good output quality from before. The next major step is optimization. Please note: as of now, this...

I haven't made much progress on the optimization front over the last week, aside from the small graph reuse improvement mentioned in [PR #4](https://github.com/F1LM1/llama.cpp/pull/4). Speaking of which, @ggerganov, I would...

> I've been following this PR for quite a while, and thank you for the enormous work you have done!
>
> I believe I saw somewhere that the...

> I did some quick investigation and think I've found the culprit: the `llama_get_logits_ith()` call in the inner loop is killing performance.
>
> By fixing that,...

> > Although one may try to solve GPU and NUMA in the same way, it is not unlikely that the approach will be different.
>
> The approach described...

> > Do you think that the [ggml-org/llama.cpp#14232](https://github.com/ggml-org/llama.cpp/pull/14232) approach could help as a first step?

Sounds like good first progress, but I didn't see many users testing it, and it sounds like...