Issues by Ali Naeimi (2 results)
Bug: (Speculative decoding) Massive slowdown when going past draft model's ctx size (when -cd < -c)
### What happened?
When using speculative decoding in llama-server and specifying different context sizes for the target model (-c) and the draft model (-cd), with the draft context being smaller...
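A minimal sketch of an invocation that would hit the condition this report describes (-cd smaller than -c), assuming a typical llama-server speculative-decoding setup; the model paths are placeholders, not from the original report:

```bash
# Target model gets an 8192-token context; the draft model gets only 2048,
# i.e. -cd < -c, the condition described above. Once generation runs past
# the draft model's context size, the reported slowdown would kick in.
./llama-server \
  -m models/target-model.gguf \
  -md models/draft-model.gguf \
  -c 8192 \
  -cd 2048
```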
### What happened?
llama-server produces no output and hangs when the CUDA memory allocation for the draft model's KV cache fails (OOM), whereas mainline llama.cpp crashes with a proper error. On an RTX...
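A hedged sketch of a setup that could trigger this allocation failure, assuming the usual llama.cpp flags (-ngl/-ngld offload the target and draft model layers, and hence their KV caches, to the GPU); the model paths and the deliberately oversized draft context are illustrative, not taken from the report:

```bash
# Draft context set far larger than VRAM can hold, so the CUDA allocation
# for the draft model's KV cache fails. Per the report, mainline llama.cpp
# crashes with a proper error here, while the affected build hangs instead.
./llama-server \
  -m models/target-model.gguf \
  -md models/draft-model.gguf \
  -ngl 99 \
  -ngld 99 \
  -c 8192 \
  -cd 1000000
```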