Issues by Ali Naeimi (2 results)
Bug: (Speculative decoding) Massive slowdown when going past draft model's ctx size (when -cd < -c)
### What happened?
When using speculative decoding in llama-server and specifying different context sizes for the target model (-c) and the draft model (-cd), with the draft context being smaller...
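A minimal sketch of an invocation that would hit the condition this report describes (-cd smaller than -c), assuming a typical llama-server speculative-decoding setup; the model paths are placeholders, not from the original report:

```bash
# Target model gets an 8192-token context; the draft model gets only 2048,
# i.e. -cd < -c, the condition described above. Once generation runs past
# the draft model's context size, the reported slowdown would kick in.
./llama-server \
  -m models/target-model.gguf \
  -md models/draft-model.gguf \
  -c 8192 \
  -cd 2048
```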
### What happened?
llama-server produces no output and hangs when the CUDA memory allocation for the draft model's KV cache fails (OOM), whereas mainline llama.cpp crashes with a proper error. On an RTX...
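A hedged sketch of a setup that could trigger this allocation failure, assuming the usual llama.cpp flags (-ngl/-ngld offload the target and draft model layers, and hence their KV caches, to the GPU); the model paths and the deliberately oversized draft context are illustrative, not taken from the report:

```bash
# Draft context set far larger than VRAM can hold, so the CUDA allocation
# for the draft model's KV cache fails. Per the report, mainline llama.cpp
# crashes with a proper error here, while the affected build hangs instead.
./llama-server \
  -m models/target-model.gguf \
  -md models/draft-model.gguf \
  -ngl 99 \
  -ngld 99 \
  -c 8192 \
  -cd 1000000
```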