[Bug]: FLASHINFER - chunked_prefill crashes when multiple concurrent requests arrive
Your current environment
In single-user mode with a single request, chunked prefill works with FLASHINFER and I am able to hit 160k of FP8 context.
When multiple concurrent requests come in, it crashes, saying chunked prefill is not supported with FLASHINFER.
However, without chunked prefill my FP8 context drops from 132k to 15k, making FLASHINFER useless to me.
With FLASH ATTENTION 2 I can hit over 60k of FP16 context, but I cannot use FP8 because the quantized KV cache is not supported on FLASH ATTENTION 2.
Is there any way to get chunked_prefill fixed for FLASHINFER, or to get the quantized KV cache supported on FLASH ATTENTION 2? A sketch of my launch configuration is below, and a concurrency reproduction is sketched under "Describe the bug".
Thank you!
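For reference, a minimal sketch of the configuration in question. This is illustrative only: it assumes aphrodite-engine mirrors vLLM's Python API (`LLM`, `SamplingParams`, `kv_cache_dtype`, `enable_chunked_prefill`) and honours an `APHRODITE_ATTENTION_BACKEND` environment variable for backend selection; the model name and context length are placeholders standing in for my setup.

```python
# Hedged sketch of the launch configuration described above.
# Assumptions (not confirmed by this report): aphrodite mirrors vLLM's
# Python API and reads APHRODITE_ATTENTION_BACKEND for backend selection.
import os

os.environ["APHRODITE_ATTENTION_BACKEND"] = "FLASHINFER"  # assumed env var name

from aphrodite import LLM, SamplingParams  # assumed to mirror vLLM's API

llm = LLM(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",  # placeholder model
    kv_cache_dtype="fp8",          # FP8 KV cache, as used above
    enable_chunked_prefill=True,   # fine for a single request
    max_model_len=160_000,         # the ~160k context mentioned above
)

# A single request completes without issue:
print(llm.generate(["Hello"], SamplingParams(max_tokens=16)))
```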
Model Input Dumps
.
🐛 Describe the bug
See the environment section above: with the FLASHINFER backend and chunked prefill enabled, a single request works, but the engine crashes as soon as more than one request is in flight. A hedged reproduction sketch follows.
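The sketch below fires several requests at a running server at once to trigger the crash. It assumes the server was launched with the FLASHINFER backend, FP8 KV cache, and chunked prefill as above, and that it exposes an OpenAI-compatible /v1/completions endpoint; the port and model name are placeholders.

```python
# Hedged concurrency reproduction: one request at a time succeeds, but
# several in flight crash the engine with a "chunked prefill is not
# supported with FLASHINFER"-style error.
import concurrent.futures
import requests

URL = "http://localhost:2242/v1/completions"  # placeholder host/port

def one_request(i: int) -> int:
    resp = requests.post(URL, json={
        "model": "placeholder-model",          # placeholder model name
        "prompt": f"Request {i}: tell me a story.",
        "max_tokens": 64,
    })
    return resp.status_code

# Send 8 requests concurrently; the crash appears once multiple
# prefills overlap.
with concurrent.futures.ThreadPoolExecutor(max_workers=8) as pool:
    print(list(pool.map(one_request, range(8))))
```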