
[Bug]: FLASHINFER - chunked_prefill crashes when multiple concurrent requests happen

Open frenzybiscuit opened this issue 10 months ago • 0 comments

Your current environment

In single-user mode with a single request, chunked prefill works on FLASHINFER and I am able to hit 160k FP8 context.

When multiple concurrent requests come in, it crashes with an error saying chunked prefill is not supported with FLASHINFER.

However, without chunked prefill my 132k FP8 context drops to 15k FP8 context, making FLASHINFER useless to me.

With FLASH ATTENTION 2 I can hit over 60k FP16 context, but I cannot use FP8 because it is not supported on FLASH ATTENTION 2.

Is there any way to get FLASHINFER with chunked_prefill fixed? Or to get the quantized KV cache supported on FLASH ATTENTION 2?

Thank you!
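For reference, the two configurations described above would look roughly like this. This is a sketch only: the flag and environment-variable names are assumed from aphrodite-engine's vLLM-style CLI and may differ in this version, and `<model>` is a placeholder.

```shell
# Hypothetical launch sketch of the setups described above; flag and
# env-var names are assumed, not verified against this aphrodite version.

# FLASHINFER + FP8 KV cache + chunked prefill
# (works single-user, crashes under concurrent requests):
APHRODITE_ATTENTION_BACKEND=FLASHINFER \
aphrodite run <model> \
  --kv-cache-dtype fp8 \
  --enable-chunked-prefill \
  --max-model-len 160000

# FLASH ATTENTION 2 fallback (no FP8 KV cache support, so FP16 only,
# which limits context to roughly 60k):
APHRODITE_ATTENTION_BACKEND=FLASH_ATTN \
aphrodite run <model> \
  --max-model-len 60000
```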

Model Input Dumps

.

🐛 Describe the bug

.

frenzybiscuit · Mar 16 '25 18:03