feat: align version with vllm
Gemma-2 no longer requires flashinfer; in fact, the newest version of vLLM has a bug in its flashinfer usage that makes the LLM return wrong tokens.
This pull request makes it possible to use the newest vLLM build with gemma-2 models in serverless mode.
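For context, a minimal sketch of steering vLLM away from the buggy flashinfer path, assuming vLLM's documented `VLLM_ATTENTION_BACKEND` override; the model name is illustrative:

```python
import os

# Must be set before vLLM picks an attention backend; "FLASH_ATTN"
# avoids the flashinfer path that currently corrupts gemma-2 outputs.
os.environ["VLLM_ATTENTION_BACKEND"] = "FLASH_ATTN"

from vllm import LLM, SamplingParams

# Illustrative model id; any gemma-2 checkpoint exhibits the issue.
llm = LLM(model="google/gemma-2-9b-it")
outputs = llm.generate(["Hello"], SamplingParams(max_tokens=16))
print(outputs[0].outputs[0].text)
```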
@wwydmanski did you run any tests? Also, can you add any ref: issue/bug/fix?
@pandyamarut yes, I've deployed both versions, with and without the fix, on RunPod Serverless. The original crashed due to a kwargs incompatibility, and after fixing that it gave wrong results due to the flashinfer bug. The fully fixed version (this PR) is currently deployed on my dev setup and works well.
Thank you @wwydmanski. Do you mind sharing the reproduction steps, just what ENVs you are passing for both? That will make it easy for me to test and get this merged.
Thanks again for the PR, @wwydmanski.