feat: align version with vllm
Gemma-2 no longer requires flashinfer; in fact, the newest version of vLLM has a bug in its flashinfer usage that makes the LLM return wrong tokens.
This pull request makes it possible to use the newest vLLM build with gemma-2 models in serverless mode.
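For context, a minimal sketch of steering vLLM away from the buggy flashinfer path, assuming vLLM's documented `VLLM_ATTENTION_BACKEND` override; the model name is illustrative:

```python
import os

# Must be set before vLLM picks an attention backend; "FLASH_ATTN"
# avoids the flashinfer path that currently corrupts gemma-2 outputs.
os.environ["VLLM_ATTENTION_BACKEND"] = "FLASH_ATTN"

from vllm import LLM, SamplingParams

# Illustrative model id; any gemma-2 checkpoint exhibits the issue.
llm = LLM(model="google/gemma-2-9b-it")
outputs = llm.generate(["Hello"], SamplingParams(max_tokens=16))
print(outputs[0].outputs[0].text)
```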
@wwydmanski did you run any tests? Also, can you add any ref: issue/bug/fix?
@pandyamarut yes, I've deployed both versions, with and without the fix, on RunPod Serverless. The original crashed due to a kwargs incompatibility, and after fixing that it gave wrong results due to the flashinfer bug. The fully fixed version (this PR) is currently deployed on my dev setup and works well.
Thank you @wwydmanski. Do you mind sharing the reproduction steps, just what ENVs you are passing for both? That will make it easy for me to test and get this merged.
Thanks again for the PR, @wwydmanski.