Ricardo Lu
Thanks for your response. So can I assume vLLM will serve as a backend in NVIDIA Triton? I'm wondering whether the serving part would overlap with NVIDIA Triton's capabilities?
@DequanZhu Hi, have you solved this issue? I'm also running into it, and I can't even successfully call it once...
@nnshah1 any progress on this issue?
After further investigation, this is because the `triton_client` is not in the same event loop as the Flask app. Make sure both the Triton aio client and the Flask app are in the same...
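Roughly what I mean, as a minimal sketch (assuming Flask's async views via `pip install "flask[async]"` and the gRPC aio client from `tritonclient[grpc]`; the model name `my_model`, the tensor names `INPUT0`/`OUTPUT0`, and the port are just placeholders):

```python
# Sketch: create the Triton aio client inside the event loop that is actually
# running the request, instead of once at import time in a different loop.
import asyncio

import numpy as np
import tritonclient.grpc as grpcclient
import tritonclient.grpc.aio as aio_grpcclient
from flask import Flask, jsonify

app = Flask(__name__)
_clients = {}  # one aio client per event loop, so client and caller never mix loops


def _client_for_current_loop():
    loop = asyncio.get_running_loop()
    if loop not in _clients:
        _clients[loop] = aio_grpcclient.InferenceServerClient(url="localhost:8001")
    return _clients[loop]


@app.route("/infer")
async def infer():
    client = _client_for_current_loop()
    data = np.zeros((1, 8), dtype=np.float32)  # placeholder input tensor
    inp = grpcclient.InferInput("INPUT0", list(data.shape), "FP32")
    inp.set_data_from_numpy(data)
    result = await client.infer(model_name="my_model", inputs=[inp])
    return jsonify(output=result.as_numpy("OUTPUT0").tolist())
```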
I tried this on the Docker image `nvidia/cuda:11.8.0-cudnn8-devel-ubuntu20.04`, and it works fine. Or maybe you can run `pip install vllm` before running `pip install -e .`.
I think multi-model serving is important for some business logic, like ensemble models and LangChain applications. Do you have any ideas I can reference? Then I will try to implement it...
I noticed that FastChat already has a [vllm_worker](https://github.com/lm-sys/FastChat/blob/main/fastchat/serve/vllm_worker.py), so I think we could just use FastChat as the frontend, which provides OpenAI-compatible APIs, with vLLM as the backend.
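For example, once FastChat's controller, `vllm_worker`, and OpenAI API server are up, any OpenAI client can just be pointed at it. A rough sketch (assuming the pre-1.0 `openai` Python package; the base URL and the model name `vicuna-7b-v1.3` are placeholders for illustration):

```python
# Sketch: call the FastChat OpenAI-compatible server that fronts the vllm_worker.
import openai

openai.api_base = "http://localhost:8000/v1"  # assumed FastChat API server address
openai.api_key = "EMPTY"                      # placeholder key

resp = openai.ChatCompletion.create(
    model="vicuna-7b-v1.3",
    messages=[{"role": "user", "content": "Hello, who are you?"}],
)
print(resp.choices[0].message.content)
```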
I didn't manage to run vllm_worker successfully yesterday, and I found it uses an older version of vLLM, so I implemented a ChatCompletion API in #330.
> The vLLM integration for OpenAI API server has been fixed by [lm-sys/FastChat#1835](https://github.com/lm-sys/FastChat/pull/1835). Could you test it? It should be compatible with (completion, chat-completion) x (streaming, non-streaming)
>
> With...
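To exercise the streaming axis mentioned above, something like this should work (same assumptions as the earlier snippet, just with `stream=True`):

```python
# Sketch: streaming chat-completion check; with stream=True the pre-1.0 openai
# client yields incremental chunks whose deltas carry the new tokens.
import openai

openai.api_base = "http://localhost:8000/v1"  # assumed FastChat API server address
openai.api_key = "EMPTY"                      # placeholder key

for chunk in openai.ChatCompletion.create(
    model="vicuna-7b-v1.3",
    messages=[{"role": "user", "content": "Count from 1 to 5."}],
    stream=True,
):
    delta = chunk.choices[0].delta
    print(delta.get("content", ""), end="", flush=True)
print()
```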
A lot of models still aren't merged into FastChat yet, though some of them may share the same Conversation template as earlier ones. I don't think this is a flexible change...