Ricardo Lu
Thanks for your response. So can I assume vLLM will serve as a backend in NVIDIA Triton? I'm wondering whether the serving part would overlap with NVIDIA Triton's capabilities?
@DequanZhu Hi, have you solved this issue? I'm also running into it, and I can't even successfully call it once...
@nnshah1 any progress on this issue?
After further investigation, this is because the `triton_client` is not in the same event loop as the Flask app. Make sure both the Triton aio client and the Flask app are in the same...
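Roughly what I mean, as a minimal sketch (assuming Flask's async views via `pip install "flask[async]"` and the gRPC aio client from `tritonclient[grpc]`; the model name `my_model`, the tensor names `INPUT0`/`OUTPUT0`, and the port are just placeholders):

```python
# Sketch: create the Triton aio client inside the event loop that is actually
# running the request, instead of once at import time in a different loop.
import asyncio

import numpy as np
import tritonclient.grpc as grpcclient
import tritonclient.grpc.aio as aio_grpcclient
from flask import Flask, jsonify

app = Flask(__name__)
_clients = {}  # one aio client per event loop, so client and caller never mix loops


def _client_for_current_loop():
    loop = asyncio.get_running_loop()
    if loop not in _clients:
        _clients[loop] = aio_grpcclient.InferenceServerClient(url="localhost:8001")
    return _clients[loop]


@app.route("/infer")
async def infer():
    client = _client_for_current_loop()
    data = np.zeros((1, 8), dtype=np.float32)  # placeholder input tensor
    inp = grpcclient.InferInput("INPUT0", list(data.shape), "FP32")
    inp.set_data_from_numpy(data)
    result = await client.infer(model_name="my_model", inputs=[inp])
    return jsonify(output=result.as_numpy("OUTPUT0").tolist())
```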
I tried this on the Docker image `nvidia/cuda:11.8.0-cudnn8-devel-ubuntu20.04`, and it works fine. Or maybe you can run `pip install vllm` before running `pip install -e .`.
I think multi-model serving is important for some business logic, like ensemble models and LangChain applications. Do you have any ideas I can reference? Then I will try to implement it...
I noticed that FastChat already has a [vllm_worker](https://github.com/lm-sys/FastChat/blob/main/fastchat/serve/vllm_worker.py), so I think we could just use FastChat as the frontend, which provides OpenAI-compatible APIs, with vLLM as the backend.
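For example, once FastChat's controller, `vllm_worker`, and OpenAI API server are up, any OpenAI client can just be pointed at it. A rough sketch (assuming the pre-1.0 `openai` Python package; the base URL and the model name `vicuna-7b-v1.3` are placeholders for illustration):

```python
# Sketch: call the FastChat OpenAI-compatible server that fronts the vllm_worker.
import openai

openai.api_base = "http://localhost:8000/v1"  # assumed FastChat API server address
openai.api_key = "EMPTY"                      # placeholder key

resp = openai.ChatCompletion.create(
    model="vicuna-7b-v1.3",
    messages=[{"role": "user", "content": "Hello, who are you?"}],
)
print(resp.choices[0].message.content)
```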
I didn't manage to run vllm_worker successfully yesterday, and I found it uses an older version of vLLM, so I implemented a ChatCompletion API in #330.
> The vLLM integration for OpenAI API server has been fixed by [lm-sys/FastChat#1835](https://github.com/lm-sys/FastChat/pull/1835). Could you test it? It should be compatible with (completion, chat-completion) x (streaming, non-streaming)
>
> With...
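To exercise the streaming axis mentioned above, something like this should work (same assumptions as the earlier snippet, just with `stream=True`):

```python
# Sketch: streaming chat-completion check; with stream=True the pre-1.0 openai
# client yields incremental chunks whose deltas carry the new tokens.
import openai

openai.api_base = "http://localhost:8000/v1"  # assumed FastChat API server address
openai.api_key = "EMPTY"                      # placeholder key

for chunk in openai.ChatCompletion.create(
    model="vicuna-7b-v1.3",
    messages=[{"role": "user", "content": "Count from 1 to 5."}],
    stream=True,
):
    delta = chunk.choices[0].delta
    print(delta.get("content", ""), end="", flush=True)
print()
```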
A lot of models still aren't merged into FastChat yet, though some of them may share the same Conversation template as earlier ones. I don't think this is a flexible change...