Leverage vLLM batching to lower hardware requirements and improve speed
The current inference system launches 8 separate vLLM instances (one per GPU) but underutilizes vLLM's native batching. Each query is assigned to a single instance in round-robin fashion, so each instance effectively processes one request at a time. This approach:
- Wastes computational resources, since vLLM can handle many concurrent requests internally
- Makes the system unusable for users with limited GPUs (e.g., single GPU setups)
- Creates unnecessary overhead from running multiple server processes
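To make the cost concrete, here is a toy throughput model (an illustration with made-up numbers, not a benchmark): assume each instance serves one request at a time under round-robin dispatch, while continuous batching lets a single instance serve many requests in roughly the time of one forward pass.

```python
import math

def round_robin_latency(num_requests: int, num_instances: int,
                        seconds_per_request: float) -> float:
    """Total wall time when each instance handles one request at a time."""
    waves = math.ceil(num_requests / num_instances)
    return waves * seconds_per_request

def batched_latency(num_requests: int, max_batch: int,
                    seconds_per_batch: float) -> float:
    """Total wall time for one instance with internal batching.

    Optimistically assumes a full batch costs about as much as one request;
    in practice batching is sublinear, not free.
    """
    waves = math.ceil(num_requests / max_batch)
    return waves * seconds_per_batch

# 64 queries at 2 s per forward pass (illustrative numbers):
# 8 single-slot instances need 8 waves; one instance batching 32 needs 2.
print(round_robin_latency(64, 8, 2.0))   # 16.0
print(batched_latency(64, 32, 2.0))      # 4.0
```

Even this crude model shows why one batching instance can beat eight single-slot ones; real numbers would of course need the benchmarking mentioned below.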
There are multiple possible solutions: a single vLLM instance with tensor parallelism, or a hybrid setup with a few instances that still leverage batching. This needs more benchmarking, but I believe we could meaningfully optimize for users with smaller hardware configurations, and for speed in general.
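The single-instance direction could look roughly like the sketch below (assumptions: vLLM is installed, the Hugging Face model id `Alibaba-NLP/Tongyi-DeepResearch-30B-A3B` is correct, and the prompts are placeholders): one engine sharded across all GPUs via tensor parallelism, with the whole query list handed to `generate()` so vLLM's scheduler batches internally instead of an external round-robin dispatcher. This is hardware-dependent and untested here, not a drop-in patch.

```python
# Sketch: one vLLM engine sharded across the GPUs, batching internally,
# instead of 8 independent single-slot processes.
from vllm import LLM, SamplingParams

llm = LLM(
    model="Alibaba-NLP/Tongyi-DeepResearch-30B-A3B",
    tensor_parallel_size=8,  # set to 1 on a single-GPU machine (memory permitting)
)
params = SamplingParams(temperature=0.6, max_tokens=1024)

queries = [f"Research question {i}" for i in range(64)]  # placeholder prompts

# generate() accepts the whole list at once; vLLM's continuous batching
# packs the requests onto the GPUs, so no per-instance dispatch is needed.
outputs = llm.generate(queries, params)
```

The hybrid variant would be the same idea with, say, two such engines (`tensor_parallel_size=4` each) behind a simple load balancer, which is where the benchmarking comes in.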
Does vLLM support deploying the Tongyi-DeepResearch-30B-A3B model?
On Mon, Sep 29, 2025, PeterXiaTian left a comment on Alibaba-NLP/DeepResearch#118 (https://github.com/Alibaba-NLP/DeepResearch/issues/118#issuecomment-3349801443), with a screenshot attached (https://github.com/user-attachments/assets/2a1270c2-ff49-44d1-a63b-b5557473e971): Is vLLM deployment still not supported?