Leverage vLLM batching to lower hardware requirements and improve speed
The current inference system launches 8 separate vLLM instances (one per GPU) but underutilizes vLLM's native batching. Each query is assigned to a single instance in round-robin fashion, so each instance effectively processes one request at a time. This approach:
- Wastes computational resources, since vLLM can handle many concurrent requests internally
- Makes the system unusable for users with limited GPUs (e.g., single GPU setups)
- Creates unnecessary overhead from running multiple server processes
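To make the cost concrete, here is a toy throughput model (an illustration with made-up numbers, not a benchmark): assume each instance serves one request at a time under round-robin dispatch, while continuous batching lets a single instance serve many requests in roughly the time of one forward pass.

```python
import math

def round_robin_latency(num_requests: int, num_instances: int,
                        seconds_per_request: float) -> float:
    """Total wall time when each instance handles one request at a time."""
    waves = math.ceil(num_requests / num_instances)
    return waves * seconds_per_request

def batched_latency(num_requests: int, max_batch: int,
                    seconds_per_batch: float) -> float:
    """Total wall time for one instance with internal batching.

    Optimistically assumes a full batch costs about as much as one request;
    in practice batching is sublinear, not free.
    """
    waves = math.ceil(num_requests / max_batch)
    return waves * seconds_per_batch

# 64 queries at 2 s per forward pass (illustrative numbers):
# 8 single-slot instances need 8 waves; one instance batching 32 needs 2.
print(round_robin_latency(64, 8, 2.0))   # 16.0
print(batched_latency(64, 32, 2.0))      # 4.0
```

Even this crude model shows why one batching instance can beat eight single-slot ones; real numbers would of course need the benchmarking mentioned below.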
There are multiple possible solutions: a single vLLM instance with tensor parallelism, or a hybrid setup with a few instances that still leverage batching. This needs more benchmarking, but I believe we could meaningfully optimize for users with smaller hardware configurations, and for speed in general.
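The single-instance direction could look roughly like the sketch below (assumptions: vLLM is installed, the Hugging Face model id `Alibaba-NLP/Tongyi-DeepResearch-30B-A3B` is correct, and the prompts are placeholders): one engine sharded across all GPUs via tensor parallelism, with the whole query list handed to `generate()` so vLLM's scheduler batches internally instead of an external round-robin dispatcher. This is hardware-dependent and untested here, not a drop-in patch.

```python
# Sketch: one vLLM engine sharded across the GPUs, batching internally,
# instead of 8 independent single-slot processes.
from vllm import LLM, SamplingParams

llm = LLM(
    model="Alibaba-NLP/Tongyi-DeepResearch-30B-A3B",
    tensor_parallel_size=8,  # set to 1 on a single-GPU machine (memory permitting)
)
params = SamplingParams(temperature=0.6, max_tokens=1024)

queries = [f"Research question {i}" for i in range(64)]  # placeholder prompts

# generate() accepts the whole list at once; vLLM's continuous batching
# packs the requests onto the GPUs, so no per-instance dispatch is needed.
outputs = llm.generate(queries, params)
```

The hybrid variant would be the same idea with, say, two such engines (`tensor_parallel_size=4` each) behind a simple load balancer, which is where the benchmarking comes in.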
Does vLLM support deploying the Tongyi-DeepResearch-30B-A3B model?
On Mon, Sep 29, 2025, PeterXiaTian left a comment on Alibaba-NLP/DeepResearch#118 (https://github.com/Alibaba-NLP/DeepResearch/issues/118#issuecomment-3349801443), with a screenshot attached (https://github.com/user-attachments/assets/2a1270c2-ff49-44d1-a63b-b5557473e971): Is vLLM deployment still not supported?