
Hardcoded number of GPUs in run_react_infer.sh

tobrun opened this issue 5 months ago · 3 comments

This block of code forces a setup with 8 GPUs, where each GPU needs enough VRAM to host a full instance of the model:

CUDA_VISIBLE_DEVICES=0 vllm serve $MODEL_PATH --host 0.0.0.0 --port 6001 --disable-log-requests &
CUDA_VISIBLE_DEVICES=1 vllm serve $MODEL_PATH --host 0.0.0.0 --port 6002 --disable-log-requests &
CUDA_VISIBLE_DEVICES=2 vllm serve $MODEL_PATH --host 0.0.0.0 --port 6003 --disable-log-requests &
CUDA_VISIBLE_DEVICES=3 vllm serve $MODEL_PATH --host 0.0.0.0 --port 6004 --disable-log-requests &
CUDA_VISIBLE_DEVICES=4 vllm serve $MODEL_PATH --host 0.0.0.0 --port 6005 --disable-log-requests &
CUDA_VISIBLE_DEVICES=5 vllm serve $MODEL_PATH --host 0.0.0.0 --port 6006 --disable-log-requests &
CUDA_VISIBLE_DEVICES=6 vllm serve $MODEL_PATH --host 0.0.0.0 --port 6007 --disable-log-requests &
CUDA_VISIBLE_DEVICES=7 vllm serve $MODEL_PATH --host 0.0.0.0 --port 6008 --disable-log-requests &

This should be refactored into a simpler, configurable setup that adapts to the user's environment.
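
A minimal sketch of what that could look like, assuming the rest of the script consumes a main_ports array (NUM_GPUS, TP_SIZE, and BASE_PORT are illustrative knobs, not variables in the original script):

# Hypothetical configuration knobs, overridable from the environment
NUM_GPUS=${NUM_GPUS:-$(nvidia-smi --list-gpus | wc -l)}   # default: every GPU nvidia-smi can see
TP_SIZE=${TP_SIZE:-1}                                     # GPUs per server instance (tensor parallelism)
BASE_PORT=${BASE_PORT:-6001}

main_ports=()
for ((i = 0; i < NUM_GPUS / TP_SIZE; i++)); do
    # Comma-separated GPU list for this server, e.g. "0,1" when TP_SIZE=2
    gpus=$(seq -s, $((i * TP_SIZE)) $(((i + 1) * TP_SIZE - 1)))
    port=$((BASE_PORT + i))
    CUDA_VISIBLE_DEVICES=$gpus vllm serve "$MODEL_PATH" --host 0.0.0.0 --port "$port" \
        --tensor-parallel-size "$TP_SIZE" --disable-log-requests &
    main_ports+=("$port")
done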

tobrun avatar Sep 18 '25 04:09 tobrun

Hi, I want to ask: does it force you to use 8 GPUs? If I delete one line, will it fail to run?

YiJunSachs avatar Sep 18 '25 08:09 YiJunSachs

I can run it perfectly fine on a 4x RTX 6000 Ada setup using:

# Server 1 uses GPU 0 and 1 together
CUDA_VISIBLE_DEVICES=0,1 vllm serve $MODEL_PATH --host 0.0.0.0 --port 6001 --tensor-parallel-size 2 --disable-log-requests &

# Server 2 uses GPU 2 and 3 together
CUDA_VISIBLE_DEVICES=2,3 vllm serve $MODEL_PATH --host 0.0.0.0 --port 6002 --tensor-parallel-size 2 --disable-log-requests &

main_ports=(6001 6002)
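
As a side note, a quick sanity check before kicking off inference could look like this (assuming vLLM's OpenAI-compatible API, which serves /v1/models):

# Confirm every instance in main_ports responds before running inference
for port in "${main_ports[@]}"; do
    curl -sf "http://localhost:${port}/v1/models" > /dev/null \
        && echo "server on port ${port} is up" \
        || echo "server on port ${port} is NOT responding"
done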

It's more about the usability of configuring the system. I would love to resolve this issue together with https://github.com/Alibaba-NLP/DeepResearch/issues/118; let me know what you think about that issue.

tobrun avatar Sep 18 '25 17:09 tobrun

Great configuration. I think all the modifications should be made to the ports, in both the .sh file and the .py file; configure the ports correctly and everything will be fine.
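
One hedged way to keep the two files in sync (the MAIN_PORTS variable and the entry-point name below are assumptions, not the repository's actual mechanism) is to export the port list once from the .sh file and have the .py file read it instead of hardcoding its own copy:

# Hypothetical glue: export the port list so the Python side can read it
main_ports=(6001 6002)
MAIN_PORTS=$(IFS=,; echo "${main_ports[*]}")   # -> "6001,6002"
export MAIN_PORTS

# The .py file would then read os.environ["MAIN_PORTS"] instead of a
# hardcoded list. (Entry-point name below is a placeholder.)
python inference_entrypoint.py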

zhaowenZhou avatar Sep 26 '25 09:09 zhaowenZhou