Hardcoded number of GPUs in run_react_infer.sh
This block of code forces a setup with 8 GPUs, each of which needs enough VRAM to host a full instance of the model:
CUDA_VISIBLE_DEVICES=0 vllm serve $MODEL_PATH --host 0.0.0.0 --port 6001 --disable-log-requests &
CUDA_VISIBLE_DEVICES=1 vllm serve $MODEL_PATH --host 0.0.0.0 --port 6002 --disable-log-requests &
CUDA_VISIBLE_DEVICES=2 vllm serve $MODEL_PATH --host 0.0.0.0 --port 6003 --disable-log-requests &
CUDA_VISIBLE_DEVICES=3 vllm serve $MODEL_PATH --host 0.0.0.0 --port 6004 --disable-log-requests &
CUDA_VISIBLE_DEVICES=4 vllm serve $MODEL_PATH --host 0.0.0.0 --port 6005 --disable-log-requests &
CUDA_VISIBLE_DEVICES=5 vllm serve $MODEL_PATH --host 0.0.0.0 --port 6006 --disable-log-requests &
CUDA_VISIBLE_DEVICES=6 vllm serve $MODEL_PATH --host 0.0.0.0 --port 6007 --disable-log-requests &
CUDA_VISIBLE_DEVICES=7 vllm serve $MODEL_PATH --host 0.0.0.0 --port 6008 --disable-log-requests &
This should be refactored into a simpler, configurable setup that adapts to the user's environment.
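For illustration, a minimal sketch of such a refactor, assuming one model instance per GPU; NUM_GPUS and BASE_PORT are hypothetical variable names, not part of the current script:

# Launch one vLLM server per GPU, assigning consecutive ports starting at BASE_PORT.
NUM_GPUS=${NUM_GPUS:-8}
BASE_PORT=${BASE_PORT:-6001}

main_ports=()
for ((i = 0; i < NUM_GPUS; i++)); do
    port=$((BASE_PORT + i))
    CUDA_VISIBLE_DEVICES=$i vllm serve $MODEL_PATH --host 0.0.0.0 --port $port --disable-log-requests &
    main_ports+=($port)
done

This keeps the current behaviour as the default (8 servers on ports 6001-6008) while letting users with fewer GPUs scale down by changing a single variable.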
Hi, I want to ask: does it force you to use 8 GPUs? If I delete one line, will it fail to run?
I can run it perfectly fine on 4x RTX 6000 Ada using:
# Server 1 uses GPU 0 and 1 together
CUDA_VISIBLE_DEVICES=0,1 vllm serve $MODEL_PATH --host 0.0.0.0 --port 6001 --tensor-parallel-size 2 --disable-log-requests &
# Server 2 uses GPU 2 and 3 together
CUDA_VISIBLE_DEVICES=2,3 vllm serve $MODEL_PATH --host 0.0.0.0 --port 6002 --tensor-parallel-size 2 --disable-log-requests &
main_ports=(6001 6002)
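The same idea generalizes to any setup where several GPUs are pooled per server via tensor parallelism. A sketch under that assumption; NUM_GPUS, GPUS_PER_SERVER, and BASE_PORT are hypothetical variable names, not part of the current script:

# Launch one vLLM server per group of GPUS_PER_SERVER GPUs; with NUM_GPUS=4 and
# GPUS_PER_SERVER=2 this reproduces the two-server configuration above.
NUM_GPUS=${NUM_GPUS:-4}
GPUS_PER_SERVER=${GPUS_PER_SERVER:-2}
BASE_PORT=${BASE_PORT:-6001}

main_ports=()
server=0
for ((start = 0; start < NUM_GPUS; start += GPUS_PER_SERVER)); do
    gpus=$(seq -s, $start $((start + GPUS_PER_SERVER - 1)))   # e.g. "0,1" for the first server
    port=$((BASE_PORT + server))
    CUDA_VISIBLE_DEVICES=$gpus vllm serve $MODEL_PATH --host 0.0.0.0 --port $port \
        --tensor-parallel-size $GPUS_PER_SERVER --disable-log-requests &
    main_ports+=($port)
    server=$((server + 1))
done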
It's more about the usability of configuring the system. I would love to resolve this issue together with https://github.com/Alibaba-NLP/DeepResearch/issues/118; let me know what you think about that issue.
Great configuration. I think all modifications should be made to the ports, in both the .sh file and the .py file; configure the ports correctly and everything will be fine.
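One way to keep the two files in sync would be to define the port list once in the shell script and hand it to the Python side, for example through an environment variable, instead of hardcoding the same numbers in both places. A sketch under that assumption; MAIN_PORTS is a hypothetical variable name, not part of the current code:

# Build the port list once in run_react_infer.sh and export it so the Python
# entry point can read it from the environment (hypothetical mechanism).
main_ports=(6001 6002)
export MAIN_PORTS=$(IFS=,; echo "${main_ports[*]}")   # -> "6001,6002"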