Training gets stuck when initializing the vLLM V1 engine on an H20-141G machine
verl version: 0.4.1.dev0 / 0.5.0 (both tried)
vllm version: 0.8.5.post0 (cannot upgrade because of a glibc version limit)
I'm trying to run the examples/sglang_multiturn/run_qwen2.5-3b_gsm8k_multiturn.sh and recipe/retool/run_qwen2-32b_dapo.sh scripts on an H20-141G machine.
Both hang after "Loading checkpoint shards" completes. I traced where it gets stuck and found it is in _wait_for_engine_startup in vllm/v1/engine/core_client.py. There it is supposed to receive 'READY' from a socket, but it receives nothing even after a very long time.
For extra information, I tested this script on an A100-80G machine and it worked well: 'READY' is received after several minutes.
Also, the following code works on the H20 machine, so I don't think my vllm installation is the problem. I'm here asking for your help.
from vllm.engine.arg_utils import AsyncEngineArgs
from vllm.v1.engine.async_llm import AsyncLLM

engine_args = AsyncEngineArgs(
    model="/model/huggingface.co/Qwen/Qwen2.5-3B-Instruct",
    enforce_eager=True,
)
engine = AsyncLLM.from_engine_args(engine_args)
Could you please use py-spy to see where it gets stuck?
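For anyone else following along, here is a minimal sketch of collecting such a stack dump from Python. It only wraps the `py-spy dump --pid <PID>` command; the helper names `pyspy_dump_cmd` and `dump_stack` are made up for illustration, and it assumes py-spy is installed and that you have permission to attach to the target process.

```python
import shutil
import subprocess


def pyspy_dump_cmd(pid, native=False):
    """Build the py-spy command line that prints the stack of a running process."""
    cmd = ["py-spy", "dump", "--pid", str(pid)]
    if native:
        # Also include native (C/C++) frames in the dump.
        cmd.append("--native")
    return cmd


def dump_stack(pid, native=False):
    """Run py-spy against `pid` and return its text output."""
    if shutil.which("py-spy") is None:
        raise RuntimeError("py-spy not found on PATH (try: pip install py-spy)")
    result = subprocess.run(
        pyspy_dump_cmd(pid, native=native),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout
```

Calling `print(dump_stack(<pid of the hung process>))` for both the process waiting in `_wait_for_engine_startup` and the engine core process it spawned should show which side is blocked.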
Sure. With logging added to the vllm package, this is where it gets stuck.
The py-spy result for the mentioned PID is as follows:
Because I added some logging to the original vllm code, the line numbers are shifted: line 454 of vllm's core_client.py is events = poller.poll(STARTUP_POLL_PERIOD_MS)
Same error. How did you solve it?
Same error.
@autumnalK @zhouilu @gyy8426 Could you try VLLM_WORKER_MULTIPROC_METHOD=spawn? (c.f. https://github.com/vllm-project/vllm/issues/17676#issuecomment-3430645322)
Please make sure this is set in all the (rollout) processes.
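As a rough sketch of what "set in all the processes" could look like, assuming the job is launched through Ray: the variable is set in the driver before vllm is imported, and the same value is forwarded to workers through Ray's `runtime_env` `env_vars` field. The helper name `rollout_runtime_env` is hypothetical, and verl may provide its own way to pass environment variables, so treat this as an illustration rather than the recommended configuration.

```python
import os

# Set in the current process; do this before vllm is imported so that
# vllm sees the value when it decides how to start worker processes.
os.environ["VLLM_WORKER_MULTIPROC_METHOD"] = "spawn"


def rollout_runtime_env(extra_env=None):
    """Build a Ray runtime_env dict that forwards the setting to worker processes.

    `extra_env` is an optional dict of additional environment variables.
    """
    env_vars = {"VLLM_WORKER_MULTIPROC_METHOD": "spawn"}
    env_vars.update(extra_env or {})
    return {"env_vars": env_vars}


# Hypothetical usage, assuming Ray is installed and you control ray.init():
# import ray
# ray.init(runtime_env=rollout_runtime_env())
```

Exporting the variable in the launch shell script before starting the job is the other common option; the point in either case is that the rollout workers, not only the driver, need to see it.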