verl training gets stuck when initializing the vLLM V1 engine on an H20-141G machine

verl version: 0.4.1.dev0 / 0.5.0 (both tried)
vllm version: 0.8.5.post0 (cannot upgrade because of a glibc version limit)

I'm trying to run the examples/sglang_multiturn/run_qwen2.5-3b_gsm8k_multiturn.sh and recipe/retool/run_qwen2-32b_dapo.sh scripts on an H20-141G machine.

It hangs after "Loading checkpoint shards" completes. I dug into where exactly it gets stuck and found that it is in _wait_for_engine_startup in vllm/v1/engine/core_client.py. There it is supposed to receive 'READY' from a socket, but it receives nothing, even after a very long time.

For extra information, I've tested this script on an A100-80G machine and it worked well: 'READY' is received after several minutes.

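Also, the following code works well on the H20 machine, so I don't think there is a problem with my vllm installation. I'm here for your help.

    from vllm.engine.arg_utils import AsyncEngineArgs
    from vllm.v1.engine.async_llm import AsyncLLM

    # Standalone V1 engine start-up, outside of verl
    engine_args = AsyncEngineArgs(
        model="/model/huggingface.co/Qwen/Qwen2.5-3B-Instruct",
        enforce_eager=True,
    )
    engine = AsyncLLM.from_engine_args(engine_args)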

Sep 08 '25 03:09 autumnalK

Could you please use py-spy to see where it gets stuck?
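For anyone else hitting this, a minimal sketch of grabbing a stack dump from the hung process with py-spy, driven from Python. It assumes py-spy is installed and on PATH; the helper names `py_spy_dump_cmd` and `dump_stack` are made up for illustration, and attaching to another process may need elevated permissions on some systems.

```python
import shutil
import subprocess


def py_spy_dump_cmd(pid: int) -> list[str]:
    """Build the `py-spy dump` command line for a given process id."""
    return ["py-spy", "dump", "--pid", str(pid)]


def dump_stack(pid: int) -> str:
    """Run py-spy against `pid` and return whatever it printed.

    Hypothetical helper: it only wraps the py-spy CLI and does not
    interpret the output in any way.
    """
    if shutil.which("py-spy") is None:
        raise RuntimeError("py-spy not found on PATH (try `pip install py-spy`)")
    result = subprocess.run(
        py_spy_dump_cmd(pid), capture_output=True, text=True, check=False
    )
    # py-spy reports failures (e.g. missing permissions) on stderr.
    return result.stdout or result.stderr
```

Calling `dump_stack(pid)` on the pid of the process that is waiting in engine start-up should show the Python frame it is blocked in.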

Sep 08 '25 10:09 vermouth1992

> Could you please use py-spy to see where it gets stuck?

Sure. It gets stuck here (with the logs I added to the vllm package).

The py-spy result for the mentioned pid is as follows:

Sep 09 '25 08:09 autumnalK

Note that because I've added some logging to the original vllm, the line numbers are shifted: line 454 of core_client.py in my copy is events = poller.poll(STARTUP_POLL_PERIOD_MS).

Sep 09 '25 08:09 autumnalK

Same error. How did you solve it?

Sep 22 '25 11:09 zhouilu

Same error.

Oct 19 '25 06:10 gyy8426

@autumnalK @zhouilu @gyy8426 Could you try VLLM_WORKER_MULTIPROC_METHOD=spawn? (cf. https://github.com/vllm-project/vllm/issues/17676#issuecomment-3430645322) Please make sure this is set in all the (rollout) processes.
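A rough sketch of one way to set it from Python, in case exporting it in the launch script is not enough. It assumes the rollout workers are Ray actors, so the variable is also put into a Ray `runtime_env` dict; the helper name `rollout_env_vars` is hypothetical, and whether your verl config already forwards environment variables to the workers is not something this sketch checks.

```python
import os

ENV_NAME = "VLLM_WORKER_MULTIPROC_METHOD"


def rollout_env_vars(method: str = "spawn") -> dict[str, str]:
    """Environment variables to propagate to every rollout process."""
    return {ENV_NAME: method}


# Set it in the current (driver) process before vllm starts any workers.
os.environ.update(rollout_env_vars())

# Assumption: workers are launched through Ray. This dict would be passed as
# ray.init(runtime_env=runtime_env) so that each worker process sees the
# variable too.
runtime_env = {"env_vars": rollout_env_vars()}
```

To confirm that it took effect, print `os.environ.get("VLLM_WORKER_MULTIPROC_METHOD")` from inside a rollout worker, not only from the driver.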

Nov 17 '25 14:11 tongyx361