[vllm, rollout] fix: lazy initialize ZMQ to avoid event loop error

Open JobQiu opened this issue 3 months ago • 1 comments

🐛 Problem

When running Online DPO or SPIN training, the program crashes during initialization with:

RuntimeError: no running event loop
  at verl/workers/rollout/vllm_rollout/vllm_rollout_spmd.py:575

Root Cause

The vLLMAsyncRollout.__init__() method was calling asyncio.get_running_loop() in a synchronous context (Ray worker initialization), where no event loop exists.

Call chain:

Ray Worker (sync) → vLLMAsyncRollout.__init__() → _init_zeromq() → asyncio.get_running_loop() → 💥 Crash

According to Ray documentation, Ray actors with async methods still use synchronous __init__, not async def __init__. The event loop is only available when async methods are actually invoked.

✅ Solution

Lazy Initialization

Delay ZeroMQ initialization until the first async method call, when an event loop is guaranteed to exist.

Changes Made

Modified __init__ - Skip ZMQ initialization, set flags for lazy init
Added _ensure_zmq_ready() - Async method to initialize ZMQ on first call
Updated async methods - Call _ensure_zmq_ready() before accessing ZMQ

Code Changes

File: verl/workers/rollout/vllm_rollout/vllm_rollout_spmd.py

class vLLMAsyncRollout(BaseRollout):
    def __init__(self, config, model_config, device_mesh):
        super().__init__(config, model_config, device_mesh)
        self.tokenizer = self.model_config.tokenizer
        self.inference_engine = None

        # ✅ Fix: Delay ZMQ initialization
        self._zmq_initialized = False
        self.address = None
        self.socket = None
        self.zmq_loop_task = None

    async def _ensure_zmq_ready(self):
        """Ensure ZeroMQ is initialized (called lazily on first async method invocation)."""
        if not self._zmq_initialized:
            self.address = self._init_zeromq()
            self._zmq_initialized = True

    async def resume(self, tags: list[str]):
        await self._ensure_zmq_ready()  # ← Added
        # ... original code ...

    async def release(self):
        await self._ensure_zmq_ready()  # ← Added
        # ... original code ...

    async def update_weights(self, weights, **kwargs):
        await self._ensure_zmq_ready()  # ← Added
        # ... original code ...

Total changes: ~20 lines modified/added

🧪 Verification

1. Unit Tests

Created comprehensive test suite in tests/test_issue_4220_fix.py:

pytest tests/test_issue_4220_fix.py -v

Test coverage:

✅ Synchronous initialization doesn't crash
✅ Lazy initialization occurs on first async call
✅ _ensure_zmq_ready() is idempotent (safe to call multiple times)
✅ get_running_loop() is NOT called during __init__

2. Minimal Reproduction

The bug can be reproduced locally without GPU/vLLM using a minimal test script:

Before fix: RuntimeError: no running event loop After fix: Initialization succeeds, ZMQ initializes on first async call

3. Integration Test (Optional)

Run actual SPIN training to verify end-to-end:

cd recipe/spin
python -m recipe.spin.main_spin \
  data.train_files=$HOME/data/gsm8k/train.parquet \
  actor_rollout_ref.model.path=Qwen/Qwen2.5-0.5B-Instruct \
  trainer.n_gpus_per_node=1 \
  trainer.total_epochs=1

Expected:

✅ No crash during initialization
✅ Training proceeds normally

📊 Impact Analysis

What Changes?

For Users: Nothing! The API remains exactly the same.

No changes to calling code
No changes to configuration
Fully backward compatible

Internally:

ZMQ initialization deferred by ~0.001 seconds (first async call)
No performance impact (initialization only happens once)

Risk Assessment

Risk Level: 🟢 Low

Why:

✅ Minimal code changes (~20 lines)
✅ No API changes
✅ Idempotent initialization (safe to call multiple times)
✅ Only affects initialization timing, not behavior
✅ Verified with unit tests

Affected Components

Direct:

vLLMAsyncRollout class

Indirect (beneficiaries):

All training recipes using vLLM rollout (SPIN, DPO, GRPO, etc.)
Ray-based distributed training setups

🎯 Related Issues

Closes #4220

📚 References

Ray AsyncIO Documentation
Python asyncio Event Loop
Issue discussion: https://github.com/volcengine/verl/issues/4220

Nov 23 '25 06:11 JobQiu

There's a break in SPIN recipe, it should inheritance AsyncActorRolloutRefWorker, which is an asyncio actor and always has an event loop. https://github.com/volcengine/verl/blob/main/recipe/spin/fsdp_workers.py#L79

Nov 24 '25 06:11 wuxibin89