[vllm, rollout] fix: lazy initialize ZMQ to avoid event loop error

Open JobQiu opened this issue 3 months ago • 1 comment

🐛 Problem

When running Online DPO or SPIN training, the program crashes during initialization with:

RuntimeError: no running event loop
  at verl/workers/rollout/vllm_rollout/vllm_rollout_spmd.py:575

Root Cause

The vLLMAsyncRollout.__init__() method was calling asyncio.get_running_loop() in a synchronous context (Ray worker initialization), where no event loop exists.

Call chain:

Ray Worker (sync) → vLLMAsyncRollout.__init__() → _init_zeromq() → asyncio.get_running_loop() → 💥 Crash

According to Ray documentation, Ray actors with async methods still use synchronous __init__, not async def __init__. The event loop is only available when async methods are actually invoked.
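
The underlying asyncio behavior can be checked with the standard library alone. The following is an illustrative sketch, not code from verl; both function names are hypothetical. It shows that asyncio.get_running_loop() raises RuntimeError in plain synchronous code and succeeds once a coroutine is actually running.

```python
import asyncio


def loop_available_in_sync_context() -> bool:
    """Return True if asyncio.get_running_loop() works in plain sync code."""
    try:
        asyncio.get_running_loop()
        return True
    except RuntimeError:
        # No loop is running here, so the lookup fails.
        return False


async def loop_available_in_coroutine() -> bool:
    """Inside a running coroutine a loop exists, so the lookup succeeds."""
    asyncio.get_running_loop()
    return True


sync_result = loop_available_in_sync_context()
async_result = asyncio.run(loop_available_in_coroutine())
print(sync_result, async_result)
```

This mirrors the call chain above: a synchronous __init__ plays the role of the first function, and an async method such as resume() plays the role of the second.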


✅ Solution

Lazy Initialization

Delay ZeroMQ initialization until the first async method call, when an event loop is guaranteed to exist.

Changes Made

  1. Modified __init__ - Skip ZMQ initialization, set flags for lazy init
  2. Added _ensure_zmq_ready() - Async method to initialize ZMQ on first call
  3. Updated async methods - Call _ensure_zmq_ready() before accessing ZMQ

Code Changes

File: verl/workers/rollout/vllm_rollout/vllm_rollout_spmd.py

class vLLMAsyncRollout(BaseRollout):
    def __init__(self, config, model_config, device_mesh):
        super().__init__(config, model_config, device_mesh)
        self.tokenizer = self.model_config.tokenizer
        self.inference_engine = None

        # ✅ Fix: Delay ZMQ initialization
        self._zmq_initialized = False
        self.address = None
        self.socket = None
        self.zmq_loop_task = None

    async def _ensure_zmq_ready(self):
        """Ensure ZeroMQ is initialized (called lazily on first async method invocation)."""
        if not self._zmq_initialized:
            self.address = self._init_zeromq()
            self._zmq_initialized = True

    async def resume(self, tags: list[str]):
        await self._ensure_zmq_ready()  # ← Added
        # ... original code ...

    async def release(self):
        await self._ensure_zmq_ready()  # ← Added
        # ... original code ...

    async def update_weights(self, weights, **kwargs):
        await self._ensure_zmq_ready()  # ← Added
        # ... original code ...

Total changes: ~20 lines modified/added


🧪 Verification

1. Unit Tests

A test suite was added in tests/test_issue_4220_fix.py:

pytest tests/test_issue_4220_fix.py -v

Test coverage:

  1. ✅ Synchronous initialization doesn't crash
  2. ✅ Lazy initialization occurs on first async call
  3. ✅ _ensure_zmq_ready() is idempotent (safe to call multiple times)
  4. ✅ get_running_loop() is NOT called during __init__
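
The contents of the test file are not shown in this description. As a rough sketch of how items 3 and 4 could be checked, assuming a hypothetical FakeRollout stand-in that copies only the lazy-init fields from the fix (the class, its init_calls counter, and the placeholder address are all illustrative, not verl code):

```python
import asyncio
from unittest import mock


class FakeRollout:
    """Hypothetical stand-in mirroring the lazy-init fields of the fix."""

    def __init__(self):
        self._zmq_initialized = False
        self.address = None
        self.init_calls = 0  # test-only counter, not part of the real class

    def _init_zeromq(self):
        asyncio.get_running_loop()  # needs a running loop, as in the bug report
        self.init_calls += 1
        return "ipc:///tmp/fake.ipc"  # placeholder address, not a real verl value

    async def _ensure_zmq_ready(self):
        if not self._zmq_initialized:
            self.address = self._init_zeromq()
            self._zmq_initialized = True


def test_init_does_not_touch_event_loop():
    # Replace get_running_loop with a mock and confirm __init__ never calls it.
    with mock.patch("asyncio.get_running_loop") as spy:
        FakeRollout()
    spy.assert_not_called()


def test_ensure_zmq_ready_is_idempotent():
    async def scenario():
        rollout = FakeRollout()
        await rollout._ensure_zmq_ready()
        await rollout._ensure_zmq_ready()  # second call must be a no-op
        return rollout

    rollout = asyncio.run(scenario())
    assert rollout.init_calls == 1
    assert rollout.address is not None
```

Because _ensure_zmq_ready() contains no await between the flag check and the flag update, concurrent coroutines on a single event loop cannot interleave inside it, which is why a plain boolean flag is enough in this sketch.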

2. Minimal Reproduction

The bug can be reproduced locally without GPU/vLLM using a minimal test script:

Before fix: RuntimeError: no running event loop
After fix: Initialization succeeds, ZMQ initializes on first async call
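
The script itself is not included in this description. A minimal sketch of what such a reproduction could look like is given here, using hypothetical stand-in classes (EagerRollout, LazyRollout) instead of verl's real ones, so it needs neither a GPU nor vLLM:

```python
import asyncio


class EagerRollout:
    """Hypothetical stand-in for the pre-fix behavior: needs a loop in __init__."""

    def __init__(self):
        self.loop = asyncio.get_running_loop()  # raises in a sync context


class LazyRollout:
    """Hypothetical stand-in for the post-fix behavior."""

    def __init__(self):
        self._zmq_initialized = False
        self.loop = None

    async def _ensure_zmq_ready(self):
        if not self._zmq_initialized:
            self.loop = asyncio.get_running_loop()  # safe: inside a coroutine
            self._zmq_initialized = True

    async def resume(self):
        await self._ensure_zmq_ready()
        return "resumed"


def construct_eager() -> str:
    """Try the pre-fix construction path and report what happened."""
    try:
        EagerRollout()
        return "ok"
    except RuntimeError as exc:
        return f"RuntimeError: {exc}"


before = construct_eager()             # sync construction fails
rollout = LazyRollout()                # sync construction, no loop required
after = asyncio.run(rollout.resume())  # first async call triggers the lazy init
print(before)
print(after, rollout._zmq_initialized)
```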

3. Integration Test (Optional)

Run actual SPIN training to verify end-to-end:

cd recipe/spin
python -m recipe.spin.main_spin \
  data.train_files=$HOME/data/gsm8k/train.parquet \
  actor_rollout_ref.model.path=Qwen/Qwen2.5-0.5B-Instruct \
  trainer.n_gpus_per_node=1 \
  trainer.total_epochs=1

Expected:

  • ✅ No crash during initialization
  • ✅ Training proceeds normally

📊 Impact Analysis

What Changes?

For Users: Nothing! The API remains exactly the same.

  • No changes to calling code
  • No changes to configuration
  • Fully backward compatible

Internally:

  • ZMQ initialization deferred to the first async call (~0.001 seconds of one-time overhead)
  • No performance impact (initialization only happens once)

Risk Assessment

Risk Level: 🟢 Low

Why:

  • ✅ Minimal code changes (~20 lines)
  • ✅ No API changes
  • ✅ Idempotent initialization (safe to call multiple times)
  • ✅ Only affects initialization timing, not behavior
  • ✅ Verified with unit tests

Affected Components

Direct:

  • vLLMAsyncRollout class

Indirect (beneficiaries):

  • All training recipes using vLLM rollout (SPIN, DPO, GRPO, etc.)
  • Ray-based distributed training setups

🎯 Related Issues

Closes #4220


📚 References

JobQiu, Nov 23 '25 06:11

The SPIN recipe is broken: its worker should inherit from AsyncActorRolloutRefWorker, which is an asyncio actor and always has an event loop. https://github.com/volcengine/verl/blob/main/recipe/spin/fsdp_workers.py#L79

wuxibin89, Nov 24 '25 06:11