
Critical Bug: OSError: [Errno 12] on ray.init()

Open xuetf opened this issue 5 months ago • 4 comments

Environment Details

Verl Version: 0.5.0.dev

Ray Version: 2.48.0

Hydra Version: 1.3.2

Python Version: 3.12.11

Operating System: Linux

Kernel Version: x86_64 GNU/Linux

Dependencies: The project is also dependent on vllm.

Exhaustive Diagnostic Steps Taken

We have conducted an extensive investigation and have ruled out all standard resource limitations. The environment is not resource-constrained.

[✅] System Memory: free shows ~2TB of available RAM.

[✅] Container/Cgroup Memory Limit: /sys/fs/cgroup/memory/memory.limit_in_bytes shows no effective limit (~2TB).

[✅] User ulimit: ulimit -a shows max memory size and virtual memory are both unlimited.

[✅] Shared Memory (/dev/shm): df -h /dev/shm shows a size of 320G.

[✅] Kernel Map Count: cat /proc/sys/vm/max_map_count shows a high value of 655300.

[✅] PID Limits: Both cgroup pids.max (>3,000,000) and ulimit -u (102400) are extremely high and not a factor.
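For anyone reproducing the checklist above, the same limits can be read from inside Python with the standard library. This is a minimal sketch, not part of the original report: the helper names (`fmt_limit`, `read_text`, `collect_limits`) are hypothetical, the cgroup path is the v1 path quoted above, and any file that does not exist on a given machine is reported as `None`.

```python
import resource
import shutil


def fmt_limit(value):
    """Render RLIM_INFINITY as 'unlimited', anything else as a number."""
    return "unlimited" if value == resource.RLIM_INFINITY else str(value)


def read_text(path):
    """Return the stripped contents of a file, or None if it cannot be read."""
    try:
        with open(path) as fh:
            return fh.read().strip()
    except OSError:
        return None


def collect_limits():
    """Gather the limits from the checklist into a dict (hypothetical helper)."""
    report = {}
    # getrlimit returns raw units (bytes for memory limits), whereas the
    # shell's ulimit prints some of these in KiB.
    for label, which in [
        ("virtual memory (ulimit -v)", resource.RLIMIT_AS),
        ("locked memory (ulimit -l)", resource.RLIMIT_MEMLOCK),
        ("max user processes (ulimit -u)", resource.RLIMIT_NPROC),
    ]:
        soft, hard = resource.getrlimit(which)
        report[label] = f"soft={fmt_limit(soft)} hard={fmt_limit(hard)}"
    report["vm.max_map_count"] = read_text("/proc/sys/vm/max_map_count")
    report["cgroup v1 memory limit"] = read_text(
        "/sys/fs/cgroup/memory/memory.limit_in_bytes"
    )
    try:
        report["/dev/shm total bytes"] = shutil.disk_usage("/dev/shm").total
    except OSError:
        report["/dev/shm total bytes"] = None
    return report


if __name__ == "__main__":
    for key, value in collect_limits().items():
        print(f"{key}: {value}")
```

Running it inside the same container and user session as the training job matters, since limits can differ between shells.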

Crucially, the following isolation tests were performed:

[✅] Minimal Ray Test: A simple script with only import ray; ray.init() succeeds without error.

Full Run: launching the trainer as a module (verl.trainer.main_ppo) fails inside ray.init() with the traceback below.


Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "~/verl/verl/trainer/main_ppo.py", line 380, in <module>
    main()
  File "/home/abc/.local/lib/python3.12/site-packages/hydra/main.py", line 94, in decorated_main
    _run_hydra(
  File "/home/abc/.local/lib/python3.12/site-packages/hydra/_internal/utils.py", line 394, in _run_hydra
    _run_app(
  File "/home/abc/.local/lib/python3.12/site-packages/hydra/_internal/utils.py", line 457, in _run_app
    run_and_report(
  File "/home/abc/.local/lib/python3.12/site-packages/hydra/_internal/utils.py", line 223, in run_and_report
    raise ex
  File "/home/abc/.local/lib/python3.12/site-packages/hydra/_internal/utils.py", line 220, in run_and_report
    return func()
           ^^^^^^
  File "/home/abc/.local/lib/python3.12/site-packages/hydra/_internal/utils.py", line 458, in <lambda>
    lambda: hydra.run(
            ^^^^^^^^^^
  File "/home/abc/.local/lib/python3.12/site-packages/hydra/_internal/hydra.py", line 132, in run
    _ = ret.return_value
        ^^^^^^^^^^^^^^^^
  File "/home/abc/.local/lib/python3.12/site-packages/hydra/core/utils.py", line 260, in return_value
    raise self._return_value
  File "/home/abc/.local/lib/python3.12/site-packages/hydra/core/utils.py", line 186, in run_job
    ret.return_value = task_function(task_cfg)
                       ^^^^^^^^^^^^^^^^^^^^^^^
  File "~/verl/verl/trainer/main_ppo.py", line 40, in main
    run_ppo(config)
  File "~/verl/verl/trainer/main_ppo.py", line 60, in run_ppo
    ray.init(
  File "/usr/local/python3.12.11/lib/python3.12/site-packages/ray/_private/client_mode_hook.py", line 104, in wrapper
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/python3.12.11/lib/python3.12/site-packages/ray/_private/worker.py", line 1929, in init
    connect(
  File "/usr/local/python3.12.11/lib/python3.12/site-packages/ray/_private/worker.py", line 2402, in connect
    faulthandler.enable(all_threads=False)
OSError: [Errno 12] Cannot allocate memory
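
The last frame of the traceback is `faulthandler.enable(all_threads=False)`, a standard-library call, so it can be probed without Ray or verl. The sketch below is an assumption about how one might isolate it, not something taken from the report; `probe_faulthandler` is a hypothetical helper. In a fresh interpreter it may well succeed, just as the minimal Ray test did, so the result is only informative when run in the same environment and launch conditions as the failing job.

```python
import errno
import faulthandler
import sys


def probe_faulthandler(file):
    """Try the call from the traceback; return (ok, errno_name_or_None)."""
    if faulthandler.is_enabled():
        # Already enabled (for example via -X faulthandler): nothing to probe.
        return True, None
    try:
        faulthandler.enable(file=file, all_threads=False)
    except OSError as exc:
        # Map the numeric errno (12 in the report) to its name, e.g. ENOMEM.
        return False, errno.errorcode.get(exc.errno, str(exc.errno))
    faulthandler.disable()  # leave the process as we found it
    return True, None


if __name__ == "__main__":
    ok, err = probe_faulthandler(sys.stderr)
    if ok:
        print("faulthandler.enable succeeded")
    else:
        print(f"faulthandler.enable failed: {err}")
```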

xuetf avatar Aug 25 '25 12:08 xuetf

Same error. @xuetf, did you resolve it?

VegetaPn avatar Sep 26 '25 03:09 VegetaPn

The same error.

Sh1k17 avatar Oct 16 '25 11:10 Sh1k17

Try the following steps:

  1. Run ulimit -l unlimited on your machine.
  2. Add -X faulthandler to the start command, for example: python3 -X faulthandler -m verl.trainer.main_ppo xxx...
  3. Comment out faulthandler.enable() in sglang's scheduler.py.
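
To confirm what step 2 changes, the sketch below starts a child interpreter with and without `-X faulthandler` and asks whether `faulthandler` is already enabled at startup. It only demonstrates the documented effect of the flag; why that avoids the error in this setup is not established in this thread. `enabled_at_startup` and `CHECK` are hypothetical names introduced for illustration.

```python
import subprocess
import sys

# One-liner executed in the child interpreter.
CHECK = "import faulthandler; print(faulthandler.is_enabled())"


def enabled_at_startup(extra_args):
    """Run a child interpreter and report whether faulthandler is already on."""
    result = subprocess.run(
        [sys.executable, *extra_args, "-c", CHECK],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip() == "True"


if __name__ == "__main__":
    print("with -X faulthandler:", enabled_at_startup(["-X", "faulthandler"]))
    print("without the flag:", enabled_at_startup([]))
```

Setting the environment variable `PYTHONFAULTHANDLER=1` has the same effect as the flag, which can be easier to apply when the start command is wrapped in a launcher script.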

VegetaPn avatar Oct 16 '25 12:10 VegetaPn

Thank you. Step 2 worked for me on H200 devices.

Sh1k17 avatar Nov 17 '25 06:11 Sh1k17