Critical Bug: OSError: [Errno 12] on ray.init()
Environment Details
Verl Version: [0.5.0.dev]
Ray Version: 2.48.0
Hydra Version: 1.3.2
Python Version: 3.12.11
Operating System: Linux
Kernel Version: x86_64 GNU/Linux
Dependencies: The project is also dependent on vllm.
Exhaustive Diagnostic Steps Taken
We have conducted an extensive investigation and have ruled out all standard resource limitations. The environment is not resource-constrained.
[✅] System Memory: free shows ~2TB of available RAM.
[✅] Container/Cgroup Memory Limit: /sys/fs/cgroup/memory/memory.limit_in_bytes shows no effective limit (~2TB).
[✅] User ulimit: ulimit -a shows max memory size and virtual memory are both unlimited.
[✅] Shared Memory (/dev/shm): df -h /dev/shm shows a size of 320G.
[✅] Kernel Map Count: cat /proc/sys/vm/max_map_count shows a high value of 655300.
[✅] PID Limits: Both cgroup pids.max (>3,000,000) and ulimit -u (102400) are extremely high and not a factor.
Crucially, the following isolation tests were performed:
[✅] Minimal Ray Test: A simple script with only import ray; ray.init() succeeds without error.
Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "~/verl/verl/trainer/main_ppo.py", line 380, in <module>
    main()
  File "/home/abc/.local/lib/python3.12/site-packages/hydra/main.py", line 94, in decorated_main
    _run_hydra(
  File "/home/abc/.local/lib/python3.12/site-packages/hydra/_internal/utils.py", line 394, in _run_hydra
    _run_app(
  File "/home/abc/.local/lib/python3.12/site-packages/hydra/_internal/utils.py", line 457, in _run_app
    run_and_report(
  File "/home/abc/.local/lib/python3.12/site-packages/hydra/_internal/utils.py", line 223, in run_and_report
    raise ex
  File "/home/abc/.local/lib/python3.12/site-packages/hydra/_internal/utils.py", line 220, in run_and_report
    return func()
           ^^^^^^
  File "/home/abc/.local/lib/python3.12/site-packages/hydra/_internal/utils.py", line 458, in <lambda>
    lambda: hydra.run(
            ^^^^^^^^^^
  File "/home/abc/.local/lib/python3.12/site-packages/hydra/_internal/hydra.py", line 132, in run
    _ = ret.return_value
        ^^^^^^^^^^^^^^^^
  File "/home/abc/.local/lib/python3.12/site-packages/hydra/core/utils.py", line 260, in return_value
    raise self._return_value
  File "/home/abc/.local/lib/python3.12/site-packages/hydra/core/utils.py", line 186, in run_job
    ret.return_value = task_function(task_cfg)
                       ^^^^^^^^^^^^^^^^^^^^^^^
  File "~/verl/verl/trainer/main_ppo.py", line 40, in main
    run_ppo(config)
  File "~/verl/verl/trainer/main_ppo.py", line 60, in run_ppo
    ray.init(
  File "/usr/local/python3.12.11/lib/python3.12/site-packages/ray/_private/client_mode_hook.py", line 104, in wrapper
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/python3.12.11/lib/python3.12/site-packages/ray/_private/worker.py", line 1929, in init
    connect(
  File "/usr/local/python3.12.11/lib/python3.12/site-packages/ray/_private/worker.py", line 2402, in connect
    faulthandler.enable(all_threads=False)
OSError: [Errno 12] Cannot allocate memory
Same error here. @xuetf, did you resolve it?
I get the same error.
Try the following steps:
- Run ulimit -l unlimited on your machine.
- Add -X faulthandler to the start command, for example: python3 -X faulthandler -m verl.trainer.main_ppo xxx...
- Comment out 'faulthandler.enable()' in sglang's 'scheduler.py'.
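
Before relaunching training, the state that the first two steps change can be inspected from Python. The snippet below is a minimal sketch using only the standard library. The helper names describe_limit, memlock_limit and report are hypothetical, and it is an assumption (not confirmed in this thread) that the locked-memory limit is what makes faulthandler.enable() fail.

```python
import faulthandler
import resource


def describe_limit(value):
    """Format an rlimit value in KiB, or 'unlimited' for RLIM_INFINITY."""
    if value == resource.RLIM_INFINITY:
        return "unlimited"
    return f"{value // 1024} KiB"


def memlock_limit():
    """Return the (soft, hard) RLIMIT_MEMLOCK pair, the limit 'ulimit -l' controls."""
    return resource.getrlimit(resource.RLIMIT_MEMLOCK)


def report():
    """Print the locked-memory limits and whether faulthandler is already active."""
    soft, hard = memlock_limit()
    print("max locked memory (soft):", describe_limit(soft))
    print("max locked memory (hard):", describe_limit(hard))
    # True when the interpreter was started with -X faulthandler,
    # PYTHONFAULTHANDLER is set, or faulthandler.enable() was already called.
    print("faulthandler enabled:", faulthandler.is_enabled())
```

Calling report() in the same shell and with the same interpreter flags as the training command shows whether ulimit -l unlimited and -X faulthandler actually took effect for that process.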
Thank you. Step 2 worked for me on H200 devices.