
[DataLoader worker exited unexpectedly] How to correctly set parameters to avoid thread conflict issues?

Open zhaojiahemiaomiaomiao opened this issue 10 months ago • 3 comments

I am encountering an issue related to thread conflicts while running my training script, and I would greatly appreciate any assistance or insights the community could provide.

My Training Environment:

  • A800 (80 GB) × 8

Key Parameters Configuration:

  • Model: qwen2.5-7b
  • max_prompt_length=$((1024 * 2))
  • max_response_length=$((1024 * 10))
  • sp_size=4
  • use_dynamic_bsz=True
  • offload=True
  • gen_tp=4
  • trainer.n_gpus_per_node=8
  • ray.remote(num_cpus=1)
  • DataLoader: num_workers=2 (also tried 0)

Could these parameter settings be causing the thread conflict issues I'm experiencing? If so, how should I adjust these parameters to resolve this problem?

Thank you very much for your time and help!
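For reference, below is a rough, hedged sketch of how settings like these are typically passed as Hydra overrides to verl's `main_ppo`. The key names follow verl's example scripts and may differ across versions, so please verify them against the `ppo_trainer.yaml` in your checkout; the `ray.remote(num_cpus=1)` and DataLoader `num_workers` values are set in code rather than on the command line and are not shown here.

```bash
# Hedged sketch only: approximate Hydra overrides for the settings listed above.
# Key names follow verl's example scripts and may differ across versions;
# check them against the ppo_trainer.yaml shipped with your verl checkout.
max_prompt_length=$((1024 * 2))
max_response_length=$((1024 * 10))

python3 -m verl.trainer.main_ppo \
    data.max_prompt_length=${max_prompt_length} \
    data.max_response_length=${max_response_length} \
    actor_rollout_ref.model.path=Qwen/Qwen2.5-7B \
    actor_rollout_ref.actor.use_dynamic_bsz=True \
    actor_rollout_ref.actor.ulysses_sequence_parallel_size=4 \
    actor_rollout_ref.actor.fsdp_config.param_offload=True \
    actor_rollout_ref.actor.fsdp_config.optimizer_offload=True \
    actor_rollout_ref.rollout.tensor_model_parallel_size=4 \
    trainer.n_gpus_per_node=8 \
    trainer.nnodes=1
```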

```

...
(main_task pid=3042881)  "'val/math_dapo/acc/maj@16/std': 0.007240567076160023, "
(main_task pid=3042881)  "'val/math_dapo/acc/best@32/mean': 0.04183333333333333, "
(main_task pid=3042881)  "'val/math_dapo/acc/best@32/std': 0.03223090754862865, "
(main_task pid=3042881)  "'val/math_dapo/acc/worst@32/mean': 0.0, 'val/math_dapo/acc/worst@32/std': "
(main_task pid=3042881)  "0.0, 'val/math_dapo/acc/maj@32/mean': 0.00030000000000000003, "
(main_task pid=3042881)  "'val/math_dapo/acc/maj@32/std': 0.004023039702127951}")
(main_task pid=3042881) step:0 - val/math_dapo/score/mean@32:-0.996 - val/math_dapo/score/std@32:0.023 - val/math_dapo/score/best@2/mean:-0.993 - val/math_dapo/score/best@2/std:0.030 - val/math_dapo/score/worst@2/mean:-1.000 - val/math_dapo/score/worst@2/std:0.002 - val/math_dapo/score/maj@2/mean:-0.996 - val/math_dapo/score/maj@2/std:0.023 - val/math_dapo/score/best@4/mean:-0.986 - val/math_dapo/score/best@4/std:0.041 - val/math_dapo/score/worst@4/mean:-1.000 - val/math_dapo/score/worst@4/std:0.000 - val/math_dapo/score/maj@4/mean:-0.996 - val/math_dapo/score/maj@4/std:0.022 - val/math_dapo/score/best@8/mean:-0.973 - val/math_dapo/score/best@8/std:0.053 - val/math_dapo/score/worst@8/mean:-1.000 - val/math_dapo/score/worst@8/std:0.000 - val/math_dapo/score/maj@8/mean:-0.998 - val/math_dapo/score/maj@8/std:0.017 - val/math_dapo/score/best@16/mean:-0.949 - val/math_dapo/score/best@16/std:0.065 - val/math_dapo/score/worst@16/mean:-1.000 - val/math_dapo/score/worst@16/std:0.000 - val/math_dapo/score/maj@16/mean:-0.998 - val/math_dapo/score/maj@16/std:0.014 - val/math_dapo/score/best@32/mean:-0.916 - val/math_dapo/score/best@32/std:0.064 - val/math_dapo/score/worst@32/mean:-1.000 - val/math_dapo/score/worst@32/std:0.000 - val/math_dapo/score/maj@32/mean:-0.999 - val/math_dapo/score/maj@32/std:0.008 - val/math_dapo/acc/mean@32:0.002 - val/math_dapo/acc/std@32:0.012 - val/math_dapo/acc/best@2/mean:0.004 - val/math_dapo/acc/best@2/std:0.015 - val/math_dapo/acc/worst@2/mean:0.000 - val/math_dapo/acc/worst@2/std:0.001 - val/math_dapo/acc/maj@2/mean:0.002 - val/math_dapo/acc/maj@2/std:0.011 - val/math_dapo/acc/best@4/mean:0.007 - val/math_dapo/acc/best@4/std:0.021 - val/math_dapo/acc/worst@4/mean:0.000 - val/math_dapo/acc/worst@4/std:0.000 - val/math_dapo/acc/maj@4/mean:0.002 - val/math_dapo/acc/maj@4/std:0.011 - val/math_dapo/acc/best@8/mean:0.013 - val/math_dapo/acc/best@8/std:0.027 - val/math_dapo/acc/worst@8/mean:0.000 - val/math_dapo/acc/worst@8/std:0.000 - val/math_dapo/acc/maj@8/mean:0.001 - val/math_dapo/acc/maj@8/std:0.008 - val/math_dapo/acc/best@16/mean:0.026 - val/math_dapo/acc/best@16/std:0.032 - val/math_dapo/acc/worst@16/mean:0.000 - val/math_dapo/acc/worst@16/std:0.000 - val/math_dapo/acc/maj@16/mean:0.001 - val/math_dapo/acc/maj@16/std:0.007 - val/math_dapo/acc/best@32/mean:0.042 - val/math_dapo/acc/best@32/std:0.032 - val/math_dapo/acc/worst@32/mean:0.000 - val/math_dapo/acc/worst@32/std:0.000 - val/math_dapo/acc/maj@32/mean:0.000 - val/math_dapo/acc/maj@32/std:0.004
Traceback (most recent call last):
  File "/usr/local/miniconda3/envs/logicRL/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/local/miniconda3/envs/logicRL/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/mnt/tenant-home_speed/zjh/verl-gm-tyx-puffin-main/verl/trainer/main_ppo.py", line 201, in <module>
  File "/usr/local/miniconda3/envs/logicRL/lib/python3.10/site-packages/hydra/main.py", line 94, in decorated_main
    _run_hydra(
  File "/usr/local/miniconda3/envs/logicRL/lib/python3.10/site-packages/hydra/_internal/utils.py", line 394, in _run_hydra
    _run_app(
  File "/usr/local/miniconda3/envs/logicRL/lib/python3.10/site-packages/hydra/_internal/utils.py", line 457, in _run_app
    run_and_report(
  File "/usr/local/miniconda3/envs/logicRL/lib/python3.10/site-packages/hydra/_internal/utils.py", line 223, in run_and_report
    raise ex
  File "/usr/local/miniconda3/envs/logicRL/lib/python3.10/site-packages/hydra/_internal/utils.py", line 220, in run_and_report
    return func()
  File "/usr/local/miniconda3/envs/logicRL/lib/python3.10/site-packages/hydra/_internal/utils.py", line 458, in <lambda>
    lambda: hydra.run(
  File "/usr/local/miniconda3/envs/logicRL/lib/python3.10/site-packages/hydra/_internal/hydra.py", line 132, in run
    _ = ret.return_value
  File "/usr/local/miniconda3/envs/logicRL/lib/python3.10/site-packages/hydra/core/utils.py", line 260, in return_value
    raise self._return_value
  File "/usr/local/miniconda3/envs/logicRL/lib/python3.10/site-packages/hydra/core/utils.py", line 186, in run_job
    ret.return_value = task_function(task_cfg)
  File "/mnt/tenant-home_speed/zjh/verl-gm-tyx-puffin-main/verl/trainer/main_ppo.py", line 68, in main
  File "/mnt/tenant-home_speed/zjh/verl-gm-tyx-puffin-main/verl/trainer/main_ppo.py", line 90, in run_ppo
  File "/usr/local/miniconda3/envs/logicRL/lib/python3.10/site-packages/ray/_private/auto_init_hook.py", line 21, in auto_init_wrapper
    return fn(*args, **kwargs)
  File "/usr/local/miniconda3/envs/logicRL/lib/python3.10/site-packages/ray/_private/client_mode_hook.py", line 103, in wrapper
    return func(*args, **kwargs)
  File "/usr/local/miniconda3/envs/logicRL/lib/python3.10/site-packages/ray/_private/worker.py", line 2659, in get
    values, debugger_breakpoint = worker.get_objects(object_refs, timeout=timeout)
  File "/usr/local/miniconda3/envs/logicRL/lib/python3.10/site-packages/ray/_private/worker.py", line 871, in get_objects
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(RuntimeError): ray::main_task() (pid=3042881, ip=192.169.76.90)
  File "/usr/local/miniconda3/envs/logicRL/lib/python3.10/multiprocessing/queues.py", line 113, in get
    if not self._poll(timeout):
  File "/usr/local/miniconda3/envs/logicRL/lib/python3.10/multiprocessing/connection.py", line 257, in poll
    return self._poll(timeout)
  File "/usr/local/miniconda3/envs/logicRL/lib/python3.10/multiprocessing/connection.py", line 424, in _poll
    r = wait([self], timeout)
  File "/usr/local/miniconda3/envs/logicRL/lib/python3.10/multiprocessing/connection.py", line 931, in wait
    ready = selector.select(timeout)
  File "/usr/local/miniconda3/envs/logicRL/lib/python3.10/selectors.py", line 416, in select
    fd_event_list = self._selector.poll(timeout)
  File "/usr/local/miniconda3/envs/logicRL/lib/python3.10/site-packages/torch/utils/data/_utils/signal_handling.py", line 67, in handler
    _error_if_any_worker_fails()
RuntimeError: DataLoader worker (pid 3052708) exited unexpectedly with exit code 1. Details are lost due to multiprocessing. Rerunning with num_workers=0 may give better error trace.

The above exception was the direct cause of the following exception:

ray::main_task() (pid=3042881, ip=192.169.76.90)
  File "/verl-gm-tyx-puffin-main/verl/trainer/main_ppo.py", line 197, in main_task
  File "/verl-gm-tyx-puffin-main/verl/trainer/ppo/ray_trainer.py", line 861, in fit
    for batch_dict in self.train_dataloader:
  File "/usr/local/miniconda3/envs/logicRL/lib/python3.10/site-packages/torchdata/stateful_dataloader/stateful_dataloader.py", line 406, in __iter__
    self._iterator = self._get_iterator()
  File "/usr/local/miniconda3/envs/logicRL/lib/python3.10/site-packages/torchdata/stateful_dataloader/stateful_dataloader.py", line 387, in _get_iterator
    it = _StatefulMultiProcessingDataLoaderIter(self, self.next_iter_state)
  File "/usr/local/miniconda3/envs/logicRL/lib/python3.10/site-packages/torchdata/stateful_dataloader/stateful_dataloader.py", line 1049, in __init__
    self._reset(loader, first_iter=True, prime_prefetch=next_iter_state is None)
  File "/usr/local/miniconda3/envs/logicRL/lib/python3.10/site-packages/torchdata/stateful_dataloader/stateful_dataloader.py", line 1130, in _reset
    _, data = self._get_data()
  File "/usr/local/miniconda3/envs/logicRL/lib/python3.10/site-packages/torchdata/stateful_dataloader/stateful_dataloader.py", line 1379, in _get_data
    success, data = self._try_get_data()
  File "/usr/local/miniconda3/envs/logicRL/lib/python3.10/site-packages/torchdata/stateful_dataloader/stateful_dataloader.py", line 1228, in _try_get_data
    raise RuntimeError(f"DataLoader worker (pid(s) {pids_str}) exited unexpectedly") from e
RuntimeError: DataLoader worker (pid(s) 3052708) exited unexpectedly
```
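The final `RuntimeError` hides the real cause because the worker dies in a subprocess. Since rerunning with `num_workers=0` was already tried above, the usual next step is to check shared memory and OOM kills. A minimal sketch of those checks, using standard Linux tools (`dmesg` may need elevated privileges inside a container):

```bash
# Quick checks when a DataLoader worker "exits unexpectedly" and the real
# error is lost to multiprocessing.
df -h /dev/shm     # Docker defaults /dev/shm to 64 MB; shared-memory tensors need more
free -h            # overall memory headroom; an OOM-killed worker raises exactly this error
dmesg | grep -iE 'killed process|out of memory' | tail -n 20   # kernel OOM-kill records
```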

zhaojiahemiaomiaomiao avatar Mar 24 '25 08:03 zhaojiahemiaomiaomiao

Same issue here; have you solved it?

Qianvenh avatar Sep 03 '25 01:09 Qianvenh

Same issue here; have you solved it?

PolarisLiu1 avatar Sep 28 '25 11:09 PolarisLiu1

I was running this training inside a Docker container and hit the same issue. I solved it by increasing /dev/shm from 64 MB to 512 GB (64 MB is the default size set by Docker).
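A hedged sketch of that fix, assuming the container is started with `docker run` (the image name, training command, and the 512g figure are placeholders; Compose users can set `shm_size` in the service definition instead):

```bash
# Give the container a larger /dev/shm than Docker's 64 MB default.
# <your-image>, <your-training-command>, and 512g are placeholders.
docker run --gpus all --shm-size=512g <your-image> <your-training-command>

# For an already-running container with sufficient privileges,
# /dev/shm is a tmpfs and can be resized in place:
mount -o remount,size=512g /dev/shm
```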

gaolei-he avatar Nov 13 '25 01:11 gaolei-he