[DataLoader worker exited unexpectedly] How should I set these parameters to avoid thread-conflict issues?
I am encountering an issue related to thread conflicts while running my training script, and I would greatly appreciate any assistance or insights the community could provide.
My Training Environment: 8 × A800 (80 GB)
Key Parameters Configuration:
- Model: Qwen2.5-7B
- max_prompt_length=$((1024 * 2))
- max_response_length=$((1024 * 10))
- sp_size=4
- use_dynamic_bsz=True
- offload=True
- gen_tp=4
- trainer.n_gpus_per_node=8
- ray.remote(num_cpus=1)
- data_loader(num_workers=2), also tried num_workers=0
Could these parameter settings be causing the thread conflict issues I'm experiencing? If so, how should I adjust these parameters to resolve this problem?
Thank you very much for your time and help!
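For reference, below is a minimal, self-contained sketch (using a hypothetical DummyDataset rather than the actual verl dataset) of the kind of standalone check I run to get a clearer error: as the traceback itself notes, num_workers=0 raises the real exception in the main process instead of the opaque "worker exited unexpectedly".

```python
# Standalone reproduction sketch (DummyDataset is hypothetical, not the verl dataset):
# with num_workers=0 the real exception is raised in the main process,
# instead of the opaque "DataLoader worker exited unexpectedly".
import torch
from torch.utils.data import Dataset, DataLoader


class DummyDataset(Dataset):
    """Stand-in for the real RL dataset; returns fixed-size token tensors."""

    def __len__(self):
        return 128

    def __getitem__(self, idx):
        # Roughly max_prompt_length + max_response_length tokens per sample.
        return {"input_ids": torch.zeros(12 * 1024, dtype=torch.long)}


if __name__ == "__main__":
    for num_workers in (0, 2):  # try 0 first: failures surface with a full traceback
        loader = DataLoader(DummyDataset(), batch_size=4, num_workers=num_workers)
        for _ in loader:
            pass
        print(f"num_workers={num_workers}: OK")
```

The full traceback from the failing run is below: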
```
...
(main_task pid=3042881) "'val/math_dapo/acc/maj@16/std': 0.007240567076160023, "
(main_task pid=3042881) "'val/math_dapo/acc/best@32/mean': 0.04183333333333333, "
(main_task pid=3042881) "'val/math_dapo/acc/best@32/std': 0.03223090754862865, "
(main_task pid=3042881) "'val/math_dapo/acc/worst@32/mean': 0.0, 'val/math_dapo/acc/worst@32/std': "
(main_task pid=3042881) "0.0, 'val/math_dapo/acc/maj@32/mean': 0.00030000000000000003, "
(main_task pid=3042881) "'val/math_dapo/acc/maj@32/std': 0.004023039702127951}")
(main_task pid=3042881) step:0 - val/math_dapo/score/mean@32:-0.996 - val/math_dapo/score/std@32:0.023 - val/math_dapo/score/best@2/mean:-0.993 - val/math_dapo/score/best@2/std:0.030 - val/math_dapo/score/worst@2/mean:-1.000 - val/math_dapo/score/worst@2/std:0.002 - val/math_dapo/score/maj@2/mean:-0.996 - val/math_dapo/score/maj@2/std:0.023 - val/math_dapo/score/best@4/mean:-0.986 - val/math_dapo/score/best@4/std:0.041 - val/math_dapo/score/worst@4/mean:-1.000 - val/math_dapo/score/worst@4/std:0.000 - val/math_dapo/score/maj@4/mean:-0.996 - val/math_dapo/score/maj@4/std:0.022 - val/math_dapo/score/best@8/mean:-0.973 - val/math_dapo/score/best@8/std:0.053 - val/math_dapo/score/worst@8/mean:-1.000 - val/math_dapo/score/worst@8/std:0.000 - val/math_dapo/score/maj@8/mean:-0.998 - val/math_dapo/score/maj@8/std:0.017 - val/math_dapo/score/best@16/mean:-0.949 - val/math_dapo/score/best@16/std:0.065 - val/math_dapo/score/worst@16/mean:-1.000 - val/math_dapo/score/worst@16/std:0.000 - val/math_dapo/score/maj@16/mean:-0.998 - val/math_dapo/score/maj@16/std:0.014 - val/math_dapo/score/best@32/mean:-0.916 - val/math_dapo/score/best@32/std:0.064 - val/math_dapo/score/worst@32/mean:-1.000 - val/math_dapo/score/worst@32/std:0.000 - val/math_dapo/score/maj@32/mean:-0.999 - val/math_dapo/score/maj@32/std:0.008 - val/math_dapo/acc/mean@32:0.002 - val/math_dapo/acc/std@32:0.012 - val/math_dapo/acc/best@2/mean:0.004 - val/math_dapo/acc/best@2/std:0.015 - val/math_dapo/acc/worst@2/mean:0.000 - val/math_dapo/acc/worst@2/std:0.001 - val/math_dapo/acc/maj@2/mean:0.002 - val/math_dapo/acc/maj@2/std:0.011 - val/math_dapo/acc/best@4/mean:0.007 - val/math_dapo/acc/best@4/std:0.021 - val/math_dapo/acc/worst@4/mean:0.000 - val/math_dapo/acc/worst@4/std:0.000 - val/math_dapo/acc/maj@4/mean:0.002 - val/math_dapo/acc/maj@4/std:0.011 - val/math_dapo/acc/best@8/mean:0.013 - val/math_dapo/acc/best@8/std:0.027 - val/math_dapo/acc/worst@8/mean:0.000 - val/math_dapo/acc/worst@8/std:0.000 - val/math_dapo/acc/maj@8/mean:0.001 - val/math_dapo/acc/maj@8/std:0.008 - val/math_dapo/acc/best@16/mean:0.026 - val/math_dapo/acc/best@16/std:0.032 - val/math_dapo/acc/worst@16/mean:0.000 - val/math_dapo/acc/worst@16/std:0.000 - val/math_dapo/acc/maj@16/mean:0.001 - val/math_dapo/acc/maj@16/std:0.007 - val/math_dapo/acc/best@32/mean:0.042 - val/math_dapo/acc/best@32/std:0.032 - val/math_dapo/acc/worst@32/mean:0.000 - val/math_dapo/acc/worst@32/std:0.000 - val/math_dapo/acc/maj@32/mean:0.000 - val/math_dapo/acc/maj@32/std:0.004
Traceback (most recent call last):
File "/usr/local/miniconda3/envs/logicRL/lib/python3.10/runpy.py", line 196, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/usr/local/miniconda3/envs/logicRL/lib/python3.10/runpy.py", line 86, in _run_code
exec(code, run_globals)
File "/mnt/tenant-home_speed/zjh/verl-gm-tyx-puffin-main/verl/trainer/main_ppo.py", line 201, in <module>
File "/usr/local/miniconda3/envs/logicRL/lib/python3.10/site-packages/hydra/main.py", line 94, in decorated_main
_run_hydra(
File "/usr/local/miniconda3/envs/logicRL/lib/python3.10/site-packages/hydra/_internal/utils.py", line 394, in _run_hydra
_run_app(
File "/usr/local/miniconda3/envs/logicRL/lib/python3.10/site-packages/hydra/_internal/utils.py", line 457, in _run_app
run_and_report(
File "/usr/local/miniconda3/envs/logicRL/lib/python3.10/site-packages/hydra/_internal/utils.py", line 223, in run_and_report
raise ex
File "/usr/local/miniconda3/envs/logicRL/lib/python3.10/site-packages/hydra/_internal/utils.py", line 220, in run_and_report
return func()
File "/usr/local/miniconda3/envs/logicRL/lib/python3.10/site-packages/hydra/_internal/utils.py", line 458, in <lambda>
lambda: hydra.run(
File "/usr/local/miniconda3/envs/logicRL/lib/python3.10/site-packages/hydra/_internal/hydra.py", line 132, in run
_ = ret.return_value
File "/usr/local/miniconda3/envs/logicRL/lib/python3.10/site-packages/hydra/core/utils.py", line 260, in return_value
raise self._return_value
File "/usr/local/miniconda3/envs/logicRL/lib/python3.10/site-packages/hydra/core/utils.py", line 186, in run_job
ret.return_value = task_function(task_cfg)
File "/mnt/tenant-home_speed/zjh/verl-gm-tyx-puffin-main/verl/trainer/main_ppo.py", line 68, in main
File "/mnt/tenant-home_speed/zjh/verl-gm-tyx-puffin-main/verl/trainer/main_ppo.py", line 90, in run_ppo
File "/usr/local/miniconda3/envs/logicRL/lib/python3.10/site-packages/ray/_private/auto_init_hook.py", line 21, in auto_init_wrapper
return fn(*args, **kwargs)
File "/usr/local/miniconda3/envs/logicRL/lib/python3.10/site-packages/ray/_private/client_mode_hook.py", line 103, in wrapper
return func(*args, **kwargs)
File "/usr/local/miniconda3/envs/logicRL/lib/python3.10/site-packages/ray/_private/worker.py", line 2659, in get
values, debugger_breakpoint = worker.get_objects(object_refs, timeout=timeout)
File "/usr/local/miniconda3/envs/logicRL/lib/python3.10/site-packages/ray/_private/worker.py", line 871, in get_objects
raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(RuntimeError): ray::main_task() (pid=3042881, ip=192.169.76.90)
File "/usr/local/miniconda3/envs/logicRL/lib/python3.10/multiprocessing/queues.py", line 113, in get
if not self._poll(timeout):
File "/usr/local/miniconda3/envs/logicRL/lib/python3.10/multiprocessing/connection.py", line 257, in poll
return self._poll(timeout)
File "/usr/local/miniconda3/envs/logicRL/lib/python3.10/multiprocessing/connection.py", line 424, in _poll
r = wait([self], timeout)
File "/usr/local/miniconda3/envs/logicRL/lib/python3.10/multiprocessing/connection.py", line 931, in wait
ready = selector.select(timeout)
File "/usr/local/miniconda3/envs/logicRL/lib/python3.10/selectors.py", line 416, in select
fd_event_list = self._selector.poll(timeout)
File "/usr/local/miniconda3/envs/logicRL/lib/python3.10/site-packages/torch/utils/data/_utils/signal_handling.py", line 67, in handler
_error_if_any_worker_fails()
RuntimeError: DataLoader worker (pid 3052708) exited unexpectedly with exit code 1. Details are lost due to multiprocessing. Rerunning with num_workers=0 may give better error trace.
The above exception was the direct cause of the following exception:
ray::main_task() (pid=3042881, ip=192.169.76.90)
File "/verl-gm-tyx-puffin-main/verl/trainer/main_ppo.py", line 197, in main_task
File "/verl-gm-tyx-puffin-main/verl/trainer/ppo/ray_trainer.py", line 861, in fit
for batch_dict in self.train_dataloader:
File "/usr/local/miniconda3/envs/logicRL/lib/python3.10/site-packages/torchdata/stateful_dataloader/stateful_dataloader.py", line 406, in __iter__
self._iterator = self._get_iterator()
File "/usr/local/miniconda3/envs/logicRL/lib/python3.10/site-packages/torchdata/stateful_dataloader/stateful_dataloader.py", line 387, in _get_iterator
it = _StatefulMultiProcessingDataLoaderIter(self, self.next_iter_state)
File "/usr/local/miniconda3/envs/logicRL/lib/python3.10/site-packages/torchdata/stateful_dataloader/stateful_dataloader.py", line 1049, in __init__
self._reset(loader, first_iter=True, prime_prefetch=next_iter_state is None)
File "/usr/local/miniconda3/envs/logicRL/lib/python3.10/site-packages/torchdata/stateful_dataloader/stateful_dataloader.py", line 1130, in _reset
_, data = self._get_data()
File "/usr/local/miniconda3/envs/logicRL/lib/python3.10/site-packages/torchdata/stateful_dataloader/stateful_dataloader.py", line 1379, in _get_data
success, data = self._try_get_data()
File "/usr/local/miniconda3/envs/logicRL/lib/python3.10/site-packages/torchdata/stateful_dataloader/stateful_dataloader.py", line 1228, in _try_get_data
raise RuntimeError(f"DataLoader worker (pid(s) {pids_str}) exited unexpectedly") from e
RuntimeError: DataLoader worker (pid(s) 3052708) exited unexpectedly
```
same issue, have you solved it?
I was running this training inside Docker and hit the same issue. I solved it by increasing /dev/shm from 64M (the default size set by Docker) to 512G.
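In case it helps others, here is a small sketch (the 1 GiB threshold is just illustrative) for checking the shared-memory mount from inside the container. DataLoader workers with num_workers > 0 exchange batches through shared memory under /dev/shm, so Docker's 64M default is easy to exhaust with long sequences such as a 10k-token max_response_length.

```python
# Check the size of the shared-memory mount inside the container.
# PyTorch DataLoader workers (num_workers > 0) pass tensors via /dev/shm,
# so Docker's default 64M segment is quickly exhausted.
import shutil

total, used, free = shutil.disk_usage("/dev/shm")
print(f"/dev/shm total: {total / 2**20:.0f} MiB, free: {free / 2**20:.0f} MiB")

if total < 2**30:  # anything under 1 GiB is almost certainly the Docker default
    print("Shared memory is very small; enlarge it or fall back to num_workers=0.")
```

The size can be raised by restarting the container with a larger shared-memory segment, e.g. `docker run --shm-size=512g ...` (or `--ipc=host` to share the host's /dev/shm); setting num_workers=0 also works around the crash, at the cost of slower data loading.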