
ray.exceptions.ActorDiedError: The actor died unexpectedly before finishing this task.

Open fengyang95 opened this issue 1 year ago • 26 comments

Traceback (most recent call last):
  File "", line 198, in _run_module_as_main
  File "", line 88, in _run_code
  File "/opt/tiger/verl/verl/trainer/main_ppo.py", line 130, in <module>
    main()
  File "/home/tiger/.pyenv/versions/3.11.2/lib/python3.11/site-packages/hydra/main.py", line 94, in decorated_main
    _run_hydra(
  File "/home/tiger/.pyenv/versions/3.11.2/lib/python3.11/site-packages/hydra/_internal/utils.py", line 394, in _run_hydra
    _run_app(
  File "/home/tiger/.pyenv/versions/3.11.2/lib/python3.11/site-packages/hydra/_internal/utils.py", line 457, in _run_app
    run_and_report(
  File "/home/tiger/.pyenv/versions/3.11.2/lib/python3.11/site-packages/hydra/_internal/utils.py", line 223, in run_and_report
    raise ex
  File "/home/tiger/.pyenv/versions/3.11.2/lib/python3.11/site-packages/hydra/_internal/utils.py", line 220, in run_and_report
    return func()
  File "/home/tiger/.pyenv/versions/3.11.2/lib/python3.11/site-packages/hydra/_internal/utils.py", line 458, in <lambda>
    lambda: hydra.run(
  File "/home/tiger/.pyenv/versions/3.11.2/lib/python3.11/site-packages/hydra/_internal/hydra.py", line 132, in run
    _ = ret.return_value
  File "/home/tiger/.pyenv/versions/3.11.2/lib/python3.11/site-packages/hydra/core/utils.py", line 260, in return_value
    raise self._return_value
  File "/home/tiger/.pyenv/versions/3.11.2/lib/python3.11/site-packages/hydra/core/utils.py", line 186, in run_job
    ret.return_value = task_function(task_cfg)
  File "/opt/tiger/verl/verl/trainer/main_ppo.py", line 25, in main
    run_ppo(config)
  File "/opt/tiger/verl/verl/trainer/main_ppo.py", line 33, in run_ppo
    ray.get(main_task.remote(config, compute_score))
  File "/home/tiger/.pyenv/versions/3.11.2/lib/python3.11/site-packages/ray/_private/auto_init_hook.py", line 21, in auto_init_wrapper
    return fn(*args, **kwargs)
  File "/home/tiger/.pyenv/versions/3.11.2/lib/python3.11/site-packages/ray/_private/client_mode_hook.py", line 103, in wrapper
    return func(*args, **kwargs)
  File "/home/tiger/.pyenv/versions/3.11.2/lib/python3.11/site-packages/ray/_private/worker.py", line 2772, in get
    values, debugger_breakpoint = worker.get_objects(object_refs, timeout=timeout)
  File "/home/tiger/.pyenv/versions/3.11.2/lib/python3.11/site-packages/ray/_private/worker.py", line 919, in get_objects
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(ActorDiedError): ray::main_task() (pid=635457, ip=127.0.0.1)
  File "/opt/tiger/verl/verl/trainer/main_ppo.py", line 126, in main_task
    trainer.fit()
  File "/opt/tiger/verl/verl/trainer/ppo/ray_trainer.py", line 862, in fit
    val_metrics = self._validate()
  File "/opt/tiger/verl/verl/trainer/ppo/ray_trainer.py", line 631, in _validate
    test_output_gen_batch_padded = self.actor_rollout_wg.generate_sequences(test_gen_batch_padded)
  File "/opt/tiger/verl/verl/single_controller/ray/base.py", line 42, in func
    output = ray.get(output)
ray.exceptions.ActorDiedError: The actor died unexpectedly before finishing this task.
    class_name: create_colocated_worker_cls..WorkerDict
    actor_id: 51333b5b40d3feca28206af601000000
    pid: 645217
    name: 0o0QHzWorkerDict_0:6
    namespace: 9b824d07-ee25-46ef-bdd4-4c993aab9272
    ip: 127.0.0.1
The actor is dead because its worker process has died.
Worker exit type: SYSTEM_ERROR Worker exit detail: Worker unexpectedly exits with a connection error code 2. End of file. There are some potential root causes. (1) The process is killed by SIGKILL by OOM killer due to high memory usage. (2) ray stop --force is called. (3) The worker is crashed unexpectedly due to SIGSEGV or other unexpected errors.
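
The first root cause listed above, the OOM killer, can usually be confirmed from the host logs. A minimal sketch, assuming shell access to the node where the worker died (exact commands depend on the host OS):

# Look for the kernel OOM killer having terminated a Ray worker / Python process.
dmesg -T | grep -iE "killed process|out of memory" | tail -n 20

# On systemd hosts, the kernel journal carries the same information.
sudo journalctl -k --since "1 hour ago" | grep -i "out of memory"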

fengyang95 avatar Feb 25 '25 13:02 fengyang95

same here

asirgogogo avatar Feb 27 '25 11:02 asirgogogo

I also have the same error. I found the training stuck in actor_rollout_compute_log_prob before the actor died.

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A   2859585      C   ...Dict.actor_rollout_compute_log_prob      20304MiB |
|    1   N/A  N/A   2861969      C   ...Dict.actor_rollout_compute_log_prob      20402MiB |
|    2   N/A  N/A   2861970      C   ...Dict.actor_rollout_compute_log_prob      20402MiB |
|    3   N/A  N/A   2861971      C   ...Dict.actor_rollout_compute_log_prob      20402MiB |
|    4   N/A  N/A   2861972      C   ...Dict.actor_rollout_compute_log_prob      20402MiB |
|    5   N/A  N/A   2861973      C   ...Dict.actor_rollout_compute_log_prob      20402MiB |
|    6   N/A  N/A   2861974      C   ...Dict.actor_rollout_compute_log_prob      20398MiB |
|    7   N/A  N/A   2861975      C   ...Dict.actor_rollout_compute_log_prob      19918MiB |

ChaosCodes avatar Feb 28 '25 02:02 ChaosCodes

> I also have the same error. I found the training stuck in actor_rollout_compute_log_prob before the actor died. [nvidia-smi output omitted; quoted from the comment above]

same problem.

yenanjing avatar Feb 28 '25 06:02 yenanjing

> I also have the same error. I found the training stuck in actor_rollout_compute_log_prob before the actor died. [nvidia-smi output omitted; quoted from the comment above]

I fixed it by setting actor_rollout_ref.rollout.log_prob_micro_batch_size large enough that log_prob_micro_batch_size // world_size != 0 (i.e., at least one sample per GPU), as mentioned in https://github.com/volcengine/verl/issues/12#issuecomment-2473822765
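
For illustration, a sketch of what this override amounts to, assuming an 8-GPU single-node run (the value itself depends on your GPU memory):

# As described above, the per-GPU micro batch comes out to
# log_prob_micro_batch_size // world_size, so the global value must be at least
# the number of GPUs:
#   log_prob_micro_batch_size=8 on 8 GPUs -> 8 // 8 = 1 sample per GPU (works)
#   log_prob_micro_batch_size=4 on 8 GPUs -> 4 // 8 = 0 samples per GPU (can hang)
    actor_rollout_ref.rollout.log_prob_micro_batch_size=8 \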

yenanjing avatar Feb 28 '25 08:02 yenanjing

> I also have the same error. I found the training stuck in actor_rollout_compute_log_prob before the actor died. [nvidia-smi output omitted; quoted from the comment above]
>
> I fixed it by setting actor_rollout_ref.rollout.log_prob_micro_batch_size large enough that log_prob_micro_batch_size // world_size != 0, as mentioned in #12 (comment)

I attempted to modify log_prob_micro_batch_size, but the training still gets stuck at actor_rollout_compute_log_prob.

I suspect the issue might be related to ulysses_sequence_parallel_size. When I set ulysses_sequence_parallel_size=8, the training gets stuck at actor_rollout_compute_log_prob. However, when I set ulysses_sequence_parallel_size=4, the training no longer gets stuck at actor_rollout_compute_log_prob, but it will sometimes result in an out-of-memory (OOM) error during the actor update.

ChaosCodes avatar Feb 28 '25 18:02 ChaosCodes

same error

muxixi727 avatar Mar 01 '25 14:03 muxixi727

same

Gxy-2001 avatar Mar 06 '25 09:03 Gxy-2001

same here, waiting for an official solution

Yukang-Lin avatar Mar 07 '25 06:03 Yukang-Lin

same error

yushuiwx avatar Mar 21 '25 04:03 yushuiwx

same

BUGLI27 avatar Mar 28 '25 06:03 BUGLI27

The error message is a general one from ray. Any failed ray job may start with ray.exceptions.ActorDiedError: The actor died unexpectedly before finishing this task. It is important to look at the trace related to verl and where it failed.

Setting log_prob_micro_batch_size_per_gpu is recommended so that there is at least one sample per GPU. If that value is set and you still observe training hangs or other errors, please provide a script to reproduce the issue.
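
A minimal sketch of the corresponding overrides (key names as they appear in verl's PPO config to the best of my understanding; the value 1 is only illustrative):

# Per-GPU micro batch sizes avoid the world-size divisibility issue entirely:
# each GPU processes this many samples per forward pass, regardless of GPU count.
    actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu=1 \
    actor_rollout_ref.ref.log_prob_micro_batch_size_per_gpu=1 \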

eric-haibin-lin avatar Apr 04 '25 20:04 eric-haibin-lin

> The error message is a general one from ray. Any failed ray job may start with ray.exceptions.ActorDiedError: The actor died unexpectedly before finishing this task. It is important to look at the trace related to verl and where it failed.
>
> Setting log_prob_micro_batch_size_per_gpu is recommended so that there is at least one sample per GPU. If that value is set and you still observe training hangs or other errors, please provide a script to reproduce the issue.

We are currently testing multi-machine, multi-GPU training with two machines, each equipped with a single 40G A100 GPU. We plan to train and test with Qwen2.5-0.5B. The same issue occurred.

We modified the script, and after configuring micro_batch_size_per_gpu the following error is displayed:

Could not override 'actor_rollout_ref.actor.micro_batch_size_per_gpu'.
To append to your config use +actor_rollout_ref.actor.micro_batch_size_per_gpu=1
Key 'micro_batch_size_per_gpu' is not in struct
    full_key: actor_rollout_ref.actor.micro_batch_size_per_gpu
    object_type=dict
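
For context, this is Hydra's standard error when an override key does not exist in the loaded config: a plain key=value can only override existing keys, while +key=value appends a new one. A hedged sketch of the two options (whether verl's actor config actually uses ppo_micro_batch_size_per_gpu rather than micro_batch_size_per_gpu depends on your verl version; check verl/trainer/config/ppo_trainer.yaml):

# Option 1: override the key that already exists in the config (likely name):
    actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=1 \
# Option 2: force-append a brand-new key (only useful if some code actually reads it):
    +actor_rollout_ref.actor.micro_batch_size_per_gpu=1 \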

script:

#!/usr/bin/env bash
set -xeuo pipefail

project_name='d1'
exp_name='DAPO-Qwen2.5-0.5B'

adv_estimator=grpo

use_kl_in_reward=False
kl_coef=0.0
use_kl_loss=False
kl_loss_coef=0.0

clip_ratio_low=0.2
clip_ratio_high=0.28

max_prompt_length=$((1024 * 2))
max_response_length=$((1024 * 20))
enable_overlong_buffer=True
overlong_buffer_len=$((1024 * 4))
overlong_penalty_factor=1.0

loss_agg_mode="token-mean"

enable_filter_groups=True
filter_groups_metric=acc
max_num_gen_batches=0
train_prompt_bsz=2
gen_prompt_bsz=$((train_prompt_bsz * 2))
n_resp_per_prompt=4
train_prompt_mini_bsz=1

# Ray
RAY_ADDRESS=${RAY_ADDRESS:-"http://11.220.*1.**6:8265"}
WORKING_DIR=${WORKING_DIR:-"${PWD}"}
RUNTIME_ENV=${RUNTIME_ENV:-"${WORKING_DIR}/verl/trainer/runtime_env.yaml"}
NNODES=${NNODES:-2}
# Paths
RAY_DATA_HOME=${RAY_DATA_HOME:-"${HOME}/code/verl"}
MODEL_PATH=${MODEL_PATH:-"${RAY_DATA_HOME}/models/Qwen2.5-0.5B"}
CKPTS_DIR=${CKPTS_DIR:-"${RAY_DATA_HOME}/ckpts/${project_name}/${exp_name}"}
TRAIN_FILE=${TRAIN_FILE:-"${RAY_DATA_HOME}/data/dapo-math-17k.parquet"}
TEST_FILE=${TEST_FILE:-"${RAY_DATA_HOME}/data/aime-2024.parquet"}

# Algorithm
temperature=1.0
top_p=1.0
top_k=-1 # 0 for HF rollout, -1 for vLLM rollout
val_top_p=0.7

# Performance Related Parameter
sp_size=2
use_dynamic_bsz=False # change
actor_ppo_max_token_len=$((max_prompt_length + max_response_length))
infer_ppo_max_token_len=$((max_prompt_length + max_response_length))
offload=True
gen_tp=1

ray job submit --no-wait --runtime-env="${RUNTIME_ENV}" \
    --working-dir "${WORKING_DIR}" \
    -- python3 -m recipe.dapo.src.main_dapo \
    data.train_files="${TRAIN_FILE}" \
    data.val_files="${TEST_FILE}" \
    data.prompt_key=prompt \
    data.truncation='left' \
    data.max_prompt_length=${max_prompt_length} \
    data.max_response_length=${max_response_length} \
    data.gen_batch_size=${gen_prompt_bsz} \
    data.train_batch_size=${train_prompt_bsz} \
    actor_rollout_ref.rollout.n=${n_resp_per_prompt} \
    algorithm.adv_estimator=${adv_estimator} \
    algorithm.use_kl_in_reward=${use_kl_in_reward} \
    algorithm.kl_ctrl.kl_coef=${kl_coef} \
    actor_rollout_ref.actor.use_kl_loss=${use_kl_loss} \
    actor_rollout_ref.actor.kl_loss_coef=${kl_loss_coef} \
    actor_rollout_ref.actor.clip_ratio_low=${clip_ratio_low} \
    actor_rollout_ref.actor.clip_ratio_high=${clip_ratio_high} \
    actor_rollout_ref.actor.clip_ratio_c=10.0 \
    algorithm.filter_groups.enable=${enable_filter_groups} \
    algorithm.filter_groups.max_num_gen_batches=${max_num_gen_batches} \
    algorithm.filter_groups.metric=${filter_groups_metric} \
    actor_rollout_ref.model.use_remove_padding=True \
    actor_rollout_ref.actor.use_dynamic_bsz=${use_dynamic_bsz} \
    actor_rollout_ref.ref.log_prob_use_dynamic_bsz=${use_dynamic_bsz} \
    actor_rollout_ref.rollout.log_prob_use_dynamic_bsz=${use_dynamic_bsz} \
    actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu=1 \
    actor_rollout_ref.actor.ppo_max_token_len_per_gpu=${actor_ppo_max_token_len} \
    actor_rollout_ref.ref.log_prob_max_token_len_per_gpu=${infer_ppo_max_token_len} \
    actor_rollout_ref.rollout.log_prob_max_token_len_per_gpu=${infer_ppo_max_token_len} \
    actor_rollout_ref.model.path="${MODEL_PATH}" \
    +actor_rollout_ref.model.override_config.attention_dropout=0. \
    +actor_rollout_ref.model.override_config.embd_pdrop=0. \
    +actor_rollout_ref.model.override_config.resid_pdrop=0. \
    +actor_rollout_ref.model.override_config.torch_dtype="float16" \
    actor_rollout_ref.model.enable_gradient_checkpointing=False \
    actor_rollout_ref.actor.optim.lr=1e-6 \
    actor_rollout_ref.actor.optim.lr_warmup_steps=10 \
    actor_rollout_ref.actor.optim.weight_decay=0.1 \
    actor_rollout_ref.actor.ppo_mini_batch_size=${train_prompt_mini_bsz} \
    actor_rollout_ref.actor.fsdp_config.param_offload=${offload} \
    actor_rollout_ref.actor.fsdp_config.optimizer_offload=${offload} \
    actor_rollout_ref.actor.entropy_coeff=0 \
    actor_rollout_ref.actor.grad_clip=1.0 \
    actor_rollout_ref.actor.loss_agg_mode=${loss_agg_mode} \
    actor_rollout_ref.actor.ulysses_sequence_parallel_size=${sp_size} \
    actor_rollout_ref.rollout.gpu_memory_utilization=0.80 \
    actor_rollout_ref.rollout.tensor_model_parallel_size=${gen_tp} \
    actor_rollout_ref.rollout.enable_chunked_prefill=True \
    actor_rollout_ref.rollout.max_num_batched_tokens=$((max_prompt_length + max_response_length)) \
    actor_rollout_ref.rollout.temperature=${temperature} \
    actor_rollout_ref.rollout.top_p=${top_p} \
    actor_rollout_ref.rollout.top_k="${top_k}" \
    actor_rollout_ref.rollout.val_kwargs.temperature=${temperature} \
    actor_rollout_ref.rollout.val_kwargs.top_p=${val_top_p} \
    actor_rollout_ref.rollout.val_kwargs.top_k=${top_k} \
    actor_rollout_ref.rollout.val_kwargs.do_sample=True \
    actor_rollout_ref.rollout.val_kwargs.n=1 \
    actor_rollout_ref.ref.fsdp_config.param_offload=${offload} \
    actor_rollout_ref.ref.ulysses_sequence_parallel_size=${sp_size} \
    actor_rollout_ref.actor.fsdp_config.fsdp_size=-1 \
    reward_model.reward_manager=dapo \
    reward_model.overlong_buffer.enable=${enable_overlong_buffer} \
    reward_model.overlong_buffer.len=${overlong_buffer_len} \
    reward_model.overlong_buffer.penalty_factor=${overlong_penalty_factor} \
    trainer.logger=['console','swanlab'] \
    trainer.project_name="${project_name}" \
    trainer.experiment_name="${exp_name}" \
    trainer.n_gpus_per_node=1 \
    trainer.nnodes=2 \
    trainer.val_before_train=True \
    trainer.test_freq=5 \
    trainer.save_freq=5 \
    trainer.total_epochs=1 \
    trainer.default_local_dir="${CKPTS_DIR}" \
    trainer.resume_mode=auto

Togetabetterplace avatar May 06 '25 04:05 Togetabetterplace

Same problem. Any progress?

Necolizer avatar May 29 '25 14:05 Necolizer

same problem

141forever avatar Jun 18 '25 12:06 141forever

I finally found that it was a cluster communication issue. You can test it by running vLLM inference across multiple machines; if that works, the cluster communication is fine.
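
As a rough sketch of such a smoke test, assuming two nodes and the Qwen2.5-0.5B model from the script above (exact vLLM flags vary by version; <head_node_ip> is a placeholder):

# 1) Form a Ray cluster spanning both nodes.
ray start --head --port=6379                # on the head node
ray start --address=<head_node_ip>:6379     # on the second node

# 2) Launch vLLM with tensor parallelism spanning both nodes and send a request;
#    if generation succeeds, basic cross-node communication (Ray + NCCL) is healthy.
vllm serve Qwen/Qwen2.5-0.5B --tensor-parallel-size 2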

Togetabetterplace avatar Jun 18 '25 13:06 Togetabetterplace

> I finally found that it was a cluster communication issue. You can test it by running vLLM inference across multiple machines; if that works, the cluster communication is fine.

Thank you. I am using sglang-multiturn, so in my case it must be an sglang issue (oh no!!!).

141forever avatar Jun 18 '25 15:06 141forever

Encountered a similar issue on a single-node, multi-GPU server. It appears to be intermittent and can occur with different models. It arises after the model has been training for an unpredictable amount of time (ranging from a few hours to several dozen hours), which makes it difficult to diagnose. I checked the monitoring logs and did not observe any signs of GPU memory exhaustion.

The error messages are as follows:

ray.exceptions.ActorUnavailableError: The actor d47cd6db4476d07641738b0901000000 is unavailable: The actor is temporarily unavailable: RpcError: RPC Error message: keepalive watchdog timeout; RPC Error details: rpc_code: 14. The task may or may not have been executed on the actor.

Or:

The actor is dead because its owner has died.
    Owner Id: 01000000ffffffffffffffffffffffffffffffffffffffffffffffff
    Owner IP address: dlc4qt154rmg0wx0-master-0
    Owner worker exit type: SYSTEM_ERROR
    Worker exit detail: Owner's node has crashed.

xiaobanni avatar Jul 01 '25 03:07 xiaobanni

I met the same error. In my case the root cause was that CPU memory filled up because of the reward calculation, which executes programs in a sandbox.
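
A minimal sketch of how one might watch for this on the training node before the OOM killer fires (the interval and process count are arbitrary):

# Periodically log overall memory usage and the top resident-memory processes,
# so a slow leak from reward/sandbox subprocesses becomes visible in the logs.
while true; do
    date
    free -h
    ps -eo pid,rss,comm --sort=-rss | head -n 10
    sleep 60
done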

rong-hash avatar Jul 17 '25 02:07 rong-hash

I also encountered the same problem. May I ask how you solved it?

yuqikong avatar Jul 24 '25 11:07 yuqikong

Same error

pepsi2222 avatar Aug 04 '25 02:08 pepsi2222

any progess?

thu-unicorn avatar Aug 29 '25 07:08 thu-unicorn

https://www.aidoczh.com/ray/ray-core/patterns/ray-get-too-many-objects.html. I am not sure whether this is the cause of the problem.

xuelei123-xgb avatar Oct 24 '25 11:10 xuelei123-xgb

same error

StoKou avatar Nov 03 '25 02:11 StoKou

This also happened while running sglang multiturn. Is there a solution yet?

StoKou avatar Nov 03 '25 02:11 StoKou

same issue

Fu-Fu-Fu-Fu avatar Nov 03 '25 10:11 Fu-Fu-Fu-Fu

same issue

AdAstraAbyssoque avatar Nov 23 '25 02:11 AdAstraAbyssoque