
ray.exceptions.ActorDiedError: The actor died unexpectedly before finishing this task.

Open fengyang95 opened this issue 1 year ago • 26 comments

Traceback (most recent call last):
  File "", line 198, in _run_module_as_main
  File "", line 88, in _run_code
  File "/opt/tiger/verl/verl/trainer/main_ppo.py", line 130, in <module>
    main()
  File "/home/tiger/.pyenv/versions/3.11.2/lib/python3.11/site-packages/hydra/main.py", line 94, in decorated_main
    _run_hydra(
  File "/home/tiger/.pyenv/versions/3.11.2/lib/python3.11/site-packages/hydra/_internal/utils.py", line 394, in _run_hydra
    _run_app(
  File "/home/tiger/.pyenv/versions/3.11.2/lib/python3.11/site-packages/hydra/_internal/utils.py", line 457, in _run_app
    run_and_report(
  File "/home/tiger/.pyenv/versions/3.11.2/lib/python3.11/site-packages/hydra/_internal/utils.py", line 223, in run_and_report
    raise ex
  File "/home/tiger/.pyenv/versions/3.11.2/lib/python3.11/site-packages/hydra/_internal/utils.py", line 220, in run_and_report
    return func()
  File "/home/tiger/.pyenv/versions/3.11.2/lib/python3.11/site-packages/hydra/_internal/utils.py", line 458, in <lambda>
    lambda: hydra.run(
  File "/home/tiger/.pyenv/versions/3.11.2/lib/python3.11/site-packages/hydra/_internal/hydra.py", line 132, in run
    _ = ret.return_value
  File "/home/tiger/.pyenv/versions/3.11.2/lib/python3.11/site-packages/hydra/core/utils.py", line 260, in return_value
    raise self._return_value
  File "/home/tiger/.pyenv/versions/3.11.2/lib/python3.11/site-packages/hydra/core/utils.py", line 186, in run_job
    ret.return_value = task_function(task_cfg)
  File "/opt/tiger/verl/verl/trainer/main_ppo.py", line 25, in main
    run_ppo(config)
  File "/opt/tiger/verl/verl/trainer/main_ppo.py", line 33, in run_ppo
    ray.get(main_task.remote(config, compute_score))
  File "/home/tiger/.pyenv/versions/3.11.2/lib/python3.11/site-packages/ray/_private/auto_init_hook.py", line 21, in auto_init_wrapper
    return fn(*args, **kwargs)
  File "/home/tiger/.pyenv/versions/3.11.2/lib/python3.11/site-packages/ray/_private/client_mode_hook.py", line 103, in wrapper
    return func(*args, **kwargs)
  File "/home/tiger/.pyenv/versions/3.11.2/lib/python3.11/site-packages/ray/_private/worker.py", line 2772, in get
    values, debugger_breakpoint = worker.get_objects(object_refs, timeout=timeout)
  File "/home/tiger/.pyenv/versions/3.11.2/lib/python3.11/site-packages/ray/_private/worker.py", line 919, in get_objects
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(ActorDiedError): ray::main_task() (pid=635457, ip=127.0.0.1)
  File "/opt/tiger/verl/verl/trainer/main_ppo.py", line 126, in main_task
    trainer.fit()
  File "/opt/tiger/verl/verl/trainer/ppo/ray_trainer.py", line 862, in fit
    val_metrics = self._validate()
  File "/opt/tiger/verl/verl/trainer/ppo/ray_trainer.py", line 631, in _validate
    test_output_gen_batch_padded = self.actor_rollout_wg.generate_sequences(test_gen_batch_padded)
  File "/opt/tiger/verl/verl/single_controller/ray/base.py", line 42, in func
    output = ray.get(output)
ray.exceptions.ActorDiedError: The actor died unexpectedly before finishing this task.
    class_name: create_colocated_worker_cls..WorkerDict
    actor_id: 51333b5b40d3feca28206af601000000
    pid: 645217
    name: 0o0QHzWorkerDict_0:6
    namespace: 9b824d07-ee25-46ef-bdd4-4c993aab9272
    ip: 127.0.0.1
The actor is dead because its worker process has died.
Worker exit type: SYSTEM_ERROR Worker exit detail: Worker unexpectedly exits with a connection error code 2. End of file. There are some potential root causes. (1) The process is killed by SIGKILL by OOM killer due to high memory usage. (2) ray stop --force is called. (3) The worker is crashed unexpectedly due to SIGSEGV or other unexpected errors.
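
The first root cause listed above, the OOM killer, can usually be confirmed from the host logs. A minimal sketch, assuming shell access to the node where the worker died (exact commands depend on the host OS):

# Look for the kernel OOM killer having terminated a Ray worker / Python process.
dmesg -T | grep -iE "killed process|out of memory" | tail -n 20

# On systemd hosts, the kernel journal carries the same information.
sudo journalctl -k --since "1 hour ago" | grep -i "out of memory"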

fengyang95 avatar Feb 25 '25 13:02 fengyang95

same here

asirgogogo avatar Feb 27 '25 11:02 asirgogogo

I also have the same error. I found the training stuck in actor_rollout_compute_log_prob before the actor died.

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A   2859585      C   ...Dict.actor_rollout_compute_log_prob      20304MiB |
|    1   N/A  N/A   2861969      C   ...Dict.actor_rollout_compute_log_prob      20402MiB |
|    2   N/A  N/A   2861970      C   ...Dict.actor_rollout_compute_log_prob      20402MiB |
|    3   N/A  N/A   2861971      C   ...Dict.actor_rollout_compute_log_prob      20402MiB |
|    4   N/A  N/A   2861972      C   ...Dict.actor_rollout_compute_log_prob      20402MiB |
|    5   N/A  N/A   2861973      C   ...Dict.actor_rollout_compute_log_prob      20402MiB |
|    6   N/A  N/A   2861974      C   ...Dict.actor_rollout_compute_log_prob      20398MiB |
|    7   N/A  N/A   2861975      C   ...Dict.actor_rollout_compute_log_prob      19918MiB |

ChaosCodes avatar Feb 28 '25 02:02 ChaosCodes

> I also have the same error. I found the training stuck in actor_rollout_compute_log_prob before the actor died. [nvidia-smi output omitted; quoted from the comment above]

same problem.

yenanjing avatar Feb 28 '25 06:02 yenanjing

> I also have the same error. I found the training stuck in actor_rollout_compute_log_prob before the actor died. [nvidia-smi output omitted; quoted from the comment above]

I fixed it by setting actor_rollout_ref.rollout.log_prob_micro_batch_size large enough that log_prob_micro_batch_size // world_size != 0 (i.e., at least one sample per GPU), as mentioned in https://github.com/volcengine/verl/issues/12#issuecomment-2473822765
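
For illustration, a sketch of what this override amounts to, assuming an 8-GPU single-node run (the value itself depends on your GPU memory):

# As described above, the per-GPU micro batch comes out to
# log_prob_micro_batch_size // world_size, so the global value must be at least
# the number of GPUs:
#   log_prob_micro_batch_size=8 on 8 GPUs -> 8 // 8 = 1 sample per GPU (works)
#   log_prob_micro_batch_size=4 on 8 GPUs -> 4 // 8 = 0 samples per GPU (can hang)
    actor_rollout_ref.rollout.log_prob_micro_batch_size=8 \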

yenanjing avatar Feb 28 '25 08:02 yenanjing

> I also have the same error. I found the training stuck in actor_rollout_compute_log_prob before the actor died. [nvidia-smi output omitted; quoted from the comment above]
>
> I fixed it by setting actor_rollout_ref.rollout.log_prob_micro_batch_size large enough that log_prob_micro_batch_size // world_size != 0, as mentioned in #12 (comment)

I attempted to modify log_prob_micro_batch_size, but the training still gets stuck at actor_rollout_compute_log_prob.

I suspect the issue might be related to ulysses_sequence_parallel_size. When I set ulysses_sequence_parallel_size=8, the training gets stuck at actor_rollout_compute_log_prob. However, when I set ulysses_sequence_parallel_size=4, the training no longer gets stuck at actor_rollout_compute_log_prob, but it will sometimes result in an out-of-memory (OOM) error during the actor update.

ChaosCodes avatar Feb 28 '25 18:02 ChaosCodes

same error

muxixi727 avatar Mar 01 '25 14:03 muxixi727

same

Gxy-2001 avatar Mar 06 '25 09:03 Gxy-2001

same here, waiting for an official solution

Yukang-Lin avatar Mar 07 '25 06:03 Yukang-Lin

same error

yushuiwx avatar Mar 21 '25 04:03 yushuiwx

same

BUGLI27 avatar Mar 28 '25 06:03 BUGLI27

The error message is a general one from ray. Any failed ray job may start with ray.exceptions.ActorDiedError: The actor died unexpectedly before finishing this task. It is important to look at the trace related to verl and where it failed.

Setting log_prob_micro_batch_size_per_gpu is recommended so that there is at least one sample per GPU. If that value is set and you still observe training hangs or other errors, please provide a script to reproduce the issue.
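
A minimal sketch of the corresponding overrides (key names as they appear in verl's PPO config to the best of my understanding; the value 1 is only illustrative):

# Per-GPU micro batch sizes avoid the world-size divisibility issue entirely:
# each GPU processes this many samples per forward pass, regardless of GPU count.
    actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu=1 \
    actor_rollout_ref.ref.log_prob_micro_batch_size_per_gpu=1 \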

eric-haibin-lin avatar Apr 04 '25 20:04 eric-haibin-lin

> The error message is a general one from ray. Any failed ray job may start with ray.exceptions.ActorDiedError: The actor died unexpectedly before finishing this task. It is important to look at the trace related to verl and where it failed.
>
> Setting log_prob_micro_batch_size_per_gpu is recommended so that there is at least one sample per GPU. If that value is set and you still observe training hangs or other errors, please provide a script to reproduce the issue.

We are currently testing multi-machine, multi-GPU training with two machines, each equipped with a single 40G A100 GPU. We plan to train and test with Qwen2.5-0.5B. The same issue occurred.

We modified the script, and after configuring micro_batch_size_per_gpu the following error is displayed:

Could not override 'actor_rollout_ref.actor.micro_batch_size_per_gpu'.
To append to your config use +actor_rollout_ref.actor.micro_batch_size_per_gpu=1
Key 'micro_batch_size_per_gpu' is not in struct
    full_key: actor_rollout_ref.actor.micro_batch_size_per_gpu
    object_type=dict
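
For context, this is Hydra's standard error when an override key does not exist in the loaded config: a plain key=value can only override existing keys, while +key=value appends a new one. A hedged sketch of the two options (whether verl's actor config actually uses ppo_micro_batch_size_per_gpu rather than micro_batch_size_per_gpu depends on your verl version; check verl/trainer/config/ppo_trainer.yaml):

# Option 1: override the key that already exists in the config (likely name):
    actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=1 \
# Option 2: force-append a brand-new key (only useful if some code actually reads it):
    +actor_rollout_ref.actor.micro_batch_size_per_gpu=1 \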

script:

#!/usr/bin/env bash
set -xeuo pipefail

project_name='d1'
exp_name='DAPO-Qwen2.5-0.5B'

adv_estimator=grpo

use_kl_in_reward=False
kl_coef=0.0
use_kl_loss=False
kl_loss_coef=0.0

clip_ratio_low=0.2
clip_ratio_high=0.28

max_prompt_length=$((1024 * 2))
max_response_length=$((1024 * 20))
enable_overlong_buffer=True
overlong_buffer_len=$((1024 * 4))
overlong_penalty_factor=1.0

loss_agg_mode="token-mean"

enable_filter_groups=True
filter_groups_metric=acc
max_num_gen_batches=0
train_prompt_bsz=2
gen_prompt_bsz=$((train_prompt_bsz * 2))
n_resp_per_prompt=4
train_prompt_mini_bsz=1

# Ray
RAY_ADDRESS=${RAY_ADDRESS:-"http://11.220.*1.**6:8265"}
WORKING_DIR=${WORKING_DIR:-"${PWD}"}
RUNTIME_ENV=${RUNTIME_ENV:-"${WORKING_DIR}/verl/trainer/runtime_env.yaml"}
NNODES=${NNODES:-2}
# Paths
RAY_DATA_HOME=${RAY_DATA_HOME:-"${HOME}/code/verl"}
MODEL_PATH=${MODEL_PATH:-"${RAY_DATA_HOME}/models/Qwen2.5-0.5B"}
CKPTS_DIR=${CKPTS_DIR:-"${RAY_DATA_HOME}/ckpts/${project_name}/${exp_name}"}
TRAIN_FILE=${TRAIN_FILE:-"${RAY_DATA_HOME}/data/dapo-math-17k.parquet"}
TEST_FILE=${TEST_FILE:-"${RAY_DATA_HOME}/data/aime-2024.parquet"}

# Algorithm
temperature=1.0
top_p=1.0
top_k=-1 # 0 for HF rollout, -1 for vLLM rollout
val_top_p=0.7

# Performance Related Parameter
sp_size=2
use_dynamic_bsz=False # change
actor_ppo_max_token_len=$((max_prompt_length + max_response_length))
infer_ppo_max_token_len=$((max_prompt_length + max_response_length))
offload=True
gen_tp=1

ray job submit --no-wait --runtime-env="${RUNTIME_ENV}" \
    --working-dir "${WORKING_DIR}" \
    -- python3 -m recipe.dapo.src.main_dapo \
    data.train_files="${TRAIN_FILE}" \
    data.val_files="${TEST_FILE}" \
    data.prompt_key=prompt \
    data.truncation='left' \
    data.max_prompt_length=${max_prompt_length} \
    data.max_response_length=${max_response_length} \
    data.gen_batch_size=${gen_prompt_bsz} \
    data.train_batch_size=${train_prompt_bsz} \
    actor_rollout_ref.rollout.n=${n_resp_per_prompt} \
    algorithm.adv_estimator=${adv_estimator} \
    algorithm.use_kl_in_reward=${use_kl_in_reward} \
    algorithm.kl_ctrl.kl_coef=${kl_coef} \
    actor_rollout_ref.actor.use_kl_loss=${use_kl_loss} \
    actor_rollout_ref.actor.kl_loss_coef=${kl_loss_coef} \
    actor_rollout_ref.actor.clip_ratio_low=${clip_ratio_low} \
    actor_rollout_ref.actor.clip_ratio_high=${clip_ratio_high} \
    actor_rollout_ref.actor.clip_ratio_c=10.0 \
    algorithm.filter_groups.enable=${enable_filter_groups} \
    algorithm.filter_groups.max_num_gen_batches=${max_num_gen_batches} \
    algorithm.filter_groups.metric=${filter_groups_metric} \
    actor_rollout_ref.model.use_remove_padding=True \
    actor_rollout_ref.actor.use_dynamic_bsz=${use_dynamic_bsz} \
    actor_rollout_ref.ref.log_prob_use_dynamic_bsz=${use_dynamic_bsz} \
    actor_rollout_ref.rollout.log_prob_use_dynamic_bsz=${use_dynamic_bsz} \
    actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu=1 \
    actor_rollout_ref.actor.ppo_max_token_len_per_gpu=${actor_ppo_max_token_len} \
    actor_rollout_ref.ref.log_prob_max_token_len_per_gpu=${infer_ppo_max_token_len} \
    actor_rollout_ref.rollout.log_prob_max_token_len_per_gpu=${infer_ppo_max_token_len} \
    actor_rollout_ref.model.path="${MODEL_PATH}" \
    +actor_rollout_ref.model.override_config.attention_dropout=0. \
    +actor_rollout_ref.model.override_config.embd_pdrop=0. \
    +actor_rollout_ref.model.override_config.resid_pdrop=0. \
    +actor_rollout_ref.model.override_config.torch_dtype="float16" \
    actor_rollout_ref.model.enable_gradient_checkpointing=False \
    actor_rollout_ref.actor.optim.lr=1e-6 \
    actor_rollout_ref.actor.optim.lr_warmup_steps=10 \
    actor_rollout_ref.actor.optim.weight_decay=0.1 \
    actor_rollout_ref.actor.ppo_mini_batch_size=${train_prompt_mini_bsz} \
    actor_rollout_ref.actor.fsdp_config.param_offload=${offload} \
    actor_rollout_ref.actor.fsdp_config.optimizer_offload=${offload} \
    actor_rollout_ref.actor.entropy_coeff=0 \
    actor_rollout_ref.actor.grad_clip=1.0 \
    actor_rollout_ref.actor.loss_agg_mode=${loss_agg_mode} \
    actor_rollout_ref.actor.ulysses_sequence_parallel_size=${sp_size} \
    actor_rollout_ref.rollout.gpu_memory_utilization=0.80 \
    actor_rollout_ref.rollout.tensor_model_parallel_size=${gen_tp} \
    actor_rollout_ref.rollout.enable_chunked_prefill=True \
    actor_rollout_ref.rollout.max_num_batched_tokens=$((max_prompt_length + max_response_length)) \
    actor_rollout_ref.rollout.temperature=${temperature} \
    actor_rollout_ref.rollout.top_p=${top_p} \
    actor_rollout_ref.rollout.top_k="${top_k}" \
    actor_rollout_ref.rollout.val_kwargs.temperature=${temperature} \
    actor_rollout_ref.rollout.val_kwargs.top_p=${val_top_p} \
    actor_rollout_ref.rollout.val_kwargs.top_k=${top_k} \
    actor_rollout_ref.rollout.val_kwargs.do_sample=True \
    actor_rollout_ref.rollout.val_kwargs.n=1 \
    actor_rollout_ref.ref.fsdp_config.param_offload=${offload} \
    actor_rollout_ref.ref.ulysses_sequence_parallel_size=${sp_size} \
    actor_rollout_ref.actor.fsdp_config.fsdp_size=-1 \
    reward_model.reward_manager=dapo \
    reward_model.overlong_buffer.enable=${enable_overlong_buffer} \
    reward_model.overlong_buffer.len=${overlong_buffer_len} \
    reward_model.overlong_buffer.penalty_factor=${overlong_penalty_factor} \
    trainer.logger=['console','swanlab'] \
    trainer.project_name="${project_name}" \
    trainer.experiment_name="${exp_name}" \
    trainer.n_gpus_per_node=1 \
    trainer.nnodes=2 \
    trainer.val_before_train=True \
    trainer.test_freq=5 \
    trainer.save_freq=5 \
    trainer.total_epochs=1 \
    trainer.default_local_dir="${CKPTS_DIR}" \
    trainer.resume_mode=auto

Togetabetterplace avatar May 06 '25 04:05 Togetabetterplace

Same problem. Any progress?

Necolizer avatar May 29 '25 14:05 Necolizer

same problem

141forever avatar Jun 18 '25 12:06 141forever

I finally found that it was a cluster communication issue. You can test it by running vLLM inference across multiple machines; if that works, the cluster communication is fine.
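
As a rough sketch of such a smoke test, assuming two nodes and the Qwen2.5-0.5B model from the script above (exact vLLM flags vary by version; <head_node_ip> is a placeholder):

# 1) Form a Ray cluster spanning both nodes.
ray start --head --port=6379                # on the head node
ray start --address=<head_node_ip>:6379     # on the second node

# 2) Launch vLLM with tensor parallelism spanning both nodes and send a request;
#    if generation succeeds, basic cross-node communication (Ray + NCCL) is healthy.
vllm serve Qwen/Qwen2.5-0.5B --tensor-parallel-size 2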

Togetabetterplace avatar Jun 18 '25 13:06 Togetabetterplace

> I finally found that it was a cluster communication issue. You can test it by running vLLM inference across multiple machines; if that works, the cluster communication is fine.

Thank you. I am using sglang-multiturn, so in my case it must be an sglang issue (oh no!!!).

141forever avatar Jun 18 '25 15:06 141forever

Encountered a similar issue on a single-node, multi-GPU server. It appears to be intermittent and can occur with different models. It arises after the model has been training for an unpredictable amount of time (ranging from a few hours to several dozen hours), which makes it difficult to diagnose. I checked the monitoring logs and did not observe any signs of GPU memory exhaustion.

The error messages are as follows:

ray.exceptions.ActorUnavailableError: The actor d47cd6db4476d07641738b0901000000 is unavailable: The actor is temporarily unavailable: RpcError: RPC Error message: keepalive watchdog timeout; RPC Error details: rpc_code: 14. The task may or may not have been executed on the actor.

Or:

The actor is dead because its owner has died.
    Owner Id: 01000000ffffffffffffffffffffffffffffffffffffffffffffffff
    Owner IP address: dlc4qt154rmg0wx0-master-0
    Owner worker exit type: SYSTEM_ERROR
    Worker exit detail: Owner's node has crashed.

xiaobanni avatar Jul 01 '25 03:07 xiaobanni

I met the same error. In my case the root cause was that CPU memory filled up because of the reward calculation, which executes programs in a sandbox.
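
A minimal sketch of how one might watch for this on the training node before the OOM killer fires (the interval and process count are arbitrary):

# Periodically log overall memory usage and the top resident-memory processes,
# so a slow leak from reward/sandbox subprocesses becomes visible in the logs.
while true; do
    date
    free -h
    ps -eo pid,rss,comm --sort=-rss | head -n 10
    sleep 60
done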

rong-hash avatar Jul 17 '25 02:07 rong-hash

I also encountered the same problem. May I ask how you solved it?

yuqikong avatar Jul 24 '25 11:07 yuqikong

Same error

pepsi2222 avatar Aug 04 '25 02:08 pepsi2222

any progess?

thu-unicorn avatar Aug 29 '25 07:08 thu-unicorn

https://www.aidoczh.com/ray/ray-core/patterns/ray-get-too-many-objects.html. I am not sure whether this is the cause of the problem.

xuelei123-xgb avatar Oct 24 '25 11:10 xuelei123-xgb

same error

StoKou avatar Nov 03 '25 02:11 StoKou

This also happened while running sglang multiturn. Is there a solution yet?

StoKou avatar Nov 03 '25 02:11 StoKou

same issue

Fu-Fu-Fu-Fu avatar Nov 03 '25 10:11 Fu-Fu-Fu-Fu

same issue

AdAstraAbyssoque avatar Nov 23 '25 02:11 AdAstraAbyssoque