ray.exceptions.ActorDiedError: The actor died unexpectedly before finishing this task.
Traceback (most recent call last):
File "
same here
I also have the same error; I find the training stuck in actor_rollout_compute_log_prob before the actor dies.
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| 0 N/A N/A 2859585 C ...Dict.actor_rollout_compute_log_prob 20304MiB |
| 1 N/A N/A 2861969 C ...Dict.actor_rollout_compute_log_prob 20402MiB |
| 2 N/A N/A 2861970 C ...Dict.actor_rollout_compute_log_prob 20402MiB |
| 3 N/A N/A 2861971 C ...Dict.actor_rollout_compute_log_prob 20402MiB |
| 4 N/A N/A 2861972 C ...Dict.actor_rollout_compute_log_prob 20402MiB |
| 5 N/A N/A 2861973 C ...Dict.actor_rollout_compute_log_prob 20402MiB |
| 6 N/A N/A 2861974 C ...Dict.actor_rollout_compute_log_prob 20398MiB |
| 7 N/A N/A 2861975 C ...Dict.actor_rollout_compute_log_prob 19918MiB |
same problem.
I fixed it by setting actor_rollout_ref.rollout.log_prob_micro_batch_size so that log_prob_micro_batch_size // world_size != 0, as mentioned in https://github.com/volcengine/verl/issues/12#issuecomment-2473822765
I attempted to modify log_prob_micro_batch_size, but the training still gets stuck at actor_rollout_compute_log_prob.
I suspect the issue might be related to ulysses_sequence_parallel_size. When I set ulysses_sequence_parallel_size=8, the training gets stuck at actor_rollout_compute_log_prob. However, when I set ulysses_sequence_parallel_size=4, the training no longer gets stuck there, but it sometimes results in an out-of-memory (OOM) error during the actor update.
same error
same
same here, waiting for an official solution
same error
same
The error message is a generic one from Ray. Any failed Ray job may start with ray.exceptions.ActorDiedError: The actor died unexpectedly before finishing this task. It is important to look at the part of the traceback related to verl and where it failed.
Setting log_prob_micro_batch_size_per_gpu is recommended so that there is at least one sample per GPU. If that value is set and you still observe a training hang or other errors, please provide a script that reproduces the issue.
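For concreteness, a minimal sketch of the check implied here, using illustrative shell variable names (not verl's own config machinery):
# Hedged sketch: ensure at least one log-prob sample lands on every GPU.
# NNODES, N_GPUS_PER_NODE, and log_prob_micro_batch_size are illustrative names.
world_size=$((NNODES * N_GPUS_PER_NODE))
if (( log_prob_micro_batch_size / world_size == 0 )); then
    echo "micro batch ${log_prob_micro_batch_size} < world size ${world_size}: some GPUs get zero samples" >&2
    exit 1
fi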
We are currently testing multi-machine, multi-GPU training with two machines, each equipped with a single 40GB A100 GPU. We plan to train and evaluate on Qwen2.5-0.5B. The same issue occurred:
We modified the script, and an error is displayed after configuring micro_batch_size_per_gpu:
Could not override 'actor_rollout_ref.actor.micro_batch_size_per_gpu'.
To append to your config use +actor_rollout_ref.actor.micro_batch_size_per_gpu=1
Key 'micro_batch_size_per_gpu' is not in struct
    full_key: actor_rollout_ref.actor.micro_batch_size_per_gpu
    object_type=dict
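For reference, a sketch of the override forms Hydra accepts here; the per-GPU key names below are assumptions to verify against your verl version (the actor's existing key is usually ppo_micro_batch_size_per_gpu, not micro_batch_size_per_gpu):
# Hedged sketch: Hydra only overrides keys that already exist in the struct;
# a '+' prefix appends a new key. Key names below are assumptions, not verified.
python3 -m recipe.dapo.src.main_dapo \
    actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=1 \
    actor_rollout_ref.ref.log_prob_micro_batch_size_per_gpu=1 \
    actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu=1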
script:
#!/usr/bin/env bash
set -xeuo pipefail
project_name='d1'
exp_name='DAPO-Qwen2.5-0.5B'
adv_estimator=grpo
use_kl_in_reward=False
kl_coef=0.0
use_kl_loss=False
kl_loss_coef=0.0
clip_ratio_low=0.2
clip_ratio_high=0.28
max_prompt_length=$((1024 * 2))
max_response_length=$((1024 * 20))
enable_overlong_buffer=True
overlong_buffer_len=$((1024 * 4))
overlong_penalty_factor=1.0
loss_agg_mode="token-mean"
enable_filter_groups=True
filter_groups_metric=acc
max_num_gen_batches=0
train_prompt_bsz=2
gen_prompt_bsz=$((train_prompt_bsz * 2))
n_resp_per_prompt=4
train_prompt_mini_bsz=1
# Ray
RAY_ADDRESS=${RAY_ADDRESS:-"http://11.220.*1.**6:8265"}
WORKING_DIR=${WORKING_DIR:-"${PWD}"}
RUNTIME_ENV=${RUNTIME_ENV:-"${WORKING_DIR}/verl/trainer/runtime_env.yaml"}
NNODES=${NNODES:-2}
# Paths
RAY_DATA_HOME=${RAY_DATA_HOME:-"${HOME}/code/verl"}
MODEL_PATH=${MODEL_PATH:-"${RAY_DATA_HOME}/models/Qwen2.5-0.5B"}
CKPTS_DIR=${CKPTS_DIR:-"${RAY_DATA_HOME}/ckpts/${project_name}/${exp_name}"}
TRAIN_FILE=${TRAIN_FILE:-"${RAY_DATA_HOME}/data/dapo-math-17k.parquet"}
TEST_FILE=${TEST_FILE:-"${RAY_DATA_HOME}/data/aime-2024.parquet"}
# Algorithm
temperature=1.0
top_p=1.0
top_k=-1 # 0 for HF rollout, -1 for vLLM rollout
val_top_p=0.7
# Performance Related Parameter
sp_size=2
use_dynamic_bsz=False # change
actor_ppo_max_token_len=$((max_prompt_length + max_response_length))
infer_ppo_max_token_len=$((max_prompt_length + max_response_length))
offload=True
gen_tp=1
ray job submit --no-wait --runtime-env="${RUNTIME_ENV}" \
--working-dir "${WORKING_DIR}" \
-- python3 -m recipe.dapo.src.main_dapo \
data.train_files="${TRAIN_FILE}" \
data.val_files="${TEST_FILE}" \
data.prompt_key=prompt \
data.truncation='left' \
data.max_prompt_length=${max_prompt_length} \
data.max_response_length=${max_response_length} \
data.gen_batch_size=${gen_prompt_bsz} \
data.train_batch_size=${train_prompt_bsz} \
actor_rollout_ref.rollout.n=${n_resp_per_prompt} \
algorithm.adv_estimator=${adv_estimator} \
algorithm.use_kl_in_reward=${use_kl_in_reward} \
algorithm.kl_ctrl.kl_coef=${kl_coef} \
actor_rollout_ref.actor.use_kl_loss=${use_kl_loss} \
actor_rollout_ref.actor.kl_loss_coef=${kl_loss_coef} \
actor_rollout_ref.actor.clip_ratio_low=${clip_ratio_low} \
actor_rollout_ref.actor.clip_ratio_high=${clip_ratio_high} \
actor_rollout_ref.actor.clip_ratio_c=10.0 \
algorithm.filter_groups.enable=${enable_filter_groups} \
algorithm.filter_groups.max_num_gen_batches=${max_num_gen_batches} \
algorithm.filter_groups.metric=${filter_groups_metric} \
actor_rollout_ref.model.use_remove_padding=True \
actor_rollout_ref.actor.use_dynamic_bsz=${use_dynamic_bsz} \
actor_rollout_ref.ref.log_prob_use_dynamic_bsz=${use_dynamic_bsz} \
actor_rollout_ref.rollout.log_prob_use_dynamic_bsz=${use_dynamic_bsz} \
actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu=1 \
actor_rollout_ref.actor.ppo_max_token_len_per_gpu=${actor_ppo_max_token_len} \
actor_rollout_ref.ref.log_prob_max_token_len_per_gpu=${infer_ppo_max_token_len} \
actor_rollout_ref.rollout.log_prob_max_token_len_per_gpu=${infer_ppo_max_token_len} \
actor_rollout_ref.model.path="${MODEL_PATH}" \
+actor_rollout_ref.model.override_config.attention_dropout=0. \
+actor_rollout_ref.model.override_config.embd_pdrop=0. \
+actor_rollout_ref.model.override_config.resid_pdrop=0. \
+actor_rollout_ref.model.override_config.torch_dtype="float16" \
actor_rollout_ref.model.enable_gradient_checkpointing=False \
actor_rollout_ref.actor.optim.lr=1e-6 \
actor_rollout_ref.actor.optim.lr_warmup_steps=10 \
actor_rollout_ref.actor.optim.weight_decay=0.1 \
actor_rollout_ref.actor.ppo_mini_batch_size=${train_prompt_mini_bsz} \
actor_rollout_ref.actor.fsdp_config.param_offload=${offload} \
actor_rollout_ref.actor.fsdp_config.optimizer_offload=${offload} \
actor_rollout_ref.actor.entropy_coeff=0 \
actor_rollout_ref.actor.grad_clip=1.0 \
actor_rollout_ref.actor.loss_agg_mode=${loss_agg_mode} \
actor_rollout_ref.actor.ulysses_sequence_parallel_size=${sp_size} \
actor_rollout_ref.rollout.gpu_memory_utilization=0.80 \
actor_rollout_ref.rollout.tensor_model_parallel_size=${gen_tp} \
actor_rollout_ref.rollout.enable_chunked_prefill=True \
actor_rollout_ref.rollout.max_num_batched_tokens=$((max_prompt_length + max_response_length)) \
actor_rollout_ref.rollout.temperature=${temperature} \
actor_rollout_ref.rollout.top_p=${top_p} \
actor_rollout_ref.rollout.top_k="${top_k}" \
actor_rollout_ref.rollout.val_kwargs.temperature=${temperature} \
actor_rollout_ref.rollout.val_kwargs.top_p=${val_top_p} \
actor_rollout_ref.rollout.val_kwargs.top_k=${top_k} \
actor_rollout_ref.rollout.val_kwargs.do_sample=True \
actor_rollout_ref.rollout.val_kwargs.n=1 \
actor_rollout_ref.ref.fsdp_config.param_offload=${offload} \
actor_rollout_ref.ref.ulysses_sequence_parallel_size=${sp_size} \
actor_rollout_ref.actor.fsdp_config.fsdp_size=-1 \
reward_model.reward_manager=dapo \
reward_model.overlong_buffer.enable=${enable_overlong_buffer} \
reward_model.overlong_buffer.len=${overlong_buffer_len} \
reward_model.overlong_buffer.penalty_factor=${overlong_penalty_factor} \
trainer.logger=['console','swanlab'] \
trainer.project_name="${project_name}" \
trainer.experiment_name="${exp_name}" \
trainer.n_gpus_per_node=1 \
trainer.nnodes=2 \
trainer.val_before_train=True \
trainer.test_freq=5 \
trainer.save_freq=5 \
trainer.total_epochs=1 \
trainer.default_local_dir="${CKPTS_DIR}" \
trainer.resume_mode=auto
Same problem. Any progress?
same problem
I finally found that it is a cluster communication bug. You can test it with vLLM for inference on multiple machines; if that works, the cluster communication is OK.
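A rough smoke test along those lines, assuming a Ray cluster is already running across both nodes and reusing MODEL_PATH from the script above (exact flags may differ by vLLM version):
# Hedged sketch: multi-node vLLM inference check (2 nodes x 1 GPU).
# If this hangs or crashes, suspect NCCL/Ray cluster communication, not verl.
python3 -m vllm.entrypoints.openai.api_server \
    --model "${MODEL_PATH}" \
    --tensor-parallel-size 1 \
    --pipeline-parallel-size 2 \
    --distributed-executor-backend ray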
Thank you. I am now using sglang-multiturn, so it must be sglang (no!!!).
I encountered a similar issue on a single-node, multi-GPU server; it appears to be intermittent and can occur with different models. It arises after the model has been training for an unpredictable amount of time (ranging from a few hours to several dozen hours), which makes it difficult to diagnose. I checked the monitoring logs and did not observe any signs of GPU memory exhaustion.
The error messages are as follows:
ray.exceptions.ActorUnavailableError: The actor d47cd6db4476d07641738b0901000000 is unavailable: The actor is temporarily unavailable: RpcError: RPC Error message: keepalive watchdog timeout; RPC Error details: rpc_code: 14. The task may or may not have been executed on the actor.
Or:
The actor is dead because its owner has died. Owner Id: 01000000ffffffffffffffffffffffffffffffffffffffffffffffff Owner IP address: dlc4qt154rmg0wx0-master-0 Owner worker exit type: SYSTEM_ERROR Worker exit detail: Owner's node has crashed.
I met the same error. The root cause in my case was that CPU memory filled up due to the reward calculation's program execution in the sandbox.
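A minimal watch loop (my own suggestion, not what was actually run) to confirm this kind of host-memory exhaustion:
# Hedged sketch: log host memory every 30 s; a CPU OOM from sandboxed reward
# execution should show up at the tail of this log right before the actor dies.
while true; do
    { date; free -h; echo; } >> mem_watch.log
    sleep 30
done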
I also encountered the same problem. May I ask how you solved it?
Same error
any progress?
https://www.aidoczh.com/ray/ray-core/patterns/ray-get-too-many-objects.html. Not sure whether this is the cause of the problem.
same error
This also happened while running sglang multiturn. Is there a solution yet?
same issue
same issue