verl
Mismatch between trajectory and reward_extra_infos_dict when using _log_rollout_data
System Info
In RayPPOTrainer, _balance_batch shuffles the order of the batch in place, which seems to leave reward_extra_infos_dict out of order relative to the batch when _log_rollout_data runs. Note: this is a potential issue spotted while reading the code; I have not yet tried to reproduce it.
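Here is a minimal sketch of the suspected desync, assuming the extras dict were filled before the in-place reorder (toy code with hypothetical names, not verl's actual internals):

```python
import numpy as np

# Hypothetical toy data standing in for the real batch / extras objects.
batch = {"prompts": np.array(["A", "B", "C", "D"])}
extras = {"custom_metric": [0.1, 0.2, 0.3, 0.4]}  # filled in the ORIGINAL order

perm = np.array([2, 0, 3, 1])              # permutation a balancing step might apply
batch["prompts"] = batch["prompts"][perm]  # batch is reordered in place...

# ...while `extras` still follows the original order, so zipping them row by
# row (as a logger would) pairs prompt C with sample A's metric, and so on.
for prompt, metric in zip(batch["prompts"], extras["custom_metric"]):
    print(prompt, metric)  # -> C 0.1, A 0.2, D 0.3, B 0.4: misaligned
```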
Information
- [ ] The official example scripts
- [ ] My own modified scripts
Tasks
- [ ] An officially supported task in the `examples` folder (such as GLUE/SQuAD, ...)
- [ ] My own task or dataset (give details below)
Reproduction
Enable balance_batch and rollout_data_dir at the same time.
Expected behavior
The entries in reward_extra_infos_dict should be in the same order as the trajectories in the batch.
Why would they be out of order, though? Isn't reward_extra_infos_dict computed from the batch after it has already been balanced? See the sketch and step-by-step trace below.
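As a toy sketch of that ordering (hypothetical stand-ins, not verl's actual API): the batch is balanced first, and the extras are then computed by iterating over the already-reordered batch, so both carry the same permutation:

```python
import numpy as np

batch = {"prompts": np.array(["A", "B", "C", "D"])}
perm = np.array([2, 0, 3, 1])
batch["prompts"] = batch["prompts"][perm]  # Step 1: balancing reorders in place

# Step 2: extras are computed FROM the reordered batch, inheriting its order.
extras = {"custom_metric": [f"metric_{p}" for p in batch["prompts"]]}

for prompt, metric in zip(batch["prompts"], extras["custom_metric"]):
    assert metric == f"metric_{prompt}"  # aligned: C<->metric_C, A<->metric_A, ...
```

The step-by-step trace below walks the actual RayPPOTrainer call sites: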
```python
# === Config ===
config = {
    "trainer": {
        "balance_batch": True,           # ✅ enabled
        "rollout_data_dir": "./rollout"  # ✅ enabled
    }
}
# === Step 0: original batch ===
# Original order: [Sample_A, Sample_B, Sample_C, Sample_D]
batch.batch["prompts"] = [prompt_A, prompt_B, prompt_C, prompt_D]
# === Step 1: Line 1088 - _balance_batch() ===
self._balance_batch(batch, metrics)
# batch is reordered in place to: [Sample_C, Sample_A, Sample_D, Sample_B]
batch.batch["prompts"] = [prompt_C, prompt_A, prompt_D, prompt_B]
# === Step 2: Line 1104 - compute_reward() ===
# 🔑 Key point: the batch here has already been reordered!
reward_tensor, reward_extra_infos_dict = compute_reward(batch, self.reward_fn)
# Inside compute_reward, the batch being processed is the reordered one:
for i in range(len(batch)):  # iterates over [C, A, D, B]
    prompt_str = decode(batch[i].batch["prompts"])  # prompt_C, prompt_A, ...
    response_str = decode(batch[i].batch["responses"])
    score = compute_score(prompt_str, response_str, ...)
    reward_extra_info["custom_metric"].append(score)
# So the returned reward_extra_infos_dict is ordered as:
reward_extra_infos_dict = {
    "custom_metric": [metric_C, metric_A, metric_D, metric_B]  # ✅ matches the reordered batch
}
# === Step 3: Line 1164 - update ===
batch.non_tensor_batch.update({k: np.array(v) for k, v in reward_extra_infos_dict.items()})
# batch.non_tensor_batch now contains:
# {
#     "custom_metric": [metric_C, metric_A, metric_D, metric_B]  # ✅ same as reward_extra_infos_dict
# }
# === Step 4: Line 1227 - _log_rollout_data() ===
self._log_rollout_data(batch, reward_extra_infos_dict, timing_raw, rollout_data_dir)
# Inside _log_rollout_data:
inputs = decode(batch.batch["prompts"])
# = [prompt_C, prompt_A, prompt_D, prompt_B] ✅
outputs = decode(batch.batch["responses"])
# = [response_C, response_A, response_D, response_B] ✅
scores = batch.batch["token_level_scores"].sum(-1)
# = [score_C, score_A, score_D, score_B] ✅
reward_extra_infos_to_dump = reward_extra_infos_dict.copy()
# = {"custom_metric": [metric_C, metric_A, metric_D, metric_B]} ✅
# === Step 5: dump to file ===
# rollout_data_dir/0.jsonl:
# {"input": prompt_C, "output": response_C, "score": score_C, "custom_metric": metric_C}  # ✅ aligned!
# {"input": prompt_A, "output": response_A, "score": score_A, "custom_metric": metric_A}  # ✅ aligned!
# {"input": prompt_D, "output": response_D, "score": score_D, "custom_metric": metric_D}  # ✅ aligned!
# {"input": prompt_B, "output": response_B, "score": score_B, "custom_metric": metric_B}  # ✅ aligned!
```