
Mismatch between trajectory and reward_extra_infos_dict when using _log_rollout_data

Open mirrorboat opened this issue 2 months ago • 1 comment

System Info

In RayPPOTrainer, _balance_batch reorders the batch, which seems to cause the order of reward_extra_infos_dict to be inconsistent with the batch when _log_rollout_data is executed. Note: this is a potential issue spotted while reading the code; I have not tried to reproduce it yet.
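
For intuition, here is a toy sketch of the suspected failure mode (plain Python lists with made-up values, not verl's actual DataProto or trainer code): if the extra-info values were collected in the pre-shuffle order while the batch is permuted in place, zipping the two at logging time would pair rows from different samples.

  # Hypothetical, simplified illustration; not verl's real data structures.
  prompts = ["A", "B", "C", "D"]
  extra_info = {"custom_metric": [0.1, 0.2, 0.3, 0.4]}  # collected in A, B, C, D order

  perm = [2, 0, 3, 1]                                   # stand-in for _balance_batch's reordering
  prompts = [prompts[i] for i in perm]                  # batch reordered in place

  for prompt, metric in zip(prompts, extra_info["custom_metric"]):
      print(prompt, metric)  # "C" gets A's metric, "A" gets B's metric, ...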

Information

  • [ ] The official example scripts
  • [ ] My own modified scripts

Tasks

  • [ ] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • [ ] My own task or dataset (give details below)

Reproduction

Enable both balance_batch and rollout_data_dir at the same time.

Expected behavior

The order of the entries in reward_extra_infos_dict should match the trajectories in the batch.

mirrorboat · Nov 21 '25

Why would they be inconsistent? Isn't reward_extra_infos_dict computed afterwards from the already-balanced batch?

  # === Config ===
  config = {
      "trainer": {
          "balance_batch": True,           # ✅ enabled
          "rollout_data_dir": "./rollout"  # ✅ enabled
      }
  }

  # === Step 0: original batch ===
  # Original order: [Sample_A, Sample_B, Sample_C, Sample_D]
  batch.batch["prompts"] = [prompt_A, prompt_B, prompt_C, prompt_D]

  # === Step 1: Line 1088 - _balance_batch() ===
  self._balance_batch(batch, metrics)
  # batch is reordered in place to: [Sample_C, Sample_A, Sample_D, Sample_B]
  batch.batch["prompts"] = [prompt_C, prompt_A, prompt_D, prompt_B]

  # === Step 2: Line 1104 - compute_reward() ===
  # 🔑 Key point: the batch here has already been reordered!
  reward_tensor, reward_extra_infos_dict = compute_reward(batch, self.reward_fn)

  # compute_reward internally works on the reordered batch:
  for i in range(len(batch)):  # iterates over [C, A, D, B]
      prompt_str = decode(batch[i].batch["prompts"])  # prompt_C, prompt_A, ...
      response_str = decode(batch[i].batch["responses"])
      score = compute_score(prompt_str, response_str, ...)
      reward_extra_info["custom_metric"].append(score)

  # So the returned reward_extra_infos_dict is ordered as:
  reward_extra_infos_dict = {
      "custom_metric": [metric_C, metric_A, metric_D, metric_B]  # ✅ matches the reordered batch
  }

  # === Step 3: Line 1164 - update ===
  batch.non_tensor_batch.update({k: np.array(v) for k, v in reward_extra_infos_dict.items()})
  # batch.non_tensor_batch now contains:
  # {
  #     "custom_metric": [metric_C, metric_A, metric_D, metric_B]  # ✅ same as reward_extra_infos_dict
  # }

  # === Step 4: Line 1227 - _log_rollout_data() ===
  self._log_rollout_data(batch, reward_extra_infos_dict, timing_raw, rollout_data_dir)

  # Inside _log_rollout_data:
  inputs = decode(batch.batch["prompts"])
  # = [prompt_C, prompt_A, prompt_D, prompt_B]  ✅

  outputs = decode(batch.batch["responses"])
  # = [response_C, response_A, response_D, response_B]  ✅

  scores = batch.batch["token_level_scores"].sum(-1)
  # = [score_C, score_A, score_D, score_B]  ✅

  reward_extra_infos_to_dump = reward_extra_infos_dict.copy()
  # = {"custom_metric": [metric_C, metric_A, metric_D, metric_B]}  ✅

  # === Step 5: dump to file ===
  # rollout_data_dir/0.jsonl:
  # {"input": prompt_C, "output": response_C, "score": score_C, "custom_metric": metric_C}  ✅ 对齐!
  # {"input": prompt_A, "output": response_A, "score": score_A, "custom_metric": metric_A}  ✅ 对齐!
  # {"input": prompt_D, "output": response_D, "score": score_D, "custom_metric": metric_D}  ✅ 对齐!
  # {"input": prompt_B, "output": response_B, "score": score_B, "custom_metric": metric_B}  ✅ 对齐!

JobQiu · Nov 23 '25