
Logits temperature scaling affects the rollout_actor_probs_pearson_corr metric

Open Lokiscripter opened this issue 3 months ago • 0 comments

System Info

verl: latest

python: 3.10.6

Information

  • [x] The official example scripts
  • [x] My own modified scripts

Tasks

  • [x] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • [ ] My own task or dataset (give details below)

Reproduction

When vLLM is used for rollout, the returned rollout_log_probs are computed from logits that are not scaled by temperature. If the training side divides the logits by temperature before computing log-probs, the two sides compare probabilities from different distributions, and the metric "rollout_actor_probs_pearson_corr" no longer measures training/inference alignment correctly. Generally, a correlation coefficient closer to 1 is taken to indicate better alignment between training and inference.

The FSDP actor has a similar issue.
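The effect can be reproduced without verl or vLLM. The following toy simulation rests on stated assumptions: random logits stand in for the model, the rollout side reports the sampled token's log-prob under the raw logits (as this issue describes for vLLM), the training side optionally divides the logits by temperature first, and the metric is assumed to be the Pearson correlation between the per-token probabilities exp(log_prob) of the two sides. The helper names (`log_softmax`, `pearson`, `simulate`) are hypothetical, not verl APIs.

```python
import math
import random


def log_softmax(logits, temperature=1.0):
    """Log-softmax of logits / temperature, with max-subtraction for stability."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    log_z = m + math.log(sum(math.exp(x - m) for x in scaled))
    return [x - log_z for x in scaled]


def pearson(xs, ys):
    """Plain Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def simulate(temperature, train_scales, num_tokens=500, vocab=8, seed=0):
    """Hypothetical toy: correlate rollout-side and training-side token probs."""
    rng = random.Random(seed)
    rollout_probs, actor_probs = [], []
    for _ in range(num_tokens):
        logits = [rng.gauss(0.0, 2.0) for _ in range(vocab)]
        # The token itself is sampled with temperature, as a rollout engine would.
        sample_lp = log_softmax(logits, temperature)
        weights = [math.exp(lp) for lp in sample_lp]
        token = rng.choices(range(vocab), weights=weights)[0]
        # Rollout side (assumption, per the issue): log-prob from raw logits.
        rollout_lp = log_softmax(logits)[token]
        # Training side: optionally divides the logits by temperature first.
        actor_lp = log_softmax(logits, temperature if train_scales else 1.0)[token]
        rollout_probs.append(math.exp(rollout_lp))
        actor_probs.append(math.exp(actor_lp))
    return pearson(rollout_probs, actor_probs)


print(simulate(0.6, train_scales=False))  # same definition on both sides: 1.0 up to rounding
print(simulate(0.6, train_scales=True))   # only the training side rescales: below 1.0
```

Although the "model" is identical on both sides, the correlation drops below 1 as soon as only the training side rescales the logits, so the metric reports a misalignment that comes from the log-prob definition rather than from any real training/inference mismatch.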

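# verl/workers/actor/megatron_actor.py (excerpt: a nested function, so `temperature`, `calculate_entropy` and `logger` come from the enclosing scope)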
                def logits_processor(logits, label, label_mask):
                    assert logits.shape[:2] == label.shape[:2]
                    assert label.shape == label_mask.shape
                    logits.div_(temperature)  # in-place temperature scaling; this affects rollout_actor_probs_pearson_corr
                    ret = {}
                    if calculate_entropy:
                        logits_bak = logits.clone()
                        logger.warning_once(
                            "For memory-efficient computation, enable fused kernels via "
                            "`actor_rollout_ref.model.use_fused_kernels=True`. "
                            "The current `clone()` operation ensures correctness but increases memory usage."
                        )
                        entropy = vocab_parallel_entropy(logits)
                        ret["entropy"] = entropy
                    else:
                        logits_bak = logits
                    log_probs = vocab_parallel_log_probs_from_logits(logits_bak, label)
                    log_probs = log_probs.masked_fill(~label_mask, 0.0)
                    ret["log_probs"] = log_probs
                    return ret

Expected behavior

The training side should not apply temperature scaling to the logits, so that its log-probs stay aligned with the inference-side rollout_log_probs and the computed rollout_actor_probs_pearson_corr is meaningful.
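A minimal sketch of the expected behaviour in the same kind of toy setting: the training-side log-probs compared against rollout_log_probs are computed from unscaled logits (temperature=1.0), and the correlation is taken only over positions where the mask is set, because masked positions are filled with log-prob 0.0 (probability 1), as with the excerpt's masked_fill. The function names are hypothetical, and the inputs are plain Python lists rather than vocab-parallel tensors.

```python
import math


def token_log_probs(logits_seq, labels, mask, temperature=1.0):
    """Per-token log-prob of each label; masked positions are set to 0.0,
    mirroring log_probs.masked_fill(~label_mask, 0.0) in the excerpt."""
    out = []
    for logits, label, keep in zip(logits_seq, labels, mask):
        if not keep:
            out.append(0.0)
            continue
        scaled = [x / temperature for x in logits]
        m = max(scaled)
        log_z = m + math.log(sum(math.exp(x - m) for x in scaled))
        out.append(scaled[label] - log_z)
    return out


def masked_prob_pearson(lp_a, lp_b, mask):
    """Pearson correlation of exp(log-prob), over unmasked positions only."""
    xs = [math.exp(a) for a, k in zip(lp_a, mask) if k]
    ys = [math.exp(b) for b, k in zip(lp_b, mask) if k]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


# Usage: three real tokens and one padding position.
logits_seq = [[2.0, 0.5, -1.0], [0.1, 0.3, 1.5], [1.0, 1.0, 0.0], [0.0, 0.0, 0.0]]
labels = [0, 2, 1, 0]
mask = [True, True, True, False]
rollout_lp = token_log_probs(logits_seq, labels, mask)                 # raw logits, like the rollout
actor_lp = token_log_probs(logits_seq, labels, mask, temperature=1.0)  # expected: no scaling
print(masked_prob_pearson(rollout_lp, actor_lp, mask))  # 1.0 up to rounding
```

Computing the training-side log-probs with temperature=0.5 on the same inputs instead gives a correlation clearly below 1, which is the misleading reading this issue describes.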

Lokiscripter · Nov 17 '25 09:11