RLHF
GPTRewardModel class
Why are the rewards truncated in the "GPTRewardModel" class? What is the reasoning behind this, and where can I find more information about it?
# Retrieve first index where trajectories diverge
divergence_ind = (chosen[i] != rejected[i]).nonzero()[0]
assert divergence_ind > 0
# Index into the correct rewards
c_truncated_reward = chosen_rewards[i][divergence_ind:end_ind]
r_truncated_reward = rejected_rewards[i][divergence_ind:end_ind]
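For context, my current understanding is that these truncated rewards (covering only the tokens where the chosen and rejected sequences differ) are then compared in a pairwise ranking loss. Below is a minimal sketch of how I imagine that step, assuming the standard pairwise loss; the toy tensors and the exact loss form are my assumption, not code quoted from the repo:

import torch

# Toy per-token rewards over the divergent span; values are made up for illustration.
# The names mirror the snippet above.
c_truncated_reward = torch.tensor([0.4, 0.9, 1.2])  # rewards for the chosen response
r_truncated_reward = torch.tensor([0.1, 0.3, 0.2])  # rewards for the rejected response

# Standard pairwise ranking loss: push the chosen reward above the rejected
# reward at every token position after the divergence point.
loss = -torch.log(torch.sigmoid(c_truncated_reward - r_truncated_reward)).mean()
print(loss)

If that reading is right, the truncation would just restrict the loss to the part of the sequences that actually differs, but I would like to confirm this and understand why the shared prefix is excluded.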
Thanks in advance