RLHF
GPTRewardModel class
Why are the rewards truncated in the "GPTRewardModel" class? What is the reasoning behind this, and where can I find more information about it?
# Retrieve first index where trajectories diverge
divergence_ind = (chosen[i] != rejected[i]).nonzero()[0]
assert divergence_ind > 0
# Index into the correct rewards
c_truncated_reward = chosen_rewards[i][divergence_ind:end_ind]
r_truncated_reward = rejected_rewards[i][divergence_ind:end_ind]
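For context, my current understanding is that these truncated rewards (covering only the tokens where the chosen and rejected sequences differ) are then compared in a pairwise ranking loss. Below is a minimal sketch of how I imagine that step, assuming the standard pairwise loss; the toy tensors and the exact loss form are my assumption, not code quoted from the repo:

import torch

# Toy per-token rewards over the divergent span; values are made up for illustration.
# The names mirror the snippet above.
c_truncated_reward = torch.tensor([0.4, 0.9, 1.2])  # rewards for the chosen response
r_truncated_reward = torch.tensor([0.1, 0.3, 0.2])  # rewards for the rejected response

# Standard pairwise ranking loss: push the chosen reward above the rejected
# reward at every token position after the divergence point.
loss = -torch.log(torch.sigmoid(c_truncated_reward - r_truncated_reward)).mean()
print(loss)

If that reading is right, the truncation would just restrict the loss to the part of the sequences that actually differs, but I would like to confirm this and understand why the shared prefix is excluded.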
Thanks in advance