RL-related questions: understanding scores, rewards, and PPO in the code
🚀 The feature, motivation, and pitch
https://github.com/CarperAI/trlx/blob/b91da7b03d8e9fa0c0d6dce10a8f2611aca3013f/trlx/trainer/accelerate_ppo_trainer.py
(1) In line 307:

```python
scores = torch.tensor(
    self.reward_fn(
        samples=str_samples,
        prompts=str_prompts,
        outputs=str_outputs,
    ),
    dtype=torch.float,
).to(device)
```
What is the meaning of scores and how is it calculated? What is self.reward_fn? What role does scores play in the PPO RL framework?
(2) In line 440, why is scores[sample_idx] added to rewards[-1]?

```python
rewards = sample_kl_divergence_estimate
rewards[-1] += scores[sample_idx].cpu()
```
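Here is my current guess at what that block is doing; is this reading correct? This is only a sketch, with kl_coef, logprobs, and ref_logprobs standing in for the trainer's internal variables, not the exact code:

```python
import torch

def make_rewards(logprobs: torch.Tensor,
                 ref_logprobs: torch.Tensor,
                 score: torch.Tensor,
                 kl_coef: float) -> torch.Tensor:
    # Per-token KL penalty between the fine-tuned policy and the frozen
    # reference model, estimated from the sampled tokens' log-probs.
    kl_divergence_estimate = -kl_coef * (logprobs - ref_logprobs)
    rewards = kl_divergence_estimate.clone()
    # The scalar score from reward_fn is added only at the last generated
    # token, so the task reward is terminal while the KL penalty is
    # spread over every token of the response.
    rewards[-1] += score
    return rewards
```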
Thanks
I found self.reward_fn is:

```python
def reward_fn(samples: List[str], **kwargs):
    original_samples = [text.split("TL;DR:")[0] + "TL;DR: " for text in samples]
    original_samples = [text + post_summary_dict[text.strip()] for text in original_samples]
    original_scores = get_scores(original_samples)
    scores = get_scores(samples)
    norms_scores = scores - original_scores
    return norms_scores
```
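If I read this correctly, the returned norms_scores is the reward-model score of the generated summary minus the score of the human reference summary for the same post, so PPO is rewarded for improving over the reference TL;DR rather than for the absolute reward-model output. A toy walk-through of that reading (post_summary_dict, get_scores, and the sample string below are placeholders I made up, not the real ones from the summarize example):

```python
from typing import List
import torch

# Toy stand-ins for the pieces reward_fn relies on. In the real example,
# get_scores queries the trained reward model and post_summary_dict maps
# "post + 'TL;DR:'" to the human-written summary.
post_summary_dict = {"Some reddit post. TL;DR:": "a human-written reference summary"}

def get_scores(samples: List[str]) -> torch.Tensor:
    # Placeholder scoring: the real function runs the reward model.
    return torch.tensor([float(len(s)) for s in samples])

# One generated sample: prompt + model-generated summary.
sample = "Some reddit post. TL;DR: a model-generated summary"

# Rebuild the matching reference sample, the same way reward_fn does.
original = sample.split("TL;DR:")[0] + "TL;DR: "
original = original + post_summary_dict[original.strip()]

# Normalized score = score of the generated summary
# minus score of the human reference summary for the same post.
norms_score = get_scores([sample]) - get_scores([original])
print(norms_score)
```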