RL-related questions: understanding scores, rewards, and PPO in the code
🚀 The feature, motivation, and pitch
https://github.com/CarperAI/trlx/blob/b91da7b03d8e9fa0c0d6dce10a8f2611aca3013f/trlx/trainer/accelerate_ppo_trainer.py
(1) In line 307:

```python
scores = torch.tensor(
    self.reward_fn(
        samples=str_samples,
        prompts=str_prompts,
        outputs=str_outputs,
    ),
    dtype=torch.float,
).to(device)
```
What is the meaning of scores and how is it calculated? What is self.reward_fn? What role does scores play in the PPO RL framework?
(2) In line 440, why is scores[sample_idx] added to rewards[-1]?

```python
rewards = sample_kl_divergence_estimate
rewards[-1] += scores[sample_idx].cpu()
```
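Here is my current guess at what that block is doing; is this reading correct? This is only a sketch, with kl_coef, logprobs, and ref_logprobs standing in for the trainer's internal variables, not the exact code:

```python
import torch

def make_rewards(logprobs: torch.Tensor,
                 ref_logprobs: torch.Tensor,
                 score: torch.Tensor,
                 kl_coef: float) -> torch.Tensor:
    # Per-token KL penalty between the fine-tuned policy and the frozen
    # reference model, estimated from the sampled tokens' log-probs.
    kl_divergence_estimate = -kl_coef * (logprobs - ref_logprobs)
    rewards = kl_divergence_estimate.clone()
    # The scalar score from reward_fn is added only at the last generated
    # token, so the task reward is terminal while the KL penalty is
    # spread over every token of the response.
    rewards[-1] += score
    return rewards
```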
Thanks
I found self.reward_fn is:

```python
def reward_fn(samples: List[str], **kwargs):
    original_samples = [text.split("TL;DR:")[0] + "TL;DR: " for text in samples]
    original_samples = [text + post_summary_dict[text.strip()] for text in original_samples]
    original_scores = get_scores(original_samples)
    scores = get_scores(samples)
    norms_scores = scores - original_scores
    return norms_scores
```
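If I read this correctly, the returned norms_scores is the reward-model score of the generated summary minus the score of the human reference summary for the same post, so PPO is rewarded for improving over the reference TL;DR rather than for the absolute reward-model output. A toy walk-through of that reading (post_summary_dict, get_scores, and the sample string below are placeholders I made up, not the real ones from the summarize example):

```python
from typing import List
import torch

# Toy stand-ins for the pieces reward_fn relies on. In the real example,
# get_scores queries the trained reward model and post_summary_dict maps
# "post + 'TL;DR:'" to the human-written summary.
post_summary_dict = {"Some reddit post. TL;DR:": "a human-written reference summary"}

def get_scores(samples: List[str]) -> torch.Tensor:
    # Placeholder scoring: the real function runs the reward model.
    return torch.tensor([float(len(s)) for s in samples])

# One generated sample: prompt + model-generated summary.
sample = "Some reddit post. TL;DR: a model-generated summary"

# Rebuild the matching reference sample, the same way reward_fn does.
original = sample.split("TL;DR:")[0] + "TL;DR: "
original = original + post_summary_dict[original.strip()]

# Normalized score = score of the generated summary
# minus score of the human reference summary for the same post.
norms_score = get_scores([sample]) - get_scores([original])
print(norms_score)
```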