⚙️ Fixed formatting and pseudocode mistake in Chapter12/3.mdx
Includes fixes to:
- Hyperlink in <Tip> doesn't work, so the URL is removed.
- The pseudocode had a critical issue in step b.i., "Generate group_size different outputs using initial_policy". According to Section 2.2.1 of the R1 paper, "GRPO samples a group of outputs {𝑜1, 𝑜2, · · · , 𝑜𝐺} from the old policy 𝜋𝜃𝑜𝑙𝑑". Sampling from initial_policy, which is updated during training, would leak policy updates into the same iteration, so it must be replaced by reference_policy. The variable naming was also confusing (initial_policy gets updated), so initial_policy is renamed to current_policy.
Note: The pseudocode only includes 𝜋𝜃𝑜𝑙𝑑 and 𝜋𝜃 from the paper, but the KL divergence also uses a 𝜋reference, which is not included despite the reference_policy name in the pseudocode. My assumption is that 𝜋𝜃𝑜𝑙𝑑 (reference_policy) is used in its place for simplicity, but it would be better to acknowledge this in the course.
- The correct answer to every quiz question was the first choice; the choices are now shuffled.
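
To illustrate the sampling fix, here is a minimal Python sketch of one iteration under heavy simplifications. It is not the course's pseudocode or the paper's objective: the policy is a hypothetical toy dict mapping outputs to sampling weights, the update is a stand-in weight nudge rather than the clipped GRPO loss, the KL term is omitted, and the helper names (`group_advantages`, `sample_group`, `grpo_iteration`) and the use of the population standard deviation are assumptions made for the example. The only point it shows is that the group is drawn from the frozen `reference_policy` snapshot, while `current_policy` is the one being updated.

```python
import copy
import random
import statistics


def group_advantages(rewards):
    """Normalize rewards within one group: (r - mean) / std."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # assumption: population std
    if std == 0:
        # All rewards equal: no output is better than the others.
        return [0.0 for _ in rewards]
    return [(r - mean) / std for r in rewards]


def sample_group(policy, group_size, rng):
    """Draw group_size outputs from a toy policy (dict: output -> weight)."""
    outputs = list(policy)
    weights = [policy[o] for o in outputs]
    return rng.choices(outputs, weights=weights, k=group_size)


def grpo_iteration(current_policy, reward_function, group_size, rng, step=0.1):
    # Snapshot the old policy before any update in this iteration.
    reference_policy = copy.deepcopy(current_policy)

    # Sample from the frozen snapshot, not from current_policy.
    group = sample_group(reference_policy, group_size, rng)
    rewards = [reward_function(o) for o in group]
    advantages = group_advantages(rewards)

    # Toy stand-in for the policy update (NOT the clipped GRPO objective):
    # raise or lower each sampled output's weight by its advantage.
    for output, adv in zip(group, advantages):
        current_policy[output] = max(1e-6, current_policy[output] + step * adv)

    return reference_policy, group, advantages
```

Because the snapshot is taken first, updates applied to `current_policy` inside the iteration cannot affect which outputs are sampled for that same iteration.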