⚙️ Fixed formatting and pseudocode mistake in Chapter12/3.mdx

Open hesamsheikh opened this issue 11 months ago • 0 comments

Includes fixes to:

Hyperlink in <Tip> doesn't work, so the URL is removed.
The pseudocode had a critical issue: step b.i., "Generate group_size different outputs using initial_policy", is wrong. According to Section 2.2.1 of the R1 paper, "GRPO samples a group of outputs {𝑜1, 𝑜2, · · · , 𝑜𝐺} from the old policy 𝜋𝜃𝑜𝑙𝑑". Using initial_policy here would leak policy updates into the same iteration, so it must be replaced by reference_policy. The variable names are also confusing (initial_policy gets updated, so it is no longer "initial"), so initial_policy is renamed to current_policy.
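To illustrate the sampling fix, here is a minimal toy sketch, not the course's code. The policy is just a dict of token weights, and `sample_group`, `grpo_iteration`, and the `+= 0.1` update are hypothetical stand-ins for generation and the real gradient step. The only point is that every group in an iteration is drawn from a frozen snapshot (the old policy), while updates go to current_policy.

```python
import copy
import random


def sample_group(policy, prompt, group_size, rng):
    """Draw group_size outputs from a toy policy (dict: token -> weight).

    The prompt is unused in this toy; a real policy would condition on it.
    """
    tokens = list(policy)
    weights = [policy[t] for t in tokens]
    return rng.choices(tokens, weights=weights, k=group_size)


def grpo_iteration(current_policy, prompts, group_size, rng):
    """One toy iteration: sample from the snapshot, update the live policy."""
    # Snapshot the old policy before any update in this iteration.
    old_policy = copy.deepcopy(current_policy)
    groups = {}
    for prompt in prompts:
        # Sample from the frozen snapshot, never from current_policy.
        groups[prompt] = sample_group(old_policy, prompt, group_size, rng)
        # Hypothetical placeholder for the policy update step.
        for token in groups[prompt]:
            current_policy[token] += 0.1
    return old_policy, groups


if __name__ == "__main__":
    policy = {"a": 1.0, "b": 1.0}
    old, groups = grpo_iteration(policy, ["p1", "p2"], 4, random.Random(0))
    print("snapshot:", old)
    print("updated:", policy)
```

Because the second prompt is still sampled from the snapshot, the update made after the first prompt cannot influence it, which is the leak the fix removes.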

Note: The pseudocode only includes 𝜋𝜃𝑜𝑙𝑑 and 𝜋𝜃 from the paper, but the KL divergence term also uses a separate 𝜋reference, which, despite the reference_policy name in the pseudocode, is not included. My assumption is that 𝜋𝜃𝑜𝑙𝑑 (reference_policy) is used in its place for simplicity, but it would be better to acknowledge this in the course.
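For context on why the reference policy matters, here is a rough sketch of the per-token KL estimator as I read it in the paper's GRPO objective, 𝜋ref/𝜋𝜃 − log(𝜋ref/𝜋𝜃) − 1. The function name and the log-probability arguments are my own assumptions for illustration, not anything from the course pseudocode.

```python
import math


def grpo_kl_estimate(logp_current, logp_reference):
    """Per-token estimate: pi_ref/pi_theta - log(pi_ref/pi_theta) - 1.

    Both arguments are log-probabilities of the same token under the
    current policy and the reference policy (hypothetical inputs).
    """
    log_ratio = logp_reference - logp_current
    return math.exp(log_ratio) - log_ratio - 1.0
```

The estimate is zero when the two policies assign the same probability and positive otherwise, so substituting 𝜋𝜃𝑜𝑙𝑑 for 𝜋reference changes which policy the penalty pulls toward.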

The correct answer to every quiz question was the 1st choice; the answer order is now shuffled.

Mar 04 '25 00:03 hesamsheikh