31CFDC30 issues

As I recall, in earlier versions advantages = td_target - state_values, where td_target is computed from the reward and state_values is estimated using the policy after the update iteration.
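
For reference, here is a minimal sketch of the computation described above. It is not the project's actual code: the function name `td_advantages`, the one-step form of the TD target, the `gamma` discount, and the `dones` terminal flags are all assumptions made for illustration. The only part taken from the post is the relation advantages = td_target - state_values.

```python
def td_advantages(rewards, state_values, next_state_values, dones, gamma=0.99):
    """Return per-step advantages as td_target - state_values.

    Hypothetical helper: assumes a one-step TD target of the form
    reward + gamma * V(next_state), with no bootstrapping past terminal steps.
    """
    advantages = []
    for r, v, v_next, done in zip(rewards, state_values, next_state_values, dones):
        # Assumed one-step TD target; the bootstrap term is dropped when the episode ends.
        td_target = r + gamma * v_next * (0.0 if done else 1.0)
        # The relation quoted in the post.
        advantages.append(td_target - v)
    return advantages


if __name__ == "__main__":
    adv = td_advantages(
        rewards=[1.0, 0.0],
        state_values=[0.5, 0.2],
        next_state_values=[0.2, 0.9],
        dones=[False, True],
        gamma=0.5,
    )
    print(adv)
```

Which network or policy snapshot supplies `state_values` (before or after the update) is exactly the point in question, so the sketch takes the values as plain inputs rather than assuming either.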