
[BUG]: ChatGPT: why total reward is reward = r - kl_coef * kl, not total_reward=r + gamma * critic(next_states)?

Open liukaiyueyuo opened this issue 2 years ago • 1 comments

🐛 Describe the bug

ChatGPT: why is the total reward computed as `reward = r - kl_coef * kl` rather than `total_reward = r + gamma * critic(next_states)`?

Environment

No response

liukaiyueyuo avatar Feb 21 '23 03:02 liukaiyueyuo

Because, as we see it, the RL training process here is a one-step process: each episode ends after a single step, so there is no `next_state` to bootstrap from.
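As a rough illustration of the difference between the two formulas in the question, here is a minimal sketch. It is not ColossalAI's actual code: the function names, the default coefficients, and the choice to estimate the KL term as a per-sample log-probability difference between the actor and a reference model are all assumptions made for the example.

```python
def kl_penalized_reward(r, log_prob_actor, log_prob_ref, kl_coef=0.1):
    """One-step reward: r - kl_coef * kl.

    Assumption: kl is estimated per sample as the difference between
    the actor's and the reference model's log-probabilities.
    """
    kl = log_prob_actor - log_prob_ref
    return r - kl_coef * kl


def td_target(r, next_value, gamma=0.99, done=False):
    """Bootstrapped target r + gamma * V(next_state).

    When the step is terminal (done=True) there is no next state,
    so the bootstrap term is dropped and only r remains.
    """
    bootstrap = 0.0 if done else gamma * next_value
    return r + bootstrap


# In a one-step setting every step is terminal, so the critic's
# estimate of a next state never enters the target.
one_step = td_target(r=1.0, next_value=5.0, done=True)
penalized = kl_penalized_reward(r=1.0, log_prob_actor=-1.0, log_prob_ref=-1.5)
```

With `done=True` the bootstrapped target reduces to the plain reward `r`, so in a one-step setting the only remaining adjustment is the KL penalty.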

ht-zhou avatar Feb 21 '23 03:02 ht-zhou

I'll close this issue now; please reopen it if you have further questions.

ht-zhou avatar Feb 22 '23 02:02 ht-zhou