EfficientZero
Question about the effect of the discount factor and done mask when calculating the target value
Thank you very much for open-sourcing your code.
This is a common definition of a target value in classical RL:
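Written out, the n-step form I have in mind is roughly the following. The notation is my own and assumed for illustration: $\gamma$ is the discount factor, $n$ is the number of TD steps, $r_{t+i}$ are the observed rewards, $d_{t+n}$ is 1 if the bootstrap state is terminal and 0 otherwise, and $V$ is the value estimate.

```latex
z_t = \sum_{i=0}^{n-1} \gamma^{i} \, r_{t+i} \;+\; \gamma^{n} \, (1 - d_{t+n}) \, V(s_{t+n})
```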

I'm a little confused about how the target value is calculated in reanalyze_worker.py:
Why is the bootstrap value (value_lst here) not multiplied by discount_factor^td_steps, and why is the bootstrap value not masked when the target obs is a done state?
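To make the question concrete, here is a minimal sketch of the computation I would have expected. It is not taken from reanalyze_worker.py: the function name `n_step_target`, its arguments, and the default discount are hypothetical and chosen only for illustration.

```python
import numpy as np


def n_step_target(rewards, bootstrap_value, done, discount=0.997):
    """Hypothetical n-step target: discounted rewards plus a masked bootstrap.

    rewards: the n rewards observed between the current state and the
        bootstrap state (so n plays the role of td_steps).
    bootstrap_value: value estimate at the bootstrap state.
    done: True if the bootstrap state is terminal.
    discount: discount factor (the default here is an assumed value).
    """
    rewards = np.asarray(rewards, dtype=np.float64)
    n = len(rewards)

    # Discounted sum of the n observed rewards: sum_i discount^i * r_i.
    discounts = discount ** np.arange(n)
    reward_sum = float(np.sum(discounts * rewards))

    # Zero out the bootstrap value when the bootstrap state is terminal.
    mask = 0.0 if done else 1.0

    # Discount the bootstrap value by discount^n before adding it.
    return reward_sum + (discount ** n) * mask * bootstrap_value
```

For example, with rewards `[1, 1]`, a bootstrap value of `10`, and a discount of `0.5`, this gives `1 + 0.5 + 0.25 * 10` when the bootstrap state is not terminal, and only `1 + 0.5` when it is.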
Looking forward to your reply!