Aaron Havens
Results
2
comments of
Aaron Havens
@natsuki14 It is the meta-learning objective of the master policy to choose the best sub-policy to reach the current goal. The correct goal remains a latent condition of the MDP....
I agree with you. We probably shouldn't be rolling in the expectation of a different policy into a sub-policy advantage. They are practically different MDP's.