xzhang
Hi, I notice that in your code, mean_kl is always 0:

    constraint_grad = flat_grad(constraint_loss, self.policy.parameters(), retain_graph=True)  # (b)
    mean_kl = mean_kl_first_fixed(actions_dists, actions_dists)
    Fvp_fun = get_Hvp_fun(mean_kl, self.policy.parameters())

What is the meaning of a...
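If it helps, here is a minimal sketch of the double-backward pattern I think is intended (a toy Normal policy and plain torch.distributions calls, not the repo's mean_kl_first_fixed / get_Hvp_fun): the KL against a detached copy of the same distribution is 0 and its gradient is 0, but its Hessian is the Fisher information matrix, so the Hessian-vector product is still nonzero.

    import torch
    from torch.distributions import Normal
    from torch.distributions.kl import kl_divergence

    mean = torch.zeros(3, requires_grad=True)
    new_dist = Normal(mean, torch.ones(3))
    fixed_dist = Normal(mean.detach(), torch.ones(3))  # "first fixed": no gradient flows through it

    mean_kl = kl_divergence(fixed_dist, new_dist).mean()
    print(mean_kl.item())  # 0.0 at the current parameters, as observed

    # Double backward: the gradient of the KL is also zero here, but grad-of-grad is not.
    grad = torch.autograd.grad(mean_kl, mean, create_graph=True)[0]
    v = torch.randn(3)
    hvp = torch.autograd.grad(torch.dot(grad, v), mean)[0]
    print(hvp)  # nonzero Fisher-vector product, even though mean_kl == 0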
There is no "ant_gather" in the envs folder.
    reward_advs -= reward_advs.mean()
    reward_advs /= reward_advs.std()
    cost_advs -= reward_advs.mean()
    cost_advs /= cost_advs.std()

I guess on the third line it should be the mean of the cost, not reward_advs.mean()?
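For reference, a self-contained sketch of what I assume was intended (dummy tensors of my own; this is my guess at the fix, not the author's confirmed change):

    import torch

    # Dummy advantage estimates, just to make the snippet runnable.
    reward_advs = torch.randn(64)
    cost_advs = torch.randn(64)

    # Standardise each advantage vector with its own statistics.
    reward_advs = (reward_advs - reward_advs.mean()) / reward_advs.std()
    cost_advs = (cost_advs - cost_advs.mean()) / cost_advs.std()  # cost mean, not reward mean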
    log_action_probs = action_dists.log_prob(actions)
    imp_sampling = torch.exp(log_action_probs - log_action_probs.detach())
    # Change to torch.matmul
    reward_loss = -torch.mean(imp_sampling * reward_advs)

Since log_action_probs - log_action_probs.detach() = 0, imp_sampling is an all-ones vector.
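A self-contained toy check of what that block computes (the categorical policy and random advantages are stand-ins of mine, not the repo's classes): the ratio does evaluate to all ones, but the detached term is a constant to autograd, so the surrogate still backpropagates the ordinary policy gradient.

    import torch
    from torch.distributions import Categorical

    logits = torch.randn(5, 3, requires_grad=True)
    action_dists = Categorical(logits=logits)
    actions = action_dists.sample()
    reward_advs = torch.randn(5)

    log_action_probs = action_dists.log_prob(actions)
    imp_sampling = torch.exp(log_action_probs - log_action_probs.detach())
    print(imp_sampling)  # all ones, as noted

    reward_loss = -torch.mean(imp_sampling * reward_advs)
    reward_loss.backward()
    print(logits.grad)   # nonzero: same gradient as -mean(reward_advs * log_action_probs)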