How "trajectory divergence term" is calculated in compute_cost function in algorithm_traj_opt.py
Could you point me to a reference for this part of the code? Thank you!
```python
def compute_costs(self, m, eta, augment=True):
    """ Compute cost estimates used in the LQR backward pass. """
    traj_info, traj_distr = self.cur[m].traj_info, self.cur[m].traj_distr
    if not augment:  # Whether to augment cost with term to penalize KL.
        return traj_info.Cm, traj_info.cv
    multiplier = self._hyperparams['max_ent_traj']
    fCm, fcv = traj_info.Cm / (eta + multiplier), traj_info.cv / (eta + multiplier)
    K, ipc, k = traj_distr.K, traj_distr.inv_pol_covar, traj_distr.k
    # Add in the trajectory divergence term.
    for t in range(self.T - 1, -1, -1):
        fCm[t, :, :] += eta / (eta + multiplier) * np.vstack([
            np.hstack([
                K[t, :, :].T.dot(ipc[t, :, :]).dot(K[t, :, :]),
                -K[t, :, :].T.dot(ipc[t, :, :])
            ]),
            np.hstack([
                -ipc[t, :, :].dot(K[t, :, :]), ipc[t, :, :]
            ])
        ])
        fcv[t, :] += eta / (eta + multiplier) * np.hstack([
            K[t, :, :].T.dot(ipc[t, :, :]).dot(k[t, :]),
            -ipc[t, :, :].dot(k[t, :])
        ])
    return fCm, fcv
```
This part adds the divergence between the predicted trajectory and the sampled trajectory as an additional cost, i.e. $(Kx + k - u)^\top \Sigma^{-1} (Kx + k - u)$, where $u$ is the sampled action from the data, $Kx + k$ is the predicted action from the global policy network, and $\Sigma^{-1}$ is the inverse policy covariance (`inv_pol_covar`).
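For reference, here is how that quadratic form expands into the block matrices used in the loop. This is my own derivation, not from the repo; I write $P = \Sigma^{-1}$ for `inv_pol_covar` and assume the usual GPS cost convention $\text{cost}(z) = \tfrac{1}{2} z^\top C_m z + c_v^\top z$ with $z = [x; u]$:

```latex
\frac{1}{2}(u - Kx - k)^\top P \,(u - Kx - k)
  = \frac{1}{2}
    \begin{bmatrix} x \\ u \end{bmatrix}^\top
    \begin{bmatrix} K^\top P K & -K^\top P \\ -P K & P \end{bmatrix}
    \begin{bmatrix} x \\ u \end{bmatrix}
  + \begin{bmatrix} K^\top P k \\ -P k \end{bmatrix}^\top
    \begin{bmatrix} x \\ u \end{bmatrix}
  + \frac{1}{2} k^\top P k
```

The two bracketed blocks are exactly the `fCm[t]` and `fcv[t]` increments in the loop above, scaled by `eta / (eta + multiplier)`; the constant $\tfrac{1}{2} k^\top P k$ is dropped since it does not affect the minimizer.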
@wangsd01 Hi, I am also looking at these lines. Have you solved the problem?
I am not 100% sure what's happening, but one thing that looks especially suspicious to me is that the derivative with respect to u is Cov^{-1}.dot(k_old).
In the code repo, looking at the forward pass, it uses u = Kx + k rather than u = K(x - x_old) + k + u_old. So I feel that if we actually take the derivative of the KL penalty with respect to u, we should get something like Cov^{-1}.dot(u_new - u_old) = Cov^{-1}.dot(K_new.dot(x) - K_old.dot(x) + k_new - k_old) != Cov^{-1}.dot(k_old).
Not sure if I missed anything. It would be great if you could help :( @cbfinn
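In case it helps the discussion, here is a small self-contained NumPy sketch (hypothetical shapes and random data, assuming the same cost convention as in the expansion above) checking that the block terms reproduce the quadratic penalty, and what the gradient with respect to u then comes out to:

```python
import numpy as np

# Hypothetical sizes: dX = state dim, dU = action dim.
rng = np.random.default_rng(0)
dX, dU = 4, 2

K = rng.standard_normal((dU, dX))      # feedback gain
k = rng.standard_normal(dU)            # open-loop term
A = rng.standard_normal((dU, dU))
P = A @ A.T + dU * np.eye(dU)          # inverse policy covariance (SPD)

# Block quadratic/linear terms, as added to fCm[t] and fcv[t] in the loop
# (the eta / (eta + multiplier) scaling is omitted here).
Cm = np.vstack([
    np.hstack([K.T @ P @ K, -K.T @ P]),
    np.hstack([-P @ K, P]),
])
cv = np.hstack([K.T @ P @ k, -P @ k])

x = rng.standard_normal(dX)
u = rng.standard_normal(dU)
z = np.hstack([x, u])

# Block form (plus the dropped constant) vs. the quadratic penalty.
lhs = 0.5 * z @ Cm @ z + cv @ z + 0.5 * k @ P @ k
d = u - K @ x - k
rhs = 0.5 * d @ P @ d
assert np.isclose(lhs, rhs), (lhs, rhs)

# Gradient of the block form w.r.t. u: quadratic part + linear part.
grad_u = (Cm @ z + cv)[dX:]
assert np.allclose(grad_u, P @ d)      # = P (u - K x - k)
print("block expansion and gradient check passed")
```

Algebraically, then, the -P.dot(k) linear term combines with the quadratic term so that the gradient with respect to u is P.dot(u - K.dot(x) - k), not P.dot(k) alone; whether K and k here should be read as the old or the new distribution's parameters is the remaining question.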