
action output and policy_step_spec structures do not match:

Open · PeterDomanski opened this issue 3 years ago · 7 comments

If I use e.g. the REINFORCE algorithm, the call agent.collect_policy.action(time_step, policy_state) works, as it does for all other algorithms except PPO. Here the PPO policy (which inherits from ActorPolicy) outputs the error message "action output and policy_step_spec structures do not match:". More details about the specs:

PolicyStep(action=., state={'actor_network_state': [., .]}, info={'dist_params': {'loc': .}}) vs. PolicyStep(action=., state=DictWrapper({'actor_network_state': ListWrapper([., .])}), info=DictWrapper({'dist_params': DictWrapper({'loc': ., 'scale_diag': .})}))

This seems to be a problem with the PPO policy, which is perhaps unable to unwrap the DictWrapper or ListWrapper. Moreover, there is a problem with the dimensions that only occurs with PPO:

PolicyStep(action=<tf.Tensor: shape=(1, 1), dtype=float32, numpy=array([[0.9669268]], dtype=float32)>, state={'actor_network_state': [<tf.Tensor: shape=(1, 64), dtype=float32, numpy= array([[...]], dtype=float32)>, <tf.Tensor: shape=(1, 64), dtype=float32, numpy= array([[...]], dtype=float32)>]}, info={'dist_params': {'loc': <tf.Tensor: shape=(1, 1), dtype=float32, numpy=array([[...]], dtype=float32)>}}) vs. PolicyStep(action=BoundedTensorSpec(shape=(1,), dtype=tf.float32, name='action', minimum=array(0., dtype=float32), maximum=array(1., dtype=float32)), state=DictWrapper({'actor_network_state': ListWrapper([TensorSpec(shape=(64,), dtype=tf.float32, name='network_state_0'), TensorSpec(shape=(64,), dtype=tf.float32, name='network_state_1')])}), info=DictWrapper({'dist_params': DictWrapper({'loc': TensorSpec(shape=(1,), dtype=tf.float32, name='TanhNormalProjectionNetwork_loc'), 'scale_diag': TensorSpec(shape=(1,), dtype=tf.float32, name='TanhNormalProjectionNetwork_scale_diag')})}))

Note: I removed the values inside the numpy arrays here to make this easier to read.

This issue only affects the collect policy. The PPO policy from agent.policy (a greedy policy) works fine with the above specs/inputs. Any suggestions as to why this happens only with the PPO agent/policy? Could somebody provide a workaround or a fix?
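
To make the error concrete: the message comes from a nested-structure check, and the emitted info is missing the scale_diag entry that the spec expects. A self-contained illustration of the same kind of mismatch (this is not the tf_agents code, just a sketch using tf.nest):

```python
# Illustration only (not tf_agents internals): a structure check such as
# tf.nest.assert_same_structure fails when the emitted info is missing the
# 'scale_diag' entry that the info spec declares.
import tensorflow as tf

emitted_info = {'dist_params': {'loc': tf.constant([[0.97]])}}
expected_info_spec = {
    'dist_params': {'loc': tf.TensorSpec([1], tf.float32),
                    'scale_diag': tf.TensorSpec([1], tf.float32)},
}

tf.nest.assert_same_structure(emitted_info, expected_info_spec)
# -> ValueError: the two structures don't have the same nested structure.
```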

PeterDomanski · Apr 05 '22 15:04

It seems that the collect policy doesn't have scale_diag as part of dist_params, so I suppose there is a mismatch between the policy used to collect the data and the policy used for training. One uses a fixed std and the other uses scale_diag.

sguada · Apr 05 '22 19:04

Should that be changed for the PPO policy, since all other greedy/actor policies have a different structure that includes the scale_diag argument (as part of dist_params)?

PeterDomanski · Apr 05 '22 21:04

What I meant is that one should use the same policy to collect the data as is used for training, so mixing policies between algorithms is not guaranteed to work.

sguada · Apr 06 '22 10:04

It is the same PPO policy, but for collection the examples here show using PPO's collect policy (ppo_agent.collect_policy), and for training one uses ppo_agent.policy (the greedy policy). This works for all RL algorithms in TF Agents except PPO.

PeterDomanski · Apr 06 '22 11:04

I'm not sure what code you are running, or what you mean by using the greedy policy in training, since that should only be used for eval. Are you using the PPO example or a different code base?

sguada · Apr 06 '22 11:04

My code looks more like this example (https://www.tensorflow.org/agents/tutorials/6_reinforce_tutorial), but instead of the REINFORCE agent I use the PPO agent (https://github.com/tensorflow/agents/blob/master/tf_agents/agents/ppo/ppo_agent.py). In addition, I use RNN actor and value networks and thus call policy.action(time_step, policy_state). In that call, policy_state is causing the error; a rough sketch of the setup is shown below.
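
For context, a rough sketch of that kind of setup (environment name and hyperparameters are placeholders, not a verified reproduction):

```python
# Rough sketch: the REINFORCE agent from the linked tutorial swapped for a
# PPO agent with RNN actor and value networks. Placeholder hyperparameters.
import tensorflow as tf
from tf_agents.agents.ppo import ppo_agent
from tf_agents.environments import suite_gym, tf_py_environment
from tf_agents.networks import actor_distribution_rnn_network, value_rnn_network

train_env = tf_py_environment.TFPyEnvironment(suite_gym.load('Pendulum-v1'))

actor_net = actor_distribution_rnn_network.ActorDistributionRnnNetwork(
    train_env.observation_spec(), train_env.action_spec(), lstm_size=(64,))
value_net = value_rnn_network.ValueRnnNetwork(
    train_env.observation_spec(), lstm_size=(64,))

agent = ppo_agent.PPOAgent(
    train_env.time_step_spec(),
    train_env.action_spec(),
    optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3),
    actor_net=actor_net,
    value_net=value_net)
agent.initialize()

time_step = train_env.reset()
policy_state = agent.collect_policy.get_initial_state(train_env.batch_size)
# The structure-mismatch error described above is raised from this call.
policy_step = agent.collect_policy.action(time_step, policy_state)
```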

PeterDomanski · Apr 06 '22 11:04

The issue is with how ppo_utils.get_distribution_params() is used inside the PPOPolicy class. When the PPOClipAgent has an ActorDistributionNetwork, the scale parameter of the output distribution is (for some reason) not a tensor, and it is filtered out by ppo_utils.get_distribution_params().
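
As an illustration of that effect (this is a sketch, not the ppo_utils implementation): a parameter that is a plain Python float rather than a tensor would be dropped by a tensor-only filter, which matches the missing scale_diag entry in the specs above.

```python
# Illustrative sketch only: a tensor-only filter drops a fixed, non-tensor
# scale, leaving an info dict without 'scale_diag'.
import tensorflow as tf

dist_params = {
    'loc': tf.constant([[0.5]]),  # produced by the network -> a tf.Tensor
    'scale_diag': 0.35,           # a fixed float, not a tensor
}
filtered = {k: v for k, v in dist_params.items() if tf.is_tensor(v)}
print(filtered)  # {'loc': <tf.Tensor ...>}; 'scale_diag' has been dropped
```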

samarth-robo · Aug 26 '22 21:08