[BUG] Potential Recurrent IPPO Bug
Describe the bug
Its possible there is a slight error in the resetting of hidden states for IPPO. Instead of using the done flag to reset, it would be simpler and ensure correctness to use the timestep.first() flag to reset the hidden state. I've linked below a picture of a rware tiny-2ag run using the fix and using the current develop branch. Although both are increasing there is roughly a 100% or 2x performance at each interval point.
Below is the Actor Mean Return:
Below is the Evaluator Mean Return:
Although this could be variance, nothing was changed between the experiments besides the fix. I've pushed the branch that ran the orange line experiment and it is called fix/recurrent-ppo. This would be a worthwhile investigation.
A link to the branch with this fix.
An update on this, the suggested fix uses timestep.first() to reset the hidden states, but since no timestep will ever be a timestep.first() when using Jumanji's auto reset wrapper the hidden states were just never getting reset.
Not ever resetting the hidden states does lead to far better performance for the recurrent systems though so this is a whole other question to consider in itself.
Bug was fixed in #1076