Mava [BUG] Potential Recurrent IPPO Bug

Describe the bug

Its possible there is a slight error in the resetting of hidden states for IPPO. Instead of using the done flag to reset, it would be simpler and ensure correctness to use the timestep.first() flag to reset the hidden state. I've linked below a picture of a rware tiny-2ag run using the fix and using the current develop branch. Although both are increasing there is roughly a 100% or 2x performance at each interval point.

Below is the Actor Mean Return: Screenshot 2024-02-07 at 11 24 49

Below is the Evaluator Mean Return: Screenshot 2024-02-07 at 11 26 06

Although this could be variance, nothing was changed between the experiments besides the fix. I've pushed the branch that ran the orange line experiment and it is called fix/recurrent-ppo. This would be a worthwhile investigation.

Feb 07 '24 09:02 EdanToledo

A link to the branch with this fix.

Feb 09 '24 14:02 RuanJohn

An update on this, the suggested fix uses timestep.first() to reset the hidden states, but since no timestep will ever be a timestep.first() when using Jumanji's auto reset wrapper the hidden states were just never getting reset. Not ever resetting the hidden states does lead to far better performance for the recurrent systems though so this is a whole other question to consider in itself.

Feb 13 '24 11:02 RuanJohn

Bug was fixed in #1076

Jun 14 '24 12:06 sash-a