Potential modification of flat_rewards after batch-creation
Experiments using a ReplayBuffer have led to surprising results: the model was unable to learn when our list of flat_rewards (tensors of shape (2,)) was stacked into a tensor of shape (batch, 2) before being pushed into the buffer. We were able to fix the issue by creating a copy of each item pushed into the replay buffer (note: copying only flat_rewards may have been sufficient).
- Code state with the bug (commit 6a1b7b0)
- Fix (commit 8e1163a)
Our hypothesis is that flat_rewards (and potentially other tensors) are modified in place after the batch is created. This was harmless on-policy, where the batch is discarded after the parameter update, but becomes harmful with a replay buffer: the in-place modification also alters the flat_rewards stored in the buffer, so the model trains on wrong trajectory-reward pairs when those trajectories are later re-sampled. It could be worth identifying which operation mutates the buffer data, to validate that this operation is intentional and should indeed occur.
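If this hypothesis is correct, the failure mode can be sketched as follows. This is a hypothetical minimal reproduction, not the actual code from the commits above: NumPy stands in for PyTorch (views and in-place ops alias the same way), and the names `batch_rewards` and the in-place reward transform are illustrative assumptions.

```python
import numpy as np

# Illustrative stand-in for the list of flat_rewards, each of shape (2,).
flat_rewards = [np.array([1.0, 2.0]), np.array([3.0, 4.0])]

# Stacking into a (batch, 2) tensor and pushing rows into the buffer
# stores *views* that share memory with the batch tensor.
batch_rewards = np.stack(flat_rewards)
buffer = [batch_rewards[i] for i in range(len(batch_rewards))]

# The fix: push independent copies instead (.clone() in PyTorch).
safe_buffer = [batch_rewards[i].copy() for i in range(len(batch_rewards))]

# A later in-place modification of the batch (e.g. a reward transform)...
batch_rewards *= 0.0

# ...silently corrupts the view-based buffer, but not the copies.
print(buffer[0])       # [0. 0.] -- wrong reward for the stored trajectory
print(safe_buffer[0])  # [1. 2.] -- preserved
```

Re-sampling from the view-based `buffer` would then pair the original trajectory with the mutated reward, which matches the observed inability to learn.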