Potential modification of flat_rewards after batch-creation
Experiments using a ReplayBuffer have led to surprising results: the model was unable to learn when our list of flat_rewards (tensors of shape (2,)) was stacked into a tensor of shape (batch, 2) before being pushed into the buffer. We were able to fix the issue by creating a copy of each item pushed into the replay buffer (note: copying only flat_rewards may have been sufficient).
- Code state with the bug (commit 6a1b7b0)
- Fix (commit 8e1163a)
Our hypothesis is that flat_rewards (and potentially other tensors) are modified in place after the batch is created. This was harmless on-policy, where the batch is discarded after the parameter update, but becomes harmful with a replay buffer: the in-place modification also alters the flat_rewards stored in the buffer, so the model trains on wrong trajectory-reward pairs when those trajectories are later re-sampled. It could be worth identifying which operation mutates the buffer data, to validate that this operation is intentional and should indeed occur.
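If this hypothesis is correct, the failure mode can be sketched as follows. This is a hypothetical minimal reproduction, not the actual code from the commits above: NumPy stands in for PyTorch (views and in-place ops alias the same way), and the names `batch_rewards` and the in-place reward transform are illustrative assumptions.

```python
import numpy as np

# Illustrative stand-in for the list of flat_rewards, each of shape (2,).
flat_rewards = [np.array([1.0, 2.0]), np.array([3.0, 4.0])]

# Stacking into a (batch, 2) tensor and pushing rows into the buffer
# stores *views* that share memory with the batch tensor.
batch_rewards = np.stack(flat_rewards)
buffer = [batch_rewards[i] for i in range(len(batch_rewards))]

# The fix: push independent copies instead (.clone() in PyTorch).
safe_buffer = [batch_rewards[i].copy() for i in range(len(batch_rewards))]

# A later in-place modification of the batch (e.g. a reward transform)...
batch_rewards *= 0.0

# ...silently corrupts the view-based buffer, but not the copies.
print(buffer[0])       # [0. 0.] -- wrong reward for the stored trajectory
print(safe_buffer[0])  # [1. 2.] -- preserved
```

Re-sampling from the view-based `buffer` would then pair the original trajectory with the mutated reward, which matches the observed inability to learn.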