In TF-Agents, is there a way to retain the order/sequence of observations generated by the environment in the driver and replay buffer?
I am using tf_agents for a contextual bandit algorithm. My data is at the user level, so it is very important that actions and rewards (and the trajectories used for training) are generated in sequence, so that I can tie them back to the sequence of user ids in the dataset. Here is the flow:
- Observations (and actions, rewards) are generated by a custom environment in a fixed sequence (say a, b, c).
- When the driver runs and saves to the replay buffer, that sequence needs to be maintained.
- Because my rewards are delayed and only observed after a few days, the trajectories read back from the replay buffer (at prediction time) need to be updated with the rewards once they arrive.
Since the user id is not stored anywhere in the trajectory, if the replay buffer retains the order of the observations (actions, rewards) generated by the environment, I can use that order to match each trajectory back to its user_id and update the rewards on the predicted trajectories.
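The matching scheme I have in mind, as a plain-Python sketch (the names `user_ids`, `pending`, and `delayed_rewards` are illustrative, not tf_agents API):

```python
# Side-table of user ids, kept in the same order the environment
# emitted observations. If the replay buffer preserves that order,
# position i in the buffer corresponds to user_ids[i].
user_ids = ["u_a", "u_b", "u_c"]

# Trajectories read back from the buffer in the same order,
# with placeholder rewards (rewards not yet observed).
pending = [
    {"action": 0, "reward": 0.0},
    {"action": 1, "reward": 0.0},
    {"action": 2, "reward": 0.0},
]

# Rewards that arrive days later, keyed by user_id.
delayed_rewards = {"u_a": 0.0, "u_b": 1.0, "u_c": 1.0}

# Join by position: trajectory i belongs to user_ids[i].
for i, traj in enumerate(pending):
    traj["reward"] = delayed_rewards[user_ids[i]]

print([t["reward"] for t in pending])  # [0.0, 1.0, 1.0]
```

This only works if the buffer's read order matches the environment's generation order, which is exactly what I cannot get.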
However, even with single_deterministic_pass=True on the as_dataset method of TFUniformReplayBuffer, the trajectories in a batch come back in random order, not in the order the observations were generated. Is there a way to maintain that order in the driver/replay buffer (the same order as the observations generated by the environment)?