Small bug in the function return_range

Open yaoliucs opened this issue 2 years ago • 0 comments

In the function return_range, the end of a trajectory is marked either by the terminal signal or by the step count reaching max_episode_steps. However, because the dataset is extracted with D4RL's qlearning_dataset function, the second condition (step count equal to max_episode_steps) does not hold for the dataset.

This is because qlearning_dataset throws away the last transition when a trajectory ends by timeout. As a result, a large part of the trajectories have only max_episode_steps - 1 transitions in the returned dataset, so the episodic return calculated by return_range is shifted by one step's reward. This will probably have a tiny effect on the max and min return.
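
To illustrate, here is a minimal self-contained sketch on synthetic data. It does not call D4RL or the repository's code: `drop_timeout_transitions` and `split_returns` are hypothetical helpers that only mimic, as an assumption, the filtering and the return splitting described above, and the rewards are made up so that each trajectory is easy to tell apart.

```python
def drop_timeout_transitions(rewards, terminals, timeouts):
    """Hypothetical stand-in for the filtering described above:
    skip every transition that is flagged as a timeout."""
    kept_rewards, kept_terminals = [], []
    for r, d, t in zip(rewards, terminals, timeouts):
        if t:
            continue  # the last transition of a timed-out trajectory is dropped
        kept_rewards.append(r)
        kept_terminals.append(d)
    return kept_rewards, kept_terminals


def split_returns(rewards, terminals, max_episode_steps):
    """Hypothetical stand-in for the splitting logic: close an episode on a
    terminal flag or when the step counter reaches max_episode_steps."""
    returns = []
    ep_ret, ep_len = 0.0, 0
    for r, d in zip(rewards, terminals):
        ep_ret += r
        ep_len += 1
        if d or ep_len == max_episode_steps:
            returns.append(ep_ret)
            ep_ret, ep_len = 0.0, 0
    return returns


max_episode_steps = 5
# Three synthetic trajectories, each ending by timeout after 5 steps.
rewards = [1.0] * 5 + [10.0] * 5 + [100.0] * 5
terminals = [False] * 15
timeouts = [(i + 1) % max_episode_steps == 0 for i in range(15)]

kept_r, kept_d = drop_timeout_transitions(rewards, terminals, timeouts)

print(len(kept_r))                                           # 12: four transitions per trajectory
print(split_returns(kept_r, kept_d, max_episode_steps))      # [14.0, 230.0]
print(split_returns(kept_r, kept_d, max_episode_steps - 1))  # [4.0, 40.0, 400.0]
```

In this toy setup, counting up to max_episode_steps pulls rewards from the next trajectory into the current one, while counting up to max_episode_steps - 1 recovers the per-trajectory sums of the kept transitions. Whether the same boundaries apply to a real dataset depends on how its trajectories actually end, so treat this only as an illustration of the mismatch.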

Ref: Specifically, see the logic in this line from D4RL: https://github.com/Farama-Foundation/D4RL/blob/71a9549f2091accff93eeff68f1f3ab2c0e0a288/d4rl/init.py#L117

Feb 15 '23 22:02 yaoliucs