Revisiting Peng's Q(lambda) for Modern Reinforcement Learning @ ICML 2021
This is the open-source implementation of a few important multi-step deep RL algorithms discussed in the ICML 2021 paper. We implement these algorithms mainly in combination with TD3, an actor-critic algorithm for continuous control.
The code is based on the SpinningUp deep RL library. We greatly appreciate the open-source efforts of its authors!
This code base implements a few multi-step algorithms, including Peng's Q(lambda), uncorrected n-step, Retrace, and tree-backup.
Installation
Follow the instructions for installing SpinningUp. You might also need environment libraries such as Gym and MuJoCo.
You might also want to check out the DeepMind Control Suite and PyBullet to train with other environments.
Introduction to the code structure
The code is located under the sub-directory spinup/algos/tf1/td3_peng. Two main files implement the algorithms.
- The file td3_peng.py implements Peng's Q(lambda) and uncorrected n-step, with a deterministic policy.
- The file td3_retrace.py implements Retrace and tree-backup, with a stochastic policy.
A few important aspects of the implementation:
- We use an n-step replay buffer that collects and samples partial trajectories of length n. The n-step transition collection is implemented by an environment wrapper in wrapper.py, and the buffer itself is implemented in the main files; see the first sketch after this list.
- We compute Q-function targets with two critics to reduce over-estimation. Targets are computed recursively with Peng's Q(lambda), uncorrected n-step, Retrace, or tree-backup; see the second sketch after this list.
- We share hyper-parameters (including architectures, batch size, optimizer, learning rate, etc.) with the original baseline as much as possible.
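As a rough illustration of the n-step transition collection, the following sketch keeps a sliding window of the last n transitions so that length-n partial trajectories can be stored in the replay buffer. This is not the repository's wrapper.py; the class name, the old 4-tuple Gym step API, and the use of the info dict are assumptions made for illustration only.

```python
# Hypothetical sketch of n-step transition collection via a Gym wrapper.
# This is NOT the repository's wrapper.py; names and the old 4-tuple step
# API are assumptions made for illustration only.
from collections import deque

import gym


class NStepCollector(gym.Wrapper):
    def __init__(self, env, n=5):
        super().__init__(env)
        self.n = n
        self.window = deque(maxlen=n)  # sliding window of the last n transitions
        self.last_obs = None

    def reset(self, **kwargs):
        obs = self.env.reset(**kwargs)
        self.window.clear()
        self.last_obs = obs
        return obs

    def step(self, action):
        obs, rew, done, info = self.env.step(action)
        self.window.append((self.last_obs, action, rew, obs, done))
        self.last_obs = obs
        # Expose the current partial trajectory (up to length n) so the
        # training loop can push it into the n-step replay buffer.
        info["partial_trajectory"] = list(self.window)
        return obs, rew, done, info
```

The recursive target computation can likewise be sketched as follows. This is a simplified, hypothetical version (the function name, array layout, and termination handling are our own assumptions, not the repository's exact code): given a length-n partial trajectory with rewards r_0, ..., r_{n-1} and bootstrap values V(s_1), ..., V(s_n), where V is the minimum over the two target critics evaluated at the target policy's action, it computes Peng's Q(lambda) targets backwards from the end of the trajectory. Setting lambda=1.0 recovers the uncorrected n-step return.

```python
# Simplified sketch of recursive Peng's Q(lambda) targets for one length-n
# partial trajectory; an illustration under the assumptions stated above,
# not the repository's exact code.
import numpy as np


def peng_qlambda_targets(rewards, boot_values, dones, gamma=0.99, lam=0.7):
    """rewards:     shape [n], r_0, ..., r_{n-1}
    boot_values: shape [n], V(s_1), ..., V(s_n) from the two target critics
    dones:       shape [n], 1.0 if s_{k+1} is terminal, else 0.0
    Returns Peng's Q(lambda) targets G_0, ..., G_{n-1}.
    """
    n = len(rewards)
    targets = np.zeros(n)
    next_return = boot_values[-1]  # bootstrap at the end of the partial trajectory
    for k in reversed(range(n)):
        not_done = 1.0 - dones[k]
        # Interpolate between bootstrapping immediately (weight 1 - lambda)
        # and continuing the lambda-return (weight lambda).
        targets[k] = rewards[k] + gamma * not_done * (
            (1.0 - lam) * boot_values[k] + lam * next_return
        )
        next_return = targets[k]
    return targets
```

Retrace and tree-backup follow the same recursive pattern but scale the lambda term with truncated importance ratios or target-policy probabilities, respectively, which is why they are implemented with a stochastic policy in td3_retrace.py.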
Running the code
To run Peng's Q(lambda) (here with lambda=0.7) with a delayed environment (delay k=3) and an n-step buffer with n=5, run the following:
python td3_peng.py --env HalfCheetah-v1 --seed 100 --delay 3 --nstep 5 --lambda_ 0.7
To run uncorrected n-step with a delayed environment (delay k=3) and an n-step buffer with n=5, set lambda=1.0 and run the following:
python td3_peng.py --env HalfCheetah-v1 --seed 100 --delay 3 --nstep 5 --lambda_ 1.0
To run Retrace with a delayed environment (delay k=3) and an n-step buffer with n=5, set lambda=1.0 and run the following:
python td3_retrace.py --update-mode retrace --env HalfCheetah-v1 --seed 100 --delay 3 --nstep 5 --lambda_ 1.0
Examining the results
The main files log diagnostics and statistics to the terminal during training. Each run also saves the evaluated returns and training time steps to a newly created sub-directory.
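If you want to plot the saved returns, a minimal sketch is below. The file name and column names (progress.txt with TotalEnvInteracts and AverageTestEpRet) follow SpinningUp's logger conventions and are assumptions; adjust them to match the files produced by your run.

```python
# Hypothetical sketch for plotting saved evaluation returns. The file and
# column names follow SpinningUp's logger conventions and may differ from
# what this repository actually writes; adjust as needed.
import matplotlib.pyplot as plt
import pandas as pd

df = pd.read_csv("path/to/run_dir/progress.txt", sep="\t")
plt.plot(df["TotalEnvInteracts"], df["AverageTestEpRet"])
plt.xlabel("environment steps")
plt.ylabel("evaluation return")
plt.show()
```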
Citation
If you find this code base useful, you are encouraged to cite the following paper:
@article{kozuno2021revisiting,
title={Revisiting Peng's Q($\lambda$) for Modern Reinforcement Learning},
author={Kozuno, Tadashi and Tang, Yunhao and Rowland, Mark and Munos, R{\'e}mi and Kapturowski, Steven and Dabney, Will and Valko, Michal and Abel, David},
journal={arXiv preprint arXiv:2103.00107},
year={2021}
}