Unrealistic rewards for InvertedDoublePendulum
Hi,
I'm running the code as-is for the InvertedDoublePendulum-v1 environment. The output log looks like:
[2016-09-29 02:55:12,968] Making new env: InvertedDoublePendulum-v1
[2016-09-29 02:55:13,029] OpenGL_accelerate module loaded
[2016-09-29 02:55:13,076] Using accelerated ArrayDatatype
outdir: ddpg-results/IP/
True action space: [-1.], [ 1.]
True state space: [-inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf], [ inf inf inf inf inf inf inf inf inf inf inf]
Filtered action space: [-1.], [ 1.]
Filtered state space: [-inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf], [ inf inf inf inf inf inf inf inf inf inf inf]
((11,), (1,))
{'_entry_point': 'gym.envs.mujoco:InvertedDoublePendulumEnv',
'_env_name': 'InvertedDoublePendulum',
'_kwargs': {},
'_local_only': False,
'id': 'InvertedDoublePendulum-v1',
'nondeterministic': False,
'reward_threshold': 9100.0,
'tags': [],
'timestep_limit': 1000,
'trials': 100}
Average test return 94.6561916032 after 0 timesteps of training
Average training return 64.8650261368 after 10004 timesteps of training
Average test return 94.4441357631 after 10004 timesteps of training
Average training return 62.8453825653 after 20006 timesteps of training
Average test return 94.4936849296 after 20006 timesteps of training
Average training return 63.6538282778 after 30008 timesteps of training
Average test return 94.9548271625 after 30008 timesteps of training
Average training return 63.9039428219 after 40011 timesteps of training
Average test return 94.2871854837 after 40011 timesteps of training
Average training return 63.2686654373 after 50014 timesteps of training
Average test return 98.8836603337 after 50014 timesteps of training
Average training return 145.89652752 after 60042 timesteps of training
Average test return 295.657725759 after 60042 timesteps of training
Average training return 192.307169483 after 70066 timesteps of training
Average test return 257.732447567 after 70066 timesteps of training
Average training return 226.691339415 after 80067 timesteps of training
Average test return 473.731095604 after 80067 timesteps of training
Average training return 255.541847852 after 90069 timesteps of training
Average test return 435.084465257 after 90069 timesteps of training
Average training return 254.536465181 after 100089 timesteps of training
Average test return 630.270166648 after 100089 timesteps of training
Average training return 250.049665622 after 110105 timesteps of training
Average test return 2436.58758156 after 110105 timesteps of training
Average training return 244.717938695 after 120121 timesteps of training
Average test return 93368.0844892 after 120121 timesteps of training
And then the code simply exits, even though I asked it to train for 1 million timesteps. Do you see this behavior as well? I'm guessing there is a subtle bug in the code that allows it to score episodic returns as high as 94k.
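A quick sanity check suggests returns this large shouldn't be possible with the printed spec. This is a back-of-the-envelope sketch; the per-step reward of roughly 9.4 is inferred from the early test returns in the log above (short episodes returning ~94), not taken from the environment source:

```python
# Assumption: ~9.4 reward per alive step, inferred from the early
# test returns in the log (not read from the env source).
per_step_reward = 9.4
timestep_limit = 1000          # from the env spec printed above

# Maximum episodic return if the timestep limit were enforced:
max_return = per_step_reward * timestep_limit
print(max_return)              # ~9400, consistent with the 9100 threshold

# Steps that the observed 93368 return would require instead:
implied_steps = round(93368.0 / per_step_reward)
print(implied_steps)           # ~9933 steps, far past the 1000-step limit
```

If those numbers are roughly right, a 93k return implies episodes running ~10x past the timestep limit, i.e. the limit is not being applied.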
Hey!
The magnitude of the test return is weird. It looks like gym is not enforcing the timestep limit for some reason. Do you get normal returns if you set the timestep limit manually by running with --tmax 1000? The code exits because the reward threshold has been reached, but that should definitely be overridable and also be reflected in the logs.
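As a workaround while the limit isn't being enforced by gym, the rollout loop can truncate episodes itself. This is a minimal sketch, not the script's actual loop; `env`, `agent`, and `tmax` are placeholders for whatever the training code really uses (matching the --tmax flag mentioned above), and the toy classes exist only to exercise the function:

```python
def run_episode(env, agent, tmax=1000):
    """Roll out one episode, truncating after at most tmax steps
    even if the environment never sets done on its own."""
    obs = env.reset()
    total_return, steps = 0.0, 0
    done = False
    while not done and steps < tmax:   # enforce the limit ourselves
        action = agent.act(obs)
        obs, reward, done, _ = env.step(action)
        total_return += reward
        steps += 1
    return total_return, steps

# Quick check with a toy environment that never terminates on its own
# (hypothetical stand-ins for the real env/agent):
class ToyEnv:
    def reset(self):
        return 0.0
    def step(self, action):
        return 0.0, 1.0, False, {}     # reward 1 per step, never done

class ToyAgent:
    def act(self, obs):
        return 0.0

ret, steps = run_episode(ToyEnv(), ToyAgent(), tmax=1000)
print(ret, steps)   # 1000.0 1000 -- capped at tmax, as expected
```

With truncation in the loop itself, episodic returns stay bounded by tmax times the per-step reward regardless of what the wrapped environment reports.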