
Reproducing agent performance in MovementBandits

Open ysaibhargav opened this issue 8 years ago • 3 comments

Hi, great work on the paper and code! I am working on a project that builds on top of MLSH. We implemented our own GPU-optimized version of the algorithm based on your MPI-based code. In both our implementation and your code for MovementBandits, we observe that the two sub-policies end up learning the same strategy: within a single run, both move to just one of the bandits (consistently the same one).

Here are the parameters from my run:

mpirun -np 120 python3 main.py --task MovementBandits-v0 --num_subs 2 --macro_duration 10 --num_rollouts 2000 --warmup_time 9 --train_time 1 --replay False MovementBandits

Additionally, I tried both optimizer step sizes, 3e-4 (as mentioned in the paper) and 3e-5 (the default argument in the code), and varied the seed value of 1401 that is hard-coded in your main file.
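
For completeness, the only change we make when varying the seed is to read it from the environment instead of the hard-coded constant; a minimal sketch (the MLSH_SEED variable name and the exact seeding calls are ours, not part of the original main.py):

```python
import os
import random

import numpy as np
import tensorflow as tf

# Read the seed from an environment variable instead of the hard-coded 1401,
# so several seeds can be tried without editing main.py each time.
seed = int(os.environ.get("MLSH_SEED", "1401"))
random.seed(seed)
np.random.seed(seed)
tf.set_random_seed(seed)  # TF 1.x graph-level seed
```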

I modified master.py to log some additional information, such as the current iteration number and the real goal chosen by randomizeCorrect (our fork). Here is a snippet from one of the runs (the logging change itself is sketched just after the snippet):

Mini ep 10, goal 1, iteration 30: global: 18.60333333333333, local: 42.5
Mini ep 3, goal 0, iteration 37: global: 18.60333333333333, local: 2.375
Mini ep 1, goal 0, iteration 30: global: 18.60333333333333, local: 2.725
Mini ep 5, goal 1, iteration 35: global: 18.60333333333333, local: 43.075
Mini ep 3, goal 0, iteration 37: global: 18.60333333333333, local: 1.05
Mini ep 2, goal 0, iteration 38: global: 18.60333333333333, local: 3.125
Mini ep 4, goal 1, iteration 36: global: 18.60333333333333, local: 44.375
Mini ep 1, goal 0, iteration 30: global: 18.60333333333333, local: 3.275
Mini ep 4, goal 1, iteration 36: global: 18.60333333333333, local: 43.65
Mini ep 7, goal 0, iteration 33: global: 18.60333333333333, local: 5.575
Mini ep 1, goal 0, iteration 30: global: 18.60333333333333, local: 3.975
Mini ep 8, goal 0, iteration 32: global: 18.60333333333333, local: 4.0
Mini ep 2, goal 0, iteration 38: global: 18.60333333333333, local: 0.475
Mini ep 5, goal 1, iteration 35: global: 18.60333333333333, local: 41.4
Mini ep 3, goal 0, iteration 37: global: 18.60333333333333, local: 0.975
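
For reference, the logging change in our fork is essentially just a formatted print of these quantities; a minimal sketch (the variable names are ours and may not match those in master.py):

```python
def log_mini_ep(mini_ep, real_goal, iteration, global_reward, local_reward):
    # Mirrors the format of the snippet above; called once per mini episode
    # inside the rollout loop in our fork of master.py.
    print("Mini ep %d, goal %d, iteration %d: global: %s, local: %s"
          % (mini_ep, real_goal, iteration, global_reward, local_reward))
```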

We observe similar behavior with our own implementation (confirmed by visualizing the sub-policies with render: both cause the agent to move to the same disc throughout the entire training run).
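
One way we check this without rendering is to roll out each sub-policy on its own and look at where the agent ends up relative to the two discs. A rough sketch of what we do (the observation indexing and the act_fn wrapper are specific to our fork, so treat them as assumptions rather than the original code):

```python
import gym
import numpy as np

def subpolicy_final_distances(act_fn, episodes=20, horizon=50):
    # Roll out a single sub-policy with no master policy on top and return
    # the mean final distance from the agent to each of the two discs.
    # act_fn(obs) -> action should wrap one sub-policy network; extracting
    # it from the MLSH code is not shown here.
    env = gym.make("MovementBandits-v0")  # assumes the env is already registered
    dists = []
    for _ in range(episodes):
        obs = env.reset()
        for _ in range(horizon):
            obs, _, done, _ = env.step(act_fn(obs))
            if done:
                break
        # Assumption about the observation layout: agent (x, y) first,
        # then the two disc positions; adjust to the actual layout.
        agent = np.asarray(obs[:2])
        discs = np.asarray(obs[2:6]).reshape(2, 2)
        dists.append(np.linalg.norm(discs - agent, axis=1))
    # If both sub-policies give a small distance to the same disc and a
    # large one to the other, they have collapsed onto one strategy.
    return np.mean(dists, axis=0)
```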

We are attempting to reproduce the results from the paper (Figure 4, page 6), where the agent learns to reach rewards of around 40 after a few gradient updates. Please let us know whether we are running the right hyperparameter configuration, and which seeds to use with the original codebase, to observe that behavior; this would greatly help with our research. Thanks!

ysaibhargav avatar Apr 26 '18 20:04 ysaibhargav

In my case, the two sub-policies learn to aim for different goal points, but the master policy leads the agent to a single goal.
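
To make that concrete, I count which sub-policy the master picks at each macro step during evaluation; a small sketch (how the choices are collected from the MLSH code is up to you):

```python
from collections import Counter

def master_choice_histogram(choices):
    # choices: list of sub-policy indices selected by the master policy,
    # one per macro step, collected over episodes with different real goals.
    # A histogram that stays heavily skewed toward one index even when the
    # goal changes means the master policy has collapsed onto one sub-policy.
    hist = Counter(choices)
    total = sum(hist.values())
    return {idx: count / total for idx, count in sorted(hist.items())}
```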

chaonan99 avatar Feb 19 '19 20:02 chaonan99

I reproduced the results too, but it takes many more samples to converge than reported in the paper. Has anyone else encountered this?

SiyuanLee avatar Oct 23 '19 00:10 SiyuanLee

@SiyuanLee I am also trying to reproduce the results given in the paper. I directly ran the command from the README for AntBandits. After over 1400 iterations it has not converged and does not seem to improve at all. Did you observe high sensitivity to random seeds, and were you able to reproduce AntBandits?
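
For what it's worth, here is roughly how I am checking seed sensitivity: rerun the same README command under several seeds and compare the learning curves. This is only a sketch; it assumes main.py has been changed to read the seed from an MLSH_SEED environment variable (in the original code the value 1401 is hard-coded), and the actual README command still needs to be pasted in:

```python
import os
import shlex
import subprocess

# Paste the full AntBandits command from the README here.
README_CMD = "..."

for seed in (0, 1401, 2718, 31337):
    run_env = dict(os.environ, MLSH_SEED=str(seed))  # hypothetical env var
    subprocess.run(shlex.split(README_CMD), env=run_env, check=True)
```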

jhejna avatar Oct 24 '19 00:10 jhejna