Replacing gym's Mujoco envs with brax envs
Had a conversation with @jkterry1 on https://github.com/openai/gym/issues/2366, and it appears brax would also be a great candidate for replacing the mujoco envs.
To help with this transition, I made an attempt to try out brax with pytorch. Here is a basic report: https://wandb.ai/costa-huang/brax/reports/Brax-as-Pybullet-replacement--Vmlldzo5ODI4MDk. The source code is here: https://github.com/vwxyzjn/cleanrl/blob/mybranch/cleanrl/brax/readme.md
One of the biggest issues with brax adoption is env normalization:
- gym doesn't have a normalization wrapper
- sb3 has a normalization wrapper but brax does not have a sb3 vector env api
- brax's normalization is implemented in the training side (https://github.com/google/brax/blob/main/brax/training/normalization.py)
I think going forward, the best way to fix this is probably to refactor brax's training-side normalization to the environment side. In the future this will also help throughput with the JaxToTorchWrapper: otherwise, the observation goes from GPU to CPU for gym's or sb3's normalization wrapper, then back to the GPU for torch, which just doesn't make sense.
One small thing: since the brax environment directly produces the vector env, there is also no way to inject a ClipActionsWrapper(env), which may or may not have a performance impact. That said, this can be implemented on the training side with ease.
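To illustrate what environment-side normalization could look like while staying on-device, here is a minimal sketch of running observation statistics in jax. The names (NormState, init_norm, update, normalize) are made up for illustration and are not brax's actual API:

```python
from typing import NamedTuple

import jax.numpy as jnp


class NormState(NamedTuple):
    count: jnp.ndarray  # number of observations folded in so far
    mean: jnp.ndarray   # running mean per observation dimension
    m2: jnp.ndarray     # running sum of squared deviations (Welford)


def init_norm(obs_dim: int) -> NormState:
    # Start count at 1 and m2 at 1 so the initial std is ~1.
    return NormState(jnp.ones(()), jnp.zeros(obs_dim), jnp.ones(obs_dim))


def update(state: NormState, obs_batch: jnp.ndarray) -> NormState:
    # Batched Welford update; obs_batch has shape (num_envs, obs_dim).
    n = obs_batch.shape[0]
    count = state.count + n
    delta = obs_batch.mean(axis=0) - state.mean
    mean = state.mean + delta * n / count
    m2 = state.m2 + ((obs_batch - mean) * (obs_batch - state.mean)).sum(axis=0)
    return NormState(count, mean, m2)


def normalize(state: NormState, obs: jnp.ndarray, clip: float = 10.0) -> jnp.ndarray:
    std = jnp.sqrt(state.m2 / state.count) + 1e-8
    return jnp.clip((obs - state.mean) / std, -clip, clip)
```

Because everything here is a jnp op, it composes with jit and never leaves the accelerator, so a JaxToTorchWrapper could hand the normalized batch straight to torch.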
Yes, as I suggested previously, Brax seems a good option for OpenAI Gym, since it allows for GPU and TPU accelerators (training in minutes instead of hours) in addition to CPU. We can use this issue to track progress and add an itemized todo.
To recap the to-do list:
- Add suitable rendering
- Further tune observation/action spaces to make them as close as possible
- Make sure we are not reproducing the list of bugs in MuJoCo environments from Antonin Raffin that I sent you
I feel like there may have been a 4th issue, but I don't sleep very much and can no longer recall it. @erwincoumans @benelot do you remember?
One note on the suitable rendering: I feel implementing `env.render("rgb_array")` might be too expensive and counterproductive. Maybe implementing `env.render("html")` at the end of an episode is preferable.
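Something along these lines could work (a sketch based on the brax colabs; it assumes `brax.io.html.render(sys, qps)` keeps the signature used there):

```python
import jax
import jax.numpy as jnp

import brax.envs
from brax.io import html

# Roll out a short episode, collecting physics states (qp) along the way.
env = brax.envs.create("ant")
state = env.reset(rng=jax.random.PRNGKey(0))
rollout = [state.qp]
for _ in range(100):
    state = env.step(state, jnp.zeros(env.action_size))
    rollout.append(state.qp)

# Render the whole episode to an embeddable HTML visualization at the end,
# instead of paying a per-step rendering cost during training.
html_string = html.render(env.sys, rollout)
```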
They're planning to add a new rendering engine such that "rgb_array" will be suitable.
I don't know if this is the 4th feature I can't remember, but another thing we'll need to eventually deal with that I briefly discussed is action/observation space documentation for the new Gym website we're working on, in the flavor of https://www.pettingzoo.ml/classic/chess
I would like to help with this, what can I do to help?
@joaogui1 Probably nothing, at least at the moment. Right now I'm waiting on the Brax team to do some work, and for the guy who created the pybullet replacement envs to get back from vacation; this will take 4-6 weeks. If you'd like to help with gym maintenance problems in general though, please email me and we can coordinate some things ([email protected])
Got it, will wait a little then, thanks!
I'm also happy to help on this, I've spent a lot of time with the mujoco/pybullet environments at this point. Can certainly help with points 2/3 that @jkterry1 posted in this thread.
We have started working on 1) the renderer. We're looking at porting a simple technique like https://github.com/rougier/tiny-renderer to JAX as a new module in brax.io
Tuning observation/action space could start in parallel if anyone is interested. I think the steps would involve:
- reset a Gym Mujoco env (say Ant) to default state and inspect the observation space and its description
- compare to Brax Ant env and make adjustments
- step both and compare dynamic observations (e.g. contact forces)
I think the envs are already ~80% comparable, and the last 20% is just sleuthing to read the mujoco docs, and confirm the format matches. I think we can get to the point where the meaning of each observation dimension is the same in both envs, even if the dynamics are still different.
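A rough sketch of the first two steps (it assumes gym with mujoco-py and brax are both installed, and uses brax API names as of this thread, which may change):

```python
import gym
import jax
import jax.numpy as jnp
import brax.envs

# Reset the mujoco Ant and inspect its observation.
gym_env = gym.make("Ant-v2")
gym_obs = gym_env.reset()
print("mujoco obs shape:", gym_obs.shape)  # (111,) for Ant-v2

# Reset the brax Ant; its observation is smaller due to the missing
# contact entries discussed further down in this thread.
brax_env = brax.envs.create("ant")
state = brax_env.reset(rng=jax.random.PRNGKey(0))
print("brax obs shape:", state.obs.shape)

# Step both with zero actions and compare the leading qpos/qvel dimensions.
gym_obs, _, _, _ = gym_env.step(0 * gym_env.action_space.sample())
state = brax_env.step(state, jnp.zeros(brax_env.action_size))
```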
I can get that going next week. I will use Mujoco 1.5 due to this issue. It looks like the Brax environments are based off the v2 version of the Mujoco environments, so I'll start by comparing to those. Based on https://github.com/openai/gym/pull/1304 I think the v3 versions are supposed to be identical if using default args, not 100% sure that's true though.
This is so great to hear! I also have a quick update. Gym now has a normalization wrapper: https://github.com/openai/gym/pull/2387. The usage is roughly:

```python
import gym
import numpy as np

env = gym.make("HalfCheetahBulletEnv-v0")
env = gym.wrappers.RecordEpisodeStatistics(env)
env = gym.wrappers.ClipAction(env)
env = gym.wrappers.NormalizeObservation(env)
env = gym.wrappers.TransformObservation(env, lambda obs: np.clip(obs, -10, 10))
env = gym.wrappers.NormalizeReward(env)
env = gym.wrappers.TransformReward(env, lambda reward: np.clip(reward, -10, 10))
```
However, as I suggested earlier, this might not be as fast as implementing the normalization on brax's side. Another thing: directly applying these wrappers to a brax environment won't work, because of issues with jax device arrays overriding the numpy arrays inside the wrappers.
A typical example is gym.wrappers.RecordEpisodeStatistics: its episode_returns array will be cast to a jax array, which causes problems because jax arrays are immutable.
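A small illustration of the failure mode (hypothetical, but representative of what happens inside the wrapper):

```python
import jax.numpy as jnp
import numpy as np

returns = np.zeros(4)
returns[0] += 1.0        # fine: numpy arrays are mutable

# A jax array "leaks" into the wrapper's state via an arithmetic op...
returns = returns + jnp.ones(4)

# ...and the in-place updates the wrapper relies on now fail:
# returns[0] = 0.0       # TypeError: jax arrays are immutable
returns = returns.at[0].set(0.0)  # the functional jax idiom instead
```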
Ok, I was a bit busier than I expected this week, but as promised I did start comparing the ant environments this evening. Here is a notebook I was using that may be useful to anyone else who wants to compare and tweak the envs.
With regards to the observations:
- I believe all the state position and velocity information match up. For Mujoco it seems to be: z + quaternion for the torso (5), 8 joint angles, dxyz/drot (6) for torso, 8 more joint velocities, which matches exactly what brax has.
- The contact information is where big differences appear. Brax seems to be missing some internal bodies that are present in the Mujoco model, which accounts for the difference in observation size (the Brax team was already aware of this).
- I'm not sure what the ordering for the contact forces is in brax. It doesn't match what mujoco does (see the notebook linked above) and it also doesn't seem to match up with the bodies in env.sys.body_idx.keys().
With regards to rewards:
- The rewards also exclude any contact force penalty, because those forces are missing due to a bug with gym+Mujoco 2.0 (see the issue I posted above), but I think it would be best to put them back.
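For reference, the penalty being discussed is, roughly, gym Ant-v2's contact cost (paraphrased from gym/envs/mujoco/ant.py):

```python
import numpy as np

def contact_cost(cfrc_ext: np.ndarray) -> float:
    # cfrc_ext holds mujoco's external contact forces; clip each component
    # to [-1, 1], then apply a small quadratic penalty.
    return 0.5 * 1e-3 * np.sum(np.square(np.clip(cfrc_ext, -1, 1)))
```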
If the goal is to make as faithful a representation of the mujoco envs as possible (which IMO it shouldn't necessarily be), then we will at least need to address the following:
- The mj ant starts life suspended .75m in the air, the brax ant at .5m
- mj adds a relatively large amount of random noise to its initial state on reset.
- Inertial parameters for the two envs are different. Does brax have a way to infer an inertia from geometry? This is what mj does.
- No matter the ordering, the magnitudes of the force and moment are substantially different, but that may be because of the difference in mass.
- Torque limits appear different: 300 in brax vs 150 in mj (units? That would be a lot of N*m)
- These are minor, but we may want to find out which brax integrator settings are closest to an rk4 with dt = .01.
- May also want to tune friction parameters, which will probably need to be done empirically.
TLDR: For the ant, the difference in observations is in the ordering and number of contact forces. To make them match exactly we would need to reorder the existing forces and insert some dummy, zeroed elements into the observation. That said, the "missing" contact forces weren't useful in the old env, and the ordering of contacts shouldn't matter to an RL agent, so IMHO it would be enough to adjust the mass, inertia, and torque limit, add back the contact force reward/penalty, and maybe add the wider distribution to the initial state.
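If we did want exact matching, the reordering plus dummy entries could be done with a fixed index map; a hypothetical helper (the mapping itself would come from the sleuthing above):

```python
import jax.numpy as jnp

def remap_contacts(brax_cfrc: jnp.ndarray, mapping, n_mujoco_bodies: int):
    """Scatter brax's per-body contact forces into mujoco's body ordering.

    mapping[i] gives the mujoco body slot for brax body i; mujoco bodies
    with no brax counterpart are left as zeroed dummy entries.
    """
    out = jnp.zeros((n_mujoco_bodies, 6))
    out = out.at[jnp.asarray(mapping)].set(brax_cfrc.reshape(-1, 6))
    return out.reshape(-1)
```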
@vwxyzjn good to hear about the normalization wrapper. I agree that the normalization and clipping should all be done on the brax side. This makes things awkward with respect to saving and loading environments / agents, since it will make brax a special case for gym, sb3, etc. Related: I also think that if the brax envs aren't going to be extremely fast, it would be better to just use pybullet.
@vwxyzjn we recently started using a similar Wrapper concept for wrapping envs in Brax, inspired by Gym. e.g. EpisodeWrapper collects episode statistics and sets done at the episode boundary, and so on:
https://github.com/google/brax/blob/main/brax/envs/wrappers.py#L43
I don't think it would be too hard to make the brax API mirror what gym is doing, and still keep it all on device.
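For example, a gym-style wrapper stays entirely on-device as long as it only uses jnp ops; a minimal hypothetical sketch (not the actual code in the linked file):

```python
import jax.numpy as jnp

class ClipActionWrapper:
    """Duck-typed brax env wrapper that clips actions before stepping."""

    def __init__(self, env):
        self.env = env

    def reset(self, rng):
        return self.env.reset(rng)

    def step(self, state, action):
        # jnp.clip runs on the accelerator, so no host round-trip occurs.
        return self.env.step(state, jnp.clip(action, -1.0, 1.0))
```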
@sgillen this is super helpful - thanks for putting together this thorough comparison. I hear you that our envs don't need to be exactly 1:1 to MuJoCo's - that said, we'd be happy to prioritize any fixes to the differences you brought up, according to whether they:
- impact training curves significantly (e.g. we noticed that adding noise to initial states does sometimes impact training)
- produce a more pleasing gait (e.g. perhaps re-adding the contact force reward penalty)
- ... some other reason?
Of the differences you found, do you have a suggestion for which might be the most important to address?
I agree with @sgillen on the tasks, but would reorder to:
- add back in the contact force reward/penalty
- adjust the mass, inertia, and torque limit
- add the wider distribution to initial state
On 1: If we want to copy the previous env, we need it whether it helps with training or not, otherwise we diverge.
On 2: Is there any reason these were set in the brax ant env the way they are? The torque limit looks like the result of f(mass, inertia, mujoco_engine_details), so we should be able to set ones similar to mujoco's. If they cannot be adjusted exactly, I would suggest falling back to the metric of "similar learning curve". In pybullet I once looked at the metric of "similar observation distribution shape", which says something about which observational manifold the ant moves in.
On 3: This is certainly important for higher robustness of the learned policy. Especially with the humanoids, adding some noise during testing but not training easily messes them up.
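The "similar observation distribution shape" check above can be done cheaply with a random policy; a rough sketch for any gym-API env:

```python
import numpy as np

def obs_stats(env, steps: int = 10_000):
    """Roll out a random policy and return per-dimension mean/std of obs."""
    observations = []
    obs = env.reset()
    for _ in range(steps):
        obs, _, done, _ = env.step(env.action_space.sample())
        observations.append(np.asarray(obs))
        if done:
            obs = env.reset()
    stacked = np.stack(observations)
    return stacked.mean(axis=0), stacked.std(axis=0)
```

Comparing these per-dimension statistics between the two Ants should flag mismatched scales (mass, torque limits) even when the dimension meanings already line up.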
On my side, I started to play a bit with brax and built an initial version of the humanoid standup, but, being on vacation, I am not done yet. I plan to begin building a first version of all required mujoco envs in brax next week, just to see how they perform. Then we can do the same for every env as @sgillen did for ant.
Just to confirm, does the list of inconsistencies include the list of bugs in MuJoCo that we want to make sure that we aren't reproducing that I sent?
@erikfrey I agree with @benelot list on what to prioritize. They will probably impact training, making the environment slightly harder if anything, but also closer to the original. The contact reward might lead to more pleasing gaits but it's hard to say.
@jkterry1 I am not sure, can you post that list of bugs here?
@jkterry1 possibly means these (according to Antonin Raffin):
- the broken halfcheetah is well known (blog post from 2018): https://www.alexirpan.com/2018/02/14/rl-hard.html and was even more broken recently: https://twitter.com/natolambert/status/1369139391130607625
- for the walker, I thought I saw a comment from Erwin saying it was made heavier to actually avoid having it run, but I cannot find it again. I can probably find a video of the Mujoco walker running (the envs were inspired by roboschool: https://github.com/openai/roboschool). Found it! It's in the OpenAI blog post: https://openai.com/blog/roboschool/
- according to the OpenAI blog post, there was also the ant and the humanoid (but the humanoid is still quite unrealistic anyway)
- otherwise, there is the Swimmer that changed behavior (but this can be solved by using a high discount factor): https://github.com/hill-a/stable-baselines/issues/500
@benelot that's the list, thanks a ton
Can confirm that our HalfCheetah is at least not broken in the ways discussed in those blogs. In fact this is something we had to address in our paper comparing our envs to Mujoco's. See section E1 in the appendix for a brief discussion about this problem.
That said, I am quite prepared for folks to find new and interesting bugs as these envs get more attention! We'll be happy to address them when they come up :-)
We are 90% done on hopper. If someone would like to take a pass at Walker2d or Swimmer, please let me know. Otherwise we'll get to them soon.
Quick update - we now have the Hopper env, and tomorrow we will land Walker2d. We'll also add them soon to the colab with good default hparams. Other things in flight:
- We've added back the contact force penalty and shown it works well for Ant, confirming for others and then will push
- @erwincoumans is making great progress on a simple, small software renderer that we will use for env.render
- Wider distributions to initial state is also in progress
OK! We now support state to pixels for env.render:
https://github.com/google/brax/blob/main/brax/io/image.py
Please keep in mind this is CPU rendering, so better for eval rendering and other programmatic use cases, rather than training. We will move to GPU/TPU rendering in the future, which should be suitable for training.
In the coming days we'll update our colabs with an example of how to use it.
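Roughly, usage should look like this (a hedged sketch; the function name render_array and its signature are assumed from the linked brax/io/image.py and may change):

```python
import jax
import brax.envs
from brax.io import image

env = brax.envs.create("ant")
state = env.reset(rng=jax.random.PRNGKey(0))

# Render the current physics state to an RGB array on the CPU.
frame = image.render_array(env.sys, state.qp, width=320, height=240)
```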
I'm trying to make Brax/MuJoCo more apples-to-apples in their setup. I'm not sure what major differences need to be accounted for. Is there a set of operations that needs to be called on Brax to get settings as similar to MuJoCo as possible? (e.g. the normalization mentioned earlier in this issue)
Hi @slerman12, the process of making the brax environments similar to Mujoco is still ongoing, I think. This thread has some info on the major differences at this point; you can see the notebook I posted above as a starting point for comparing the environments in an "apples-to-apples" way. The normalization is not a difference by itself; the Mujoco envs don't have normalization built in. Usually training frameworks like stable baselines will normalize observations from environments, but that presents some difficulty in brax.
Per the meeting, we still need the following things before merging into Gym:
- Adding missing environments: Swimmer (Benjamin Ellenberger), Standup (Brax team), Inverted pendulum (Daniel Freeman), Inverted double pendulum (Daniel Freeman)
- Remove 0s where applicable (Brax team)
- Remove unnecessary inheritance regarding hopper (Brax team)
I have not found pusher, reacher, striker, thrower anywhere in the brax repo. I think they are required as well @jkterry1. Are they somewhere internal @cdfreeman-google?
Reacher is here: https://github.com/google/brax/blob/main/brax/envs/reacher.py
Ah, I wasn't aware of pusher, striker, thrower as they are not here: https://gym.openai.com/envs/#mujoco
BUT I do see them here: https://github.com/openai/gym/tree/master/gym/envs/mujoco
We'll look into those on the Brax side unless anyone jumps in and would like to claim them.
OK more updates:
- Wider distributions to initial state are in! We jitter qpos and qvel with the exact same amounts as the current gym mujoco envs (see the sketch after this list)
- We now have a colab that demonstrates image rendering: https://colab.sandbox.google.com/github/google/brax/blob/main/notebooks/environments.ipynb
- image rendering is now hooked up to the gym env render function for mode='rgb_array'
- humanoid standup is in progress!
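For reference, a sketch of the jitter this matches for Ant (mirroring gym's Ant-v2 reset_model; other envs use different amounts):

```python
import numpy as np

def ant_reset_jitter(init_qpos, init_qvel, rng: np.random.RandomState):
    # Ant-v2: uniform(-0.1, 0.1) noise on qpos, gaussian std-0.1 noise on qvel.
    qpos = init_qpos + rng.uniform(low=-0.1, high=0.1, size=init_qpos.shape)
    qvel = init_qvel + rng.randn(*init_qvel.shape) * 0.1
    return qpos, qvel
```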