[Bug Report] gym.wrappers.RecordVideo keeps increasing memory usage
Describe the bug
With a custom DirectRLEnv, enabling video recording makes the CPU memory usage keep increasing until it saturates and the program gets killed. With this part removed, memory usage stays stable during training:
```python
if args_cli.video == True:
    video_kwargs = {
        "video_folder": os.path.join(log_dir, "videos", "train"),
        "step_trigger": lambda step: step % args_cli.video_interval == 0,
        "video_length": args_cli.video_length,
        "disable_logger": True,
    }
    print("[INFO] Recording videos during training.")
    print_dict(video_kwargs, nesting=4)
    env = gym.wrappers.RecordVideo(env, **video_kwargs)
```
Steps to reproduce
```python
with open("./Src/skrl_config.yaml") as stream:
    try:
        agent_cfg = yaml.safe_load(stream)
    except yaml.YAMLError as exc:
        print(exc)
# create isaac environment
env_cfg = RobotEnvCfg()  # custom DirectRLEnv environment
"""Train with skrl agent."""
# override configurations with non-hydra CLI arguments
env_cfg.scene.num_envs = args_cli.num_envs if args_cli.num_envs is not None else env_cfg.scene.num_envs
env_cfg.sim.device = args_cli.device if args_cli.device is not None else env_cfg.sim.device
# multi-gpu training config
if args_cli.distributed:
    env_cfg.sim.device = f"cuda:{app_launcher.local_rank}"
# max iterations for training
if args_cli.max_iterations:
    agent_cfg["trainer"]["timesteps"] = args_cli.max_iterations * agent_cfg["agent"]["rollouts"]
agent_cfg["trainer"]["close_environment_at_exit"] = False
# configure the ML framework into the global skrl variable
if args_cli.ml_framework.startswith("jax"):
    skrl.config.jax.backend = "jax" if args_cli.ml_framework == "jax" else "numpy"
# randomly sample a seed if seed = -1
if args_cli.seed == -1:
    args_cli.seed = random.randint(0, 10000)
# set the agent and environment seed from command line
# note: certain randomization occur in the environment initialization so we set the seed here
agent_cfg["seed"] = args_cli.seed if args_cli.seed is not None else agent_cfg["seed"]
env_cfg.seed = agent_cfg["seed"]
# specify directory for logging experiments
log_root_path = os.path.join("logs", "skrl", agent_cfg["agent"]["experiment"]["directory"])
log_root_path = os.path.abspath(log_root_path)
print(f"[INFO] Logging experiment in directory: {log_root_path}")
# specify directory for logging runs: {time-stamp}_{run_name}
log_dir = datetime.now().strftime("%Y-%m-%d_%H-%M-%S") + f"_{algorithm}_{args_cli.ml_framework}"
print(f"Exact experiment name requested from command line {log_dir}")
if agent_cfg["agent"]["experiment"]["experiment_name"]:
    log_dir += f'_{agent_cfg["agent"]["experiment"]["experiment_name"]}'
# set directory into agent config
agent_cfg["agent"]["experiment"]["directory"] = log_root_path
agent_cfg["agent"]["experiment"]["experiment_name"] = log_dir
# update log_dir
log_dir = os.path.join(log_root_path, log_dir)
# dump the configuration into log-directory
dump_yaml(os.path.join(log_dir, "params", "env.yaml"), env_cfg)
dump_yaml(os.path.join(log_dir, "params", "agent.yaml"), agent_cfg)
dump_pickle(os.path.join(log_dir, "params", "env.pkl"), env_cfg)
dump_pickle(os.path.join(log_dir, "params", "agent.pkl"), agent_cfg)
# get checkpoint path (to resume training)
resume_path = retrieve_file_path(args_cli.checkpoint) if args_cli.checkpoint else None
# create isaac environment
env = RobotEnv(cfg=env_cfg, render_mode="rgb_array", headless=args_cli.headless)
# env = gym.make(args_cli.task, cfg=env_cfg, render_mode="rgb_array" if args_cli.video else None)
# convert to single-agent instance if required by the RL algorithm
if isinstance(env.unwrapped, DirectMARLEnv) and algorithm in ["ppo"]:
    env = multi_agent_to_single_agent(env)
# adjust camera resolution and pose
env_cfg.viewer.resolution = (1920, 1080)
env_cfg.viewer.eye = (1.0, 1.0, 1.0)
env_cfg.viewer.lookat = (0.0, 0.0, 0.0)
# wrap for video recording
if args_cli.video == True:
    video_kwargs = {
        "video_folder": os.path.join(log_dir, "videos", "train"),
        "step_trigger": lambda step: step % args_cli.video_interval == 0,
        "video_length": args_cli.video_length,
        "disable_logger": True,
    }
    print("[INFO] Recording videos during training.")
    print_dict(video_kwargs, nesting=4)
    env = gym.wrappers.RecordVideo(env, **video_kwargs)
# wrap around environment for skrl
env = SkrlVecEnvWrapper(env, ml_framework=args_cli.ml_framework)  # same as: `wrap_env(env, wrapper="auto")`
# configure and instantiate the skrl runner
# https://skrl.readthedocs.io/en/latest/api/utils/runner.html
runner = Runner(env, agent_cfg)
# load checkpoint (if specified)
if resume_path:
    print(f"[INFO] Loading model checkpoint from: {resume_path}")
    runner.agent.load(resume_path)
# run training
runner.run()
# close the simulator
env.close()
```
System Info
- Commit:
- Isaac Sim Version: 4.5
- OS: Ubuntu 22.04.5
- GPU: RTX 4070
- CUDA: 12.4
- GPU Driver: 550.120
Additional context
Checklist
- I have checked that there is no similar issue in the repo (required)
What values did you set for args_cli.video_length and args_cli.video_interval?
Saving long videos frequently will consume RAM quickly, since the previous video's frame buffers need some time to be cleared.
Try setting args_cli.video_length to around 500 and args_cli.video_interval to around 3000.
My video_length and video_interval values are 2000 and 1000 (I use the defaults):
```python
parser.add_argument("--video_length", type=int, default=2000, help="Length of the recorded video (in steps).")
parser.add_argument("--video_interval", type=int, default=1000, help="Interval between video recordings (in steps).")
```
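With the default video_length of 2000 and the 1920x1080 viewer resolution set in the training script, a rough back-of-the-envelope estimate of the per-clip frame buffer (a sketch, assuming RecordVideo holds raw RGB uint8 frames in RAM until moviepy writes the clip):

```python
# Rough per-clip frame-buffer estimate (assumption: raw RGB uint8 frames are
# buffered in RAM until the clip is written; resolution/length taken from above).
width, height = 1920, 1080        # env_cfg.viewer.resolution in the training script
video_length = 2000               # default --video_length
bytes_per_frame = width * height * 3
print(f"~{video_length * bytes_per_frame / 1e9:.1f} GB per clip")  # ~12.4 GB
```

So a single clip at the default settings can need on the order of 12 GB before it is written out.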
What's your RAM size?
My system has 32 GB of RAM and 137 GB of swap memory.
That's too little. Try reducing video_interval and video_length; it should probably work out for you.
Also, if you need to check longer videos, you can always record them by running inference on your checkpoints.
I've reduced the video_interval and video_length values to 50 and 500, but over time memory keeps being allocated; more slowly, but the usage still keeps increasing.
Sorry, my bad: increase video_interval to 2000-4000 and decrease video_length to 300.
If that doesn't work, set video_interval to the maximum number of training iterations for your algorithm, i.e. at the end of training you will have only one video saved.
Then monitor the RAM usage; if it still increases, there's a bug in some other part of your code, such as saving tensors to disk.
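A minimal sketch of the "one video at the end of training" idea, reusing the video_kwargs structure from the report. The total-step count is taken from the skrl trainer config set earlier in the script and is only an illustration of the trigger, not an official recipe:

```python
# Sketch: fire the recording trigger exactly once, shortly before training ends.
# total_steps comes from the trainer config set earlier; adjust to your setup.
total_steps = agent_cfg["trainer"]["timesteps"]
video_kwargs = {
    "video_folder": os.path.join(log_dir, "videos", "train"),
    # trigger only once, so at most one clip is ever buffered and written
    "step_trigger": lambda step: step == max(total_steps - args_cli.video_length, 0),
    "video_length": args_cli.video_length,
    "disable_logger": True,
}
env = gym.wrappers.RecordVideo(env, **video_kwargs)
```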
I got the same problem running the following command with the newest Isaac Lab and Isaac Sim versions:
`./isaaclab.sh -p scripts/reinforcement_learning/rsl_rl/train.py --num_envs 4096 --task MyTask --headless --video --video_length 800`
But with the previous Isaac Lab version I never had memory issues.
I have the same issue. I also tried changing --video (green curve) to --enable_cameras (orange curve) to train visual RL policies without saving the video, but the memory still increases linearly. I haven't observed such a dramatic increase for state-based policies (without camera rendering).
I've upgraded my system to 64 GB of RAM but the problem is still there... Whatever I change, the memory usage keeps increasing.
Same issue here. Increasing the interval and decreasing the length did not help, because right at the start of training (the 0th iteration) the memory usage gets so high that the Linux OOM killer starts killing processes.
Is there any solution?
I tried disabling the robot's visual components to make rendering lighter, but it still didn't help.
I have 64 GB of RAM, BTW.
Hi,
This is because the moviepy reference to the video sequence doesn't get released properly. I can manually release the sequence, and the memory issue disappears, by adding a line around line 408 (see the first image) in the file python3.10/site-packages/gymnasium/wrappers/rendering.py.
If your workflow allows you to edit the gymnasium code, you can do what is shown there. If that is not possible, you can `import gc` and call `gc.collect()` in your wrapper's step() function so the garbage collector forcibly releases the memory.
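For reference, a minimal sketch of that gc.collect() workaround as a thin subclass of the wrapper. This is not an official Gymnasium or Isaac Lab API; it only forces a collection once a clip has just been written, to keep the per-step overhead low:

```python
import gc

import gymnasium as gym


class GCRecordVideo(gym.wrappers.RecordVideo):
    """Workaround sketch: force a garbage-collection pass from step() so the
    frame buffers of a finished clip are actually released."""

    def step(self, action):
        was_recording = self.recording
        result = super().step(action)
        # a clip was just written out -> drop the lingering moviepy references
        if was_recording and not self.recording:
            gc.collect()
        return result


# drop-in replacement for the wrapper used in the training script:
# env = GCRecordVideo(env, **video_kwargs)
```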
It seems like Gymnasium is aware of this issue (https://github.com/Farama-Foundation/Gymnasium/pull/1378) and has added a more proper fix. I hope this provides some context on the most reasonable workaround to use before we can pip install the newer version.