
Potential memory leak on Maniskill3?

Open · IrvingF7 opened this issue 9 months ago · 6 comments

Hi!

Thanks for open-sourcing this awesome project.

Recently, I switched to the maniskill3 branch, and I noticed that I keep getting OOM errors if I cycle through too many envs.

My workflow is roughly as follows:

I make env A, run some parallel testing, close and delete the env by calling env.close() and del env, then make env B, and rinse and repeat, all in a single process.
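
Schematically, the pattern is this (the env IDs and num_envs here are just placeholders):

    import gymnasium as gym

    for task in ["EnvA-v1", "EnvB-v1"]:  # placeholder env IDs
        env = gym.make(task, num_envs=16, obs_mode="rgb+segmentation")
        # ... parallel testing ...
        env.close()
        del env  # VRAM should be released here, but it is not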

But I noticed that the VRAM does not drop when an env is closed.

[Image: plot of VRAM usage over time; usage steps up each time a new env is created and never drops back down after env.close()]

I double-checked the timestamps, and the moments at which VRAM increases do indeed line up with the moments a new env is made.

The specific error message I got is this:

RuntimeError: Unable to create GPU parallelized camera group. If the error is about being unable to create a buffer, you are likely using too many Cameras. Either use less cameras (via less parallel envs) and/or reduce the size of the cameras. Another common cause is using a memory intensive shader, you can try using the 'minimal' shader which optimizes for GPU memory but disables some advanced functionalities. Another option is to avoid rendering with the rgb_array mode / using the human render cameras as they can be more memory intensive as they typically have higher resolutions for the purposes of visualization.

I came across this issue on ManiSkill's main repo, but it seems like the API to manually clear the assets is no longer in the codebase.
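
As an aside, I know the error suggests the 'minimal' shader; that can be selected through the same sensor_configs field I already pass to gym.make (a sketch mirroring my eval code further below, where I currently use "default"):

    env = gym.make(
        ms3_task_name,
        obs_mode="rgb+segmentation",
        num_envs=n_parallel_eval,
        sensor_configs={"shader_pack": "minimal"},  # instead of "default"
    )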

IrvingF7 · Apr 28 '25 16:04

@StoneT2000

xuanlinli17 · May 02 '25 01:05

Could try running import gc; gc.collect() after deleting the environment?

And what other code are you running besides the environment?

StoneT2000 · May 02 '25 01:05

And what version of maniskill 3 is being used? git? pypi? nightly?

StoneT2000 · May 02 '25 01:05

> Could try running import gc; gc.collect() after deleting the environment?
>
> And what other code are you running besides the environment?

Hi Stone!

Thanks for the reply. Yes, I did include gc.collect() in my code, and it doesn't change the result, which surprises me.

The maniskill3 version I was using is 3.0.0b20. I don't quite remember how I installed it, though, sorry.

The code I was running works like this: on the websocket client side, an evaluator object is spawned that queries Simpler/Maniskill3 for images and robot states, then packs this data and sends it to a websocket server.

The server is simply a VLA model that accepts the input, generates actions, and sends them back to the client/evaluator to execute.

Currently, everything runs on a single machine, but the architecture is written so that in the future I can run inference separately from the robot.
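
A stripped-down sketch of that client-to-server exchange (the address, serialization, and helper name are illustrative, not my exact code):

    import asyncio
    import json

    import numpy as np
    import websockets  # third-party: pip install websockets

    async def query_policy(element: dict) -> np.ndarray:
        """Send one observation to the VLA server; receive an action chunk."""
        # numpy arrays need to be made serializable first (real code might use msgpack)
        payload = {k: v.tolist() if isinstance(v, np.ndarray) else v
                   for k, v in element.items()}
        async with websockets.connect("ws://localhost:8765") as ws:
            await ws.send(json.dumps(payload))
            actions = json.loads(await ws.recv())
        return np.asarray(actions)  # [batch, action_step, action_dim]

    # usage: action_chunk = asyncio.run(query_policy(element))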

The evaluation code looks like this:

  # module-level imports assumed: collections, gc, os, time, gym, numpy as np,
  # torch, wandb; ManiSkill helpers BaseEnv, common,
  # get_image_from_maniskill3_obs_dict; project helpers setup_logger, images_to_video
  def evaluate(self):
      '''Run evaluation on all tasks in the task list'''

      for gradient_step in self.gradient_steps:
          self._initialize_model_client(gradient_step)
          for task_name in self.task_lists:
              self.evaluate_task(task_name)

          if self.use_wandb:
              wandb.log(self.wandb_metrics, step=int(gradient_step), commit=True)

  @override
  def evaluate_task(self, task_name):
      '''
      Evaluate a single task

      Args:
          task_name: Name of the task to evaluate

      Returns:
          success_rate: The success rate achieved on this task
      '''
      start_task_time = time.time()
      task_seed = self.seed
      # Initialize task-specific logging
      task_log_dir = self.log_dir / task_name
      video_dir = task_log_dir / "videos"
      if self.main_rank:
          os.makedirs(video_dir, exist_ok=True)

      task_logger = setup_logger(
          main_rank=self.main_rank,
          filename=task_log_dir / f"{task_name}.log" if not self.debug else None,  # log to console when debug is True
          debug=self.debug,
          name=f'{task_name}_logger'
      )

      task_logger.info(f"Task suite: {task_name}")
      self.main_logger.info(f"Task suite: {task_name}")

      # Set up environment
      ms3_task_name = self.ms3_translator.get(task_name, task_name)

      env: BaseEnv = gym.make(
          ms3_task_name,
          obs_mode="rgb+segmentation",
          num_envs=self.n_parallel_eval,
          sensor_configs={"shader_pack": "default"},
      )

      cnt_episode = 0
      eval_metrics = collections.defaultdict(list)

      # Set up receding horizon control
      action_plan = collections.deque()

      while cnt_episode < self.n_eval_episode:
          task_seed = task_seed + cnt_episode
          obs, _ = env.reset(seed=task_seed, options={"episode_id": torch.tensor([task_seed + i for i in range(self.n_parallel_eval)])})
          instruction = env.unwrapped.get_language_instruction()

          images = []
          predicted_terminated, truncated = False, False
          images.append(get_image_from_maniskill3_obs_dict(env, obs).cpu().numpy())
          elapsed_steps = 0
          while not (predicted_terminated or truncated):
              if not action_plan:
                  # action horizon is all executed
                  # Query model to get action
                  element = {
                          "observation.images.top": images[-1],
                          "observation.state": obs['agent']['eef_pos'].cpu().numpy(),
                          "task": instruction
                          }
                  action_chunk = self.client.infer(element)

                  # action chunk is of the size [batch, action_step, action_dim]
                  # but dequeue can only take something like [action_step, batch, action_dim]
                  action_plan.extend(action_chunk[:, :self.action_step, :].transpose(1, 0, 2))

              action = action_plan.popleft()
              obs, reward, terminated, truncated, info = env.step(action)
              elapsed_steps += 1
              info = common.to_numpy(info)

              truncated = bool(truncated.any()) # note that all envs truncate and terminate at the same time.
              images.append(get_image_from_maniskill3_obs_dict(env, obs).cpu().numpy())

          for k, v in info.items():
              eval_metrics[k].append(v.flatten())

          if self.pipeline_cfg.eval_cfg.recording:
              for i in range(len(images[-1])):
                  # save video. The naming is ugly but it's to follow previous naming scheme
                  success_string = "_success" if info['success'][i].item() else ""
                  images_to_video([img[i] for img in images], video_dir, f"video_{cnt_episode + i}{success_string}", fps=10, verbose=True)

          cnt_episode += self.n_parallel_eval


      mean_metrics = {k: np.mean(v) for k, v in eval_metrics.items()}
      success_rate = mean_metrics['success']
      task_eval_time = time.time() - start_task_time

      # log results
      self._log_summary(
          logger=task_logger,
          cnt_episode=cnt_episode,
          eval_time=task_eval_time,
          success_rate=success_rate,
      )

      self._log_summary(
          logger=self.main_logger,
          cnt_episode=cnt_episode,
          eval_time=task_eval_time,
          success_rate=success_rate,
      )

      if self.use_wandb:
          self.wandb_metrics[task_name] = success_rate

      env.close()
      del env
      gc.collect()
      torch.cuda.empty_cache()
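
For reference, this is roughly how I confirm the VRAM numbers around teardown (a sketch; torch.cuda.memory_allocated() only sees PyTorch's own allocations, so I query nvidia-smi instead):

    import subprocess

    def gpu_mem_used_mib(device_index: int = 0) -> int:
        """Used VRAM in MiB via nvidia-smi; unlike torch.cuda.memory_allocated(),
        this also counts allocations made outside PyTorch (e.g., the renderer's)."""
        out = subprocess.check_output(
            ["nvidia-smi", "--query-gpu=memory.used",
             "--format=csv,noheader,nounits", "-i", str(device_index)]
        )
        return int(out.decode().strip())

    # around teardown:
    #   before = gpu_mem_used_mib()
    #   env.close(); del env; gc.collect(); torch.cuda.empty_cache()
    #   delta = gpu_mem_used_mib() - before  # stays ~0, i.e., nothing is freed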

IrvingF7 · May 02 '25 02:05

I see. Could you try pip uninstall maniskill and then install mani-skill-nightly?

StoneT2000 · May 02 '25 03:05

> I see. Could you try pip uninstall maniskill and then install mani-skill-nightly?

Thanks. I will report back when I get home and have time to test. For the time being, I am using multiprocessing to speed up the ms2-based Simpler.
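
The isolation pattern looks roughly like this (a sketch; run_single_task_eval is a hypothetical helper), with the side benefit that each child process returns all of its VRAM to the OS when it exits:

    import multiprocessing as mp

    def _worker(task_name, queue):
        # import the simulator inside the child so no CUDA/SAPIEN state
        # ever lives in the parent process
        success_rate = run_single_task_eval(task_name)  # hypothetical helper
        queue.put(success_rate)

    if __name__ == "__main__":
        mp.set_start_method("spawn")  # required when children use CUDA
        for task_name in ["task_a", "task_b"]:
            queue = mp.Queue()
            proc = mp.Process(target=_worker, args=(task_name, queue))
            proc.start()
            success_rate = queue.get()  # read before join to avoid deadlocks
            proc.join()  # all VRAM held by the child is released here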

IrvingF7 · May 02 '25 11:05