Potential memory leak on Maniskill3?
Hi!
Thanks for open-sourcing this awesome project.
Recently, I switched to the maniskill3 branch, and I have been getting OOM issues after switching between many envs.
My workflow is roughly as follows:
I make env A, run some parallel testing, close and delete it with env.close() and del env, then make env B, and so on, all in a single process.
But I noticed that the VRAM does not drop when an env is closed.
I double-checked the timing, and VRAM does indeed jump exactly when a new env is created.
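Concretely, the pattern is roughly this (a sketch; the env IDs, num_envs, and obs_mode are placeholders for what I actually run):

import gc

import gymnasium as gym
import torch

for env_id in ["EnvA-v1", "EnvB-v1"]:  # placeholder task ids, "rinse and repeat"
    env = gym.make(env_id, obs_mode="rgb+segmentation", num_envs=16)
    # ... run parallel testing with this env ...
    env.close()
    del env
    gc.collect()
    torch.cuda.empty_cache()
    # VRAM (watched via nvidia-smi) stays at its peak here instead of dropping,
    # and the next gym.make adds on top of it until the camera-group error below.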
The specific error message I got is this:
RuntimeError: Unable to create GPU parallelized camera group. If the error is about being unable to create a buffer, you are likely using too many Cameras. Either use less cameras (via less parallel envs) and/or reduce the size of the cameras. Another common cause is using a memory intensive shader, you can try using the 'minimal' shader which optimizes for GPU memory but disables some advanced functionalities. Another option is to avoid rendering with the rgb_array mode / using the human render cameras as they can be more memory intensive as they typically have higher resolutions for the purposes of visualization.
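For reference, the 'minimal' shader the error suggests can be requested through sensor_configs when making the env. This is just a sketch of that option (variable names are placeholders); it would lower per-camera memory, but it would not explain why memory from already-closed envs never comes back:

env = gym.make(
    task_name,                       # placeholder for the actual task id
    obs_mode="rgb+segmentation",
    num_envs=n_parallel_envs,        # placeholder
    sensor_configs={"shader_pack": "minimal"},  # "minimal" trades rendering features for GPU memory
)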
I came across this issue on ManiSkill's main repo, but it seems like the API to manually clear the assets is no longer in the codebase.
@StoneT2000
Could you try running import gc; gc.collect() after deleting the environment?
And what other code are you running besides the environment?
And what version of ManiSkill 3 is being used? git? PyPI? nightly?
Hi Stone!
Thanks for the reply. Yes, I did include gc.collect() in my code, and it doesn't change the result, which surprises me.
The maniskill3 version I was using is 3.0.0b20. I don't quite remember how I installed it, though, sorry.
The code I was running works like this: on the websocket client side, an evaluator object is spawned and queries Simpler/ManiSkill3 to get images and robot states, then packs this data and sends it to a websocket server.
The server is simply a VLA model that accepts input, generates actions, and sends them back to the client/evaluator to execute.
Currently, everything runs on a single machine, but the architecture is written so that in the future I can run inference separately from the robot.
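In spirit, the client side is just something like this (a heavily simplified sketch; PolicyClient, the URI, and the pickle-over-websocket format are stand-ins for my actual implementation, which is what self.client.infer calls below):

import pickle
from websockets.sync.client import connect  # assumes the `websockets` package (>= 12)

class PolicyClient:
    """Sends one observation packet to the VLA server and returns its action chunk."""

    def __init__(self, uri: str = "ws://localhost:8765"):
        self.ws = connect(uri)

    def infer(self, element: dict):
        # element holds the latest image, eef state, and language instruction
        self.ws.send(pickle.dumps(element))
        # the server replies with an action chunk of shape [batch, action_step, action_dim]
        return pickle.loads(self.ws.recv())

    def close(self):
        self.ws.close()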
The code looks like this
def evaluate(self):
    '''Run evaluation on all tasks in the task list'''
    for gradient_step in self.gradient_steps:
        self._initialze_model_client(gradient_step)
        for task_name in self.task_lists:
            self.evaluate_task(task_name)
        if self.use_wandb:
            wandb.log(self.wandb_metrics, step=int(gradient_step), commit=True)

@override
def evaluate_task(self, task_name):
    '''
    Evaluate a single task

    Args:
        task_name: Name of the task to evaluate

    Returns:
        success_rate: The success rate achieved on this task
    '''
    start_task_time = time.time()
    task_seed = self.seed

    # Initialize task-specific logging
    task_log_dir = self.log_dir / task_name
    video_dir = task_log_dir / "videos"
    if self.main_rank:
        os.makedirs(video_dir, exist_ok=True)
    task_logger = setup_logger(
        main_rank=self.main_rank,
        filename=task_log_dir / f"{task_name}.log" if not self.debug else None,  # log to console when debug is True
        debug=self.debug,
        name=f'{task_name}_logger',
    )
    task_logger.info(f"Task suite: {task_name}")
    self.main_logger.info(f"Task suite: {task_name}")

    # Set up environment
    ms3_task_name = self.ms3_translator.get(task_name, task_name)
    env: BaseEnv = gym.make(
        ms3_task_name,
        obs_mode="rgb+segmentation",
        num_envs=self.n_parallel_eval,
        sensor_configs={"shader_pack": "default"},
    )

    cnt_episode = 0
    eval_metrics = collections.defaultdict(list)

    # Set up receding horizon control
    action_plan = collections.deque()

    while cnt_episode < self.n_eval_episode:
        task_seed = task_seed + cnt_episode
        obs, _ = env.reset(
            seed=task_seed,
            options={"episode_id": torch.tensor([task_seed + i for i in range(self.n_parallel_eval)])},
        )
        instruction = env.unwrapped.get_language_instruction()
        images = []
        predicted_terminated, truncated = False, False
        images.append(get_image_from_maniskill3_obs_dict(env, obs).cpu().numpy())
        elapsed_steps = 0

        while not (predicted_terminated or truncated):
            if not action_plan:
                # The previous action horizon is fully executed; query the model for a new chunk
                element = {
                    "observation.images.top": images[-1],
                    "observation.state": obs['agent']['eef_pos'].cpu().numpy(),
                    "task": instruction,
                }
                action_chunk = self.client.infer(element)
                # action_chunk has shape [batch, action_step, action_dim],
                # but the deque needs [action_step, batch, action_dim]
                action_plan.extend(action_chunk[:, :self.action_step, :].transpose(1, 0, 2))
            action = action_plan.popleft()
            obs, reward, terminated, truncated, info = env.step(action)
            elapsed_steps += 1
            info = common.to_numpy(info)
            truncated = bool(truncated.any())  # note that all envs truncate and terminate at the same time
            images.append(get_image_from_maniskill3_obs_dict(env, obs).cpu().numpy())

        for k, v in info.items():
            eval_metrics[k].append(v.flatten())

        if self.pipeline_cfg.eval_cfg.recording:
            for i in range(len(images[-1])):
                # Save video. The naming is ugly, but it follows the previous naming scheme.
                success_string = "_success" if info['success'][i].item() else ""
                images_to_video([img[i] for img in images], video_dir, f"video_{cnt_episode + i}{success_string}", fps=10, verbose=True)

        cnt_episode += self.n_parallel_eval

    mean_metrics = {k: np.mean(v) for k, v in eval_metrics.items()}
    success_rate = mean_metrics['success']
    task_eval_time = time.time() - start_task_time

    # Log results
    self._log_summary(logger=task_logger,
                      cnt_episode=cnt_episode,
                      eval_time=task_eval_time,
                      success_rate=success_rate)
    self._log_summary(logger=self.main_logger,
                      cnt_episode=cnt_episode,
                      eval_time=task_eval_time,
                      success_rate=success_rate)
    if self.use_wandb:
        self.wandb_metrics[task_name] = success_rate

    env.close()
    del env
    gc.collect()
    torch.cuda.empty_cache()
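For what it's worth, a quick check along these lines (a sketch, not part of the code above) is how I would separate memory held by PyTorch's caching allocator from memory held elsewhere on the device, e.g. by the simulator/renderer:

import torch

# PyTorch's own view: live tensors plus memory cached by its allocator.
print("torch allocated:", torch.cuda.memory_allocated() / 1e9, "GB")
print("torch reserved: ", torch.cuda.memory_reserved() / 1e9, "GB")

# The driver's view of the whole device, which also counts allocations
# made outside PyTorch (e.g. by the renderer).
free_bytes, total_bytes = torch.cuda.mem_get_info()
print("device used:    ", (total_bytes - free_bytes) / 1e9, "GB")

If the device-level number stays high after env.close() while the torch numbers are small, then torch.cuda.empty_cache() has nothing to release and the memory is being held outside PyTorch.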
I see. Could you try pip uninstall maniskill and then install mani-skill-nightly?
Thanks. I will report back when I get home and have time to test. For the time being, I am using multiprocessing to speed up the ms2-based Simpler.
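The multiprocessing setup I am using in the meantime is roughly this pattern (a sketch; Evaluator and run_one_task are stand-ins for my actual classes), with each task running in its own worker process:

import multiprocessing as mp

def run_one_task(task_name, result_queue):
    # Stand-in: build an evaluator inside the worker and run a single task.
    evaluator = Evaluator()                          # hypothetical evaluator class
    result_queue.put((task_name, evaluator.evaluate_task(task_name)))

if __name__ == "__main__":
    ctx = mp.get_context("spawn")                    # fresh CUDA/Vulkan context per worker
    task_names = ["task_a", "task_b"]                # placeholder task names
    queue = ctx.Queue()
    workers = [ctx.Process(target=run_one_task, args=(name, queue)) for name in task_names]
    for w in workers:
        w.start()
    results = dict(queue.get() for _ in workers)     # drain results before joining
    for w in workers:
        w.join()                                     # each worker's GPU memory is freed on exit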