[Bug Report] CUDA error when using body_physx_view.apply_forces_and_torques_at_position in multi-GPU training
When I run the demo with python -m torch.distributed.run --nnodes=1 --nproc_per_node=2 source/standalone/workflows/rl_games/train.py --task=Isaac-Quadcopter-Direct-v0 --headless --distributed, I get the CUDA illegal memory access error shown in the log below.
Steps to reproduce
Run python -m torch.distributed.run --nnodes=1 --nproc_per_node=2 source/standalone/workflows/rl_games/train.py --task=Isaac-Quadcopter-Direct-v0 --headless --distributed directly.
====================broadcasting parameters
2024-06-15 06:54:19 [20,730ms] [Error] [omni.physx.tensors.plugin] CUDA error: an illegal memory access was encountered: ../../../extensions/runtime/source/omni.physx.tensors/plugins/gpu/GpuRigidBodyView.cpp: 835
2024-06-15 06:54:19 [20,730ms] [Error] [omni.physx.tensors.plugin] CUDA error: an illegal memory access was encountered: ../../../extensions/runtime/source/omni.physx.tensors/plugins/gpu/ThrustUtils.h: 40
Traceback (most recent call last):
File "/data_new/newhome/makai1/code/GES-low-level-IsaacLab/source/standalone/workflows/rl_games/train.py", line 150, in <module>
main()
File "/data_new/newhome/makai1/code/GES-low-level-IsaacLab/source/standalone/workflows/rl_games/train.py", line 142, in main
runner.run({"train": True, "play": False, "sigma": None})
File "/home/makai1/miniconda3/envs/isaaclab/lib/python3.10/site-packages/rl_games/torch_runner.py", line 133, in run
self.run_train(args)
File "/home/makai1/miniconda3/envs/isaaclab/lib/python3.10/site-packages/rl_games/torch_runner.py", line 116, in run_train
agent.train()
File "/home/makai1/miniconda3/envs/isaaclab/lib/python3.10/site-packages/rl_games/common/a2c_common.py", line 1317, in train
step_time, play_time, update_time, sum_time, a_losses, c_losses, b_losses, entropies, kls, last_lr, lr_mul = self.train_epoch()
File "/home/makai1/miniconda3/envs/isaaclab/lib/python3.10/site-packages/rl_games/common/a2c_common.py", line 1181, in train_epoch
batch_dict = self.play_steps()
File "/home/makai1/miniconda3/envs/isaaclab/lib/python3.10/site-packages/rl_games/common/a2c_common.py", line 751, in play_steps
self.obs, rewards, self.dones, infos = self.env_step(res_dict['actions'])
File "/home/makai1/miniconda3/envs/isaaclab/lib/python3.10/site-packages/rl_games/common/a2c_common.py", line 518, in env_step
obs, rewards, dones, infos = self.vec_env.step(actions)
File "/data_new/newhome/makai1/code/GES-low-level-IsaacLab/source/extensions/omni.isaac.lab_tasks/omni/isaac/lab_tasks/utils/wrappers/rl_games.py", line 328, in step
return self.env.step(action)
File "/data_new/newhome/makai1/code/GES-low-level-IsaacLab/source/extensions/omni.isaac.lab_tasks/omni/isaac/lab_tasks/utils/wrappers/rl_games.py", line 241, in step
obs_dict, rew, terminated, truncated, extras = self.env.step(actions)
File "/data_new/newhome/makai1/.local/share/ov/pkg/isaac-sim-4.0.0/exts/omni.isaac.ml_archive/pip_prebundle/gymnasium/wrappers/order_enforcing.py", line 56, in step
return self.env.step(action)
File "/data_new/newhome/makai1/code/GES-low-level-IsaacLab/source/extensions/omni.isaac.lab/omni/isaac/lab/envs/direct_rl_env.py", line 276, in step
self.scene.write_data_to_sim()
File "/data_new/newhome/makai1/code/GES-low-level-IsaacLab/source/extensions/omni.isaac.lab/omni/isaac/lab/scene/interactive_scene.py", line 289, in write_data_to_sim
articulation.write_data_to_sim()
File "/data_new/newhome/makai1/code/GES-low-level-IsaacLab/source/extensions/omni.isaac.lab/omni/isaac/lab/assets/articulation/articulation.py", line 189, in write_data_to_sim
self._body_physx_view.apply_forces_and_torques_at_position(
File "/data_new/newhome/makai1/.local/share/ov/pkg/isaac-sim-4.0.0/extsPhysics/omni.physics.tensors/omni/physics/tensors/impl/api.py", line 926, in apply_forces_and_torques_at_position
if not self._backend.apply_forces_and_torques_at_position(force_data_desc, torque_data_desc, position_data_desc, indices_desc, is_global):
RuntimeError: copy_if failed to synchronize: cudaErrorIllegalAddress: an illegal memory access was encountered
2024-06-15 06:54:20 [20,984ms] [Error] [omni.physx.fabric.plugin] CUDA error: an illegal memory access was encountered: ../../../extensions/runtime/source/omni.physx.fabric/plugins/DirectGpuHelper.cpp: 328
2024-06-15 06:54:20 [20,984ms] [Error] [omni.physx.fabric.plugin] CUDA error: an illegal memory access was encountered: ../../../extensions/runtime/source/omni.physx.fabric/plugins/DirectGpuHelper.cpp: 331
2024-06-15 06:54:20 [20,984ms] [Error] [omni.physx.fabric.plugin] CUDA error: an illegal memory access was encountered: ../../../extensions/runtime/source/omni.physx.fabric/plugins/DirectGpuHelper.cpp: 334
2024-06-15 06:54:20 [20,984ms] [Error] [omni.physx.fabric.plugin] CUDA error: an illegal memory access was encountered: ../../../extensions/runtime/source/omni.physx.fabric/plugins/DirectGpuHelper.cpp: 337
2024-06-15 06:54:20 [20,984ms] [Error] [omni.physx.fabric.plugin] CUDA error: an illegal memory access was encountered: ../../../extensions/runtime/source/omni.physx.fabric/plugins/DirectGpuHelper.cpp: 340
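The RuntimeError above surfaces at a synchronization point ("copy_if failed to synchronize"), and CUDA reports kernel faults asynchronously, so the Python frame in the traceback is not necessarily where the bad access happened. A minimal sketch for re-running the reproduction with synchronous kernel launches follows. CUDA_LAUNCH_BLOCKING is a standard CUDA environment variable; REPRO_CMD and blocking_env are hypothetical names used only for this illustration, and whether this narrows the fault down inside the PhysX plugins is an assumption.

```python
import os
import shlex

# The reproduction command from this report, split into arguments.
REPRO_CMD = [
    "python", "-m", "torch.distributed.run",
    "--nnodes=1", "--nproc_per_node=2",
    "source/standalone/workflows/rl_games/train.py",
    "--task=Isaac-Quadcopter-Direct-v0", "--headless", "--distributed",
]


def blocking_env(base=None):
    """Return a copy of an environment with synchronous CUDA launches enabled.

    `base` defaults to the current process environment. With
    CUDA_LAUNCH_BLOCKING=1, a faulting kernel is reported at its own launch
    instead of at a later synchronization call.
    """
    env = dict(os.environ if base is None else base)
    env["CUDA_LAUNCH_BLOCKING"] = "1"
    return env


if __name__ == "__main__":
    # Print the equivalent shell command line for inspection.
    print("CUDA_LAUNCH_BLOCKING=1 " + shlex.join(REPRO_CMD))
```

To actually launch it from the repository root, the pieces can be passed to subprocess.run(REPRO_CMD, env=blocking_env()). Expect the run to be noticeably slower, since every kernel launch then blocks.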
System Info
- Commit: 3f96602eef96a70e680ca184685e7972948c8f40
- Isaac Sim Version: 4.0.0
- OS: Ubuntu 22.04
- GPU: RTX 4090
- CUDA: 12.2
- GPU Driver: 535.171.04
I think this bug has the same cause as the one reported here: https://forums.developer.nvidia.com/t/multiple-isaac-sim-containers-on-one-gpu-fails-with-cuda-illegal-memory-access-in-omni-physx-tensors-plugin/268134/10
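Since the linked thread concerns several simulations sharing one GPU, one thing that may be worth ruling out here is that a rank's buffers live on a different GPU than the one assigned to it. This is an assumption, not something the log confirms. torch.distributed.run sets LOCAL_RANK for each worker process; the helper names and labels below are hypothetical, and the device strings would come from something like str(tensor.device).

```python
import os


def expected_device(env=None):
    """Return the CUDA device string a worker is expected to use.

    Reads LOCAL_RANK, which torch.distributed.run exports per worker.
    A missing value is treated as rank 0 (single-process run).
    """
    env = os.environ if env is None else env
    local_rank = int(env.get("LOCAL_RANK", "0"))
    return f"cuda:{local_rank}"


def find_device_mismatches(named_devices, env=None):
    """List (label, actual, expected) for entries not on the expected device.

    `named_devices` maps a free-form label (e.g. "forces") to a device string.
    """
    want = expected_device(env)
    return [
        (name, dev, want)
        for name, dev in named_devices.items()
        if dev != want
    ]


if __name__ == "__main__":
    # Hypothetical device strings, as rank 1 might report them.
    report = find_device_mismatches(
        {"forces": "cuda:1", "torques": "cuda:1", "indices": "cuda:0"},
        env={"LOCAL_RANK": "1"},
    )
    for name, actual, want in report:
        print(f"{name}: on {actual}, expected {want}")
```

An empty result only shows that the inspected buffers agree with LOCAL_RANK; it does not rule out a mismatch inside the simulator itself.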
A dev just tested this example with 4 nodes, 8 GPUs each, and didn't hit any errors :( Would a restart help?
Hi @mk2001233 - Are you still seeking assistance with this issue? Have you tried it with the latest Isaac Sim and Isaac Lab releases?