# [Proposal] Using numpy instead of torch for nonzero operation

### Proposal
In several places in the code base, `tensor.nonzero()` is called to find the indices of non-zero elements. While reviewing the `nonzero` operation in PyTorch, I wrote a small script to benchmark it. It appears that the torch implementation of this function is slower than numpy's at the tensor sizes we typically work with.
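For reference, the pattern in question usually looks something like the following (a minimal illustration; `reset_buf` is a hypothetical buffer name, not a quote from the code base):

```python
import torch

# hypothetical per-environment reset flags (one entry per environment instance)
reset_buf = torch.zeros(4096, dtype=torch.bool)
reset_buf[::7] = True

# the operation under discussion: indices of the environments that need a reset
env_ids = reset_buf.nonzero(as_tuple=False).squeeze(-1)
print(env_ids[:5])  # tensor([ 0,  7, 14, 21, 28])
```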
### Motivation
Snippet:
```python
import numpy as np
import torch
import torch.utils.benchmark as benchmark

for size in [48000, 16000, 8000, 4000, 2000, 1]:
    # pretty-print structure
    print("\n Size of the tensor:", size)
    with torch.inference_mode():
        # create a random boolean tensor (~50% non-zero entries)
        my_tensor = torch.rand(size) > 0.5
        # check the speed of the non-zero operation on torch with CPU
        timer_nonzero = benchmark.Timer(
            stmt="torch.nonzero(my_tensor)", globals={"my_tensor": my_tensor.to("cpu")}
        )
        time_value = timer_nonzero.blocked_autorange().median
        print("\tTime for non-zero (cpu, torch)\t :", time_value / 1e-6, "us")
        # check the speed of the non-zero operation on torch with cuda:0
        timer_nonzero = benchmark.Timer(
            stmt="torch.nonzero(my_tensor)", globals={"my_tensor": my_tensor.to("cuda:0")}
        )
        time_value = timer_nonzero.blocked_autorange().median
        print("\tTime for non-zero (cuda:0, torch):", time_value / 1e-6, "us")
        # check the speed of the non-zero operation on numpy
        timer_nonzero = benchmark.Timer(
            stmt="np.nonzero(my_tensor)", globals={"my_tensor": my_tensor.to("cpu").numpy(), "np": np}
        )
        time_value = timer_nonzero.blocked_autorange().median
        print("\tTime for non-zero (numpy)\t\t :", time_value / 1e-6, "us")
```
Output with torch 2.2.2 and numpy 1.26.0 (median times, rounded to two decimal places):

| Tensor size | torch (cpu) [µs] | torch (cuda:0) [µs] | numpy [µs] |
|------------:|-----------------:|--------------------:|-----------:|
|       48000 |           353.94 |               35.95 |     110.92 |
|       16000 |           112.41 |               33.56 |      22.30 |
|        8000 |            51.13 |               32.91 |       4.78 |
|        4000 |            22.05 |               33.15 |       2.71 |
|        2000 |             8.05 |               32.10 |       1.79 |
|           1 |             2.08 |               23.39 |       0.66 |
numpy consistently outperforms torch for this operation: it beats the torch CPU implementation at every tested size, and it beats the CUDA implementation at every size except the largest (48000). Since most RL environments scale to at most 16000 parallel instances, we should consider switching to numpy as the default for these checks.
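A minimal sketch of what such a switch could look like, assuming the call sites want the indices back as a torch tensor on the original device (the helper name `nonzero_via_numpy` is hypothetical, not an existing IsaacLab API):

```python
import numpy as np
import torch

def nonzero_via_numpy(mask: torch.Tensor) -> torch.Tensor:
    """Hypothetical helper: non-zero indices of a 1-D mask, computed via numpy.

    Note: ``mask.cpu()`` incurs a device-to-host copy for CUDA tensors, so a
    fair comparison against ``torch.nonzero`` must include that transfer.
    """
    indices = np.nonzero(mask.cpu().numpy())[0]
    return torch.from_numpy(indices).to(mask.device)

# usage on a 1-D boolean mask
mask = torch.rand(4096) > 0.5
env_ids = nonzero_via_numpy(mask)
```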
### Additional context
Related issue: https://github.com/pytorch/pytorch/issues/14848
### Checklist
- [x] I have checked that there is no similar issue in the repo (required)
### Acceptance Criteria
- [ ] Switch to numpy for this operation, since it appears more performant at typical environment counts