# [Proposal] Using numpy instead of torch for nonzero operation

### Proposal
In several places in the code base, `tensor.nonzero()` is called to find the indices of non-zero elements. While reviewing the `nonzero` operation in PyTorch, I wrote a small script to benchmark it. It appears that the torch implementation of this function is slower than numpy's at the tensor sizes we typically work with.
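For reference, the pattern in question usually looks something like the following (a minimal illustration; `reset_buf` is a hypothetical buffer name, not a quote from the code base):

```python
import torch

# hypothetical per-environment reset flags (one entry per environment instance)
reset_buf = torch.zeros(4096, dtype=torch.bool)
reset_buf[::7] = True

# the operation under discussion: indices of the environments that need a reset
env_ids = reset_buf.nonzero(as_tuple=False).squeeze(-1)
print(env_ids[:5])  # tensor([ 0,  7, 14, 21, 28])
```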
### Motivation
Snippet:
```python
import numpy as np
import torch
import torch.utils.benchmark as benchmark

for size in [48000, 16000, 8000, 4000, 2000, 1]:
    # pretty-print structure
    print("\n Size of the tensor:", size)
    with torch.inference_mode():
        # create a random boolean tensor (~50% non-zero entries)
        my_tensor = torch.rand(size) > 0.5
        # check the speed of the non-zero operation on torch with CPU
        timer_nonzero = benchmark.Timer(
            stmt="torch.nonzero(my_tensor)", globals={"my_tensor": my_tensor.to("cpu")}
        )
        time_value = timer_nonzero.blocked_autorange().median
        print("\tTime for non-zero (cpu, torch)\t :", time_value / 1e-6, "us")
        # check the speed of the non-zero operation on torch with cuda:0
        timer_nonzero = benchmark.Timer(
            stmt="torch.nonzero(my_tensor)", globals={"my_tensor": my_tensor.to("cuda:0")}
        )
        time_value = timer_nonzero.blocked_autorange().median
        print("\tTime for non-zero (cuda:0, torch):", time_value / 1e-6, "us")
        # check the speed of the non-zero operation on numpy
        timer_nonzero = benchmark.Timer(
            stmt="np.nonzero(my_tensor)", globals={"my_tensor": my_tensor.to("cpu").numpy(), "np": np}
        )
        time_value = timer_nonzero.blocked_autorange().median
        print("\tTime for non-zero (numpy)\t\t :", time_value / 1e-6, "us")
```
Output with torch 2.2.2 and numpy 1.26.0 (median times, rounded to two decimal places):

| Tensor size | torch (cpu) [µs] | torch (cuda:0) [µs] | numpy [µs] |
|------------:|-----------------:|--------------------:|-----------:|
|       48000 |           353.94 |               35.95 |     110.92 |
|       16000 |           112.41 |               33.56 |      22.30 |
|        8000 |            51.13 |               32.91 |       4.78 |
|        4000 |            22.05 |               33.15 |       2.71 |
|        2000 |             8.05 |               32.10 |       1.79 |
|           1 |             2.08 |               23.39 |       0.66 |
numpy consistently outperforms torch for this operation: it beats the torch CPU implementation at every tested size, and it beats the CUDA implementation at every size except the largest (48000). Since most RL environments scale to at most 16000 parallel instances, we should consider switching to numpy as the default for these checks.
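A minimal sketch of what such a switch could look like, assuming the call sites want the indices back as a torch tensor on the original device (the helper name `nonzero_via_numpy` is hypothetical, not an existing IsaacLab API):

```python
import numpy as np
import torch

def nonzero_via_numpy(mask: torch.Tensor) -> torch.Tensor:
    """Hypothetical helper: non-zero indices of a 1-D mask, computed via numpy.

    Note: ``mask.cpu()`` incurs a device-to-host copy for CUDA tensors, so a
    fair comparison against ``torch.nonzero`` must include that transfer.
    """
    indices = np.nonzero(mask.cpu().numpy())[0]
    return torch.from_numpy(indices).to(mask.device)

# usage on a 1-D boolean mask
mask = torch.rand(4096) > 0.5
env_ids = nonzero_via_numpy(mask)
```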
### Additional context
Related issue: https://github.com/pytorch/pytorch/issues/14848
### Checklist
- [x] I have checked that there is no similar issue in the repo (required)
### Acceptance Criteria
- [ ] Switch to numpy for this operation, since it appears more performant at typical environment counts