cuda-python Perf: Reduce `StridedMemoryView` construction time

Currently it takes 3.4 to 3.45 us to create a memory view object, depending on whether stream ordering is used:

In [4]: x = cp.empty((23, 4))

In [7]: %timeit s = StridedMemoryView(x, -1)
3.4 μs ± 8.7 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)

In [8]: %timeit s = StridedMemoryView(x, 1)
3.45 μs ± 14.6 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)

This could be a bit expensive in a tight loop. We should try to bring it down to 1 us, or to the order of 100 ns if possible.

cc @shwina for vis

Feb 14 '25 03:02 leofang

Of the 3.4 us run time, a rough breakdown of the major components is:

calling CuPy's __dlpack_device__() and __dlpack__(): 1.23 us
constructing Python objects like shape, strides, and dtype: ~1 us

So it seems to me we are talking about premature optimization here... We could defer the Python object construction, but then about 50% of the remaining run time (the CuPy DLPack calls) would still be out of cuda.core's control.
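For anyone wanting to reproduce this kind of breakdown, here is a rough sketch of timing the two DLPack entry points separately with `timeit`. `FakeArray` and `per_call_ns` are hypothetical names made up for illustration; with CuPy installed, `x` would be a real `cp.empty((23, 4))` instead, and the numbers would of course differ.

```python
import timeit


def per_call_ns(func, number=50_000, repeat=5):
    """Return the best-of-`repeat` average time per call of `func`, in ns."""
    timer = timeit.Timer(func)
    best = min(timer.repeat(repeat=repeat, number=number))
    return best / number * 1e9


class FakeArray:
    """Hypothetical stand-in for a CuPy array (assumption, not CuPy code).

    It only mimics the two DLPack protocol methods so the sketch runs
    without a GPU.
    """

    def __dlpack_device__(self):
        # (device_type, device_id); 2 is kDLCUDA in the DLPack enum.
        return (2, 0)

    def __dlpack__(self, stream=None):
        # A real producer returns a PyCapsule; a plain object stands in here.
        return object()


x = FakeArray()
breakdown = {
    "__dlpack_device__": per_call_ns(x.__dlpack_device__),
    # The lambda adds a small, constant call overhead to this measurement.
    "__dlpack__": per_call_ns(lambda: x.__dlpack__(stream=-1)),
}
for name, ns in breakdown.items():
    print(f"{name}: {ns:.0f} ns per call")
```

Subtracting these per-call costs from the total construction time gives an estimate of how much is left for cuda.core itself to optimize.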

Feb 18 '25 14:02 leofang

cc @NaderAlAwar for vis

Jul 15 '25 17:07 leofang

It'd be interesting to see if lazily populating the attributes of StridedMemoryView could help reduce the construction time. Often we don't need the full information, just the pointer and shape/strides.
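A minimal sketch of what lazy population could look like, assuming a pure-Python class. `LazyView` and its constructor arguments are hypothetical and are not the cuda.core implementation. The real class may be structured differently (for example as an extension type without an instance `__dict__`), in which case `functools.cached_property` would not apply directly and the caching would have to be done by hand.

```python
from functools import cached_property


class LazyView:
    """Hypothetical view that defers building Python objects until needed."""

    def __init__(self, ptr, raw_shape, raw_strides):
        # Construction only stores references; no tuples are built here.
        self.ptr = ptr
        self._raw_shape = raw_shape
        self._raw_strides = raw_strides

    @cached_property
    def shape(self):
        # Built on first access, then cached in the instance __dict__.
        return tuple(self._raw_shape)

    @cached_property
    def strides(self):
        # In this sketch, None means "no explicit strides were provided".
        if self._raw_strides is None:
            return None
        return tuple(self._raw_strides)


view = LazyView(ptr=0, raw_shape=[23, 4], raw_strides=None)
assert "shape" not in vars(view)  # nothing materialized at construction
print(view.shape)                 # (23, 4)
assert "shape" in vars(view)      # cached after the first access
```

The trade-off is that the first access to each attribute pays the construction cost, so this only helps callers that touch a subset of the attributes.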

Jul 25 '25 15:07 leofang