Perf: Reduce `StridedMemoryView` construction time
Currently it takes 3.4 to 3.45 us (depending on whether stream ordering is used) to create a memory view object:
In [4]: x = cp.empty((23, 4))
In [7]: %timeit s = StridedMemoryView(x, -1)
3.4 μs ± 8.7 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
In [8]: %timeit s = StridedMemoryView(x, 1)
3.45 μs ± 14.6 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
which could be a bit expensive in a tight loop. We should try to reduce it down to 1 us or O(100) ns if possible.
cc @shwina for vis
Of the 3.4 us run time, a rough breakdown of the major components is:
- calling CuPy's `__dlpack_device__()` and `__dlpack__()`: 1.23 us
- constructing Python objects like shape, strides, and dtype: ~1 us
So it seems to me we are talking about premature optimization here... We could defer the Python object construction, but then 50% of the run time is out of cuda.core's control.
cc @NaderAlAwar for vis
It'd be interesting to see if lazily populating the attributes of `StridedMemoryView` could help reduce the construction time. Often we don't need the full information, just the pointer and shape/strides.
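To make the lazy-population idea concrete, here is a minimal pure-Python sketch. `FakeExporter` and `LazyView` are hypothetical stand-ins invented for illustration, not the actual cuda.core or CuPy classes, and `functools.cached_property` is just one way to defer the work. The view stores only the pointer at construction and builds `shape` and `strides` on first access.

```python
from functools import cached_property


class FakeExporter:
    """Hypothetical stand-in for an array object; not CuPy or cuda.core."""

    def __init__(self, ptr, shape):
        self.ptr = ptr
        self.shape = shape
        self.stride_requests = 0  # counts how often strides are computed

    def compute_strides(self):
        # C-contiguous strides, counted in elements
        self.stride_requests += 1
        strides = []
        step = 1
        for extent in reversed(self.shape):
            strides.append(step)
            step *= extent
        return tuple(reversed(strides))


class LazyView:
    """Sketch of a view that eagerly stores only the pointer."""

    def __init__(self, exporter):
        self._exporter = exporter
        self.ptr = exporter.ptr  # cheap and always needed

    @cached_property
    def shape(self):
        # Built on first access, then cached on the instance
        return tuple(self._exporter.shape)

    @cached_property
    def strides(self):
        # Built on first access, then cached on the instance
        return self._exporter.compute_strides()


exporter = FakeExporter(ptr=0xDEADBEEF, shape=(23, 4))
view = LazyView(exporter)  # no shape/strides objects created yet
```

With this layout, a caller that only reads `view.ptr` never pays for the shape or strides tuples, while repeated reads of `view.strides` compute the tuple once. Whether the same approach would carry over to the real `StridedMemoryView` is not something this sketch establishes.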