sparse Perform each operation only in the correct formats

Based on https://github.com/scipy/scipy/issues/8162#issuecomment-344469036 I'm opening this issue in order to debate the merits and demerits of writing multiple implementations for every operation.

I'm specifically referring to the third bullet point in that comment, quoted below:

It'll remain important to give users a choice of the sparse storage format for efficiency reasons, but we'll want a design that doesn't guarantee that all operations on a given format will produce results of the same format. This is especially relevant when considering operations that produce results with different dimensions than the input, like slicing or extracting a diagonal.

I'm taking this to mean that we don't want to implement every single operation for every single format, we can convert to and return the most appropriate type.

I believe this is wise advice, on the whole. For example:

Writing/modification is best done in DOK. We could require that users do arr.asformat('dok') before writing to an array.
Elemwise is best done in CSD or COO (still have some Numba kinks to figure out for CSD). We could automatically convert to and return objects of these formats.
Reductions are best done in CSD (which basically makes them equivalent to ufunc.reduceat), but will usually return COO.
Slicing is extremely inefficient in DOK, but can be done after conversion to COO, for example. CSD slicing is also easy.
triu, tril, nonzero are best in COO, and essentially require conversion before they're actually used.

On the whole, I'd like for these conversions to be automatic. Writing is the only thing I can think of that will require explicit conversion to DOK.

Edit: Link to parallel SciPy-dev thread: https://mail.python.org/pipermail/scipy-dev/2018-May/022834.html

May 04 '18 11:05 hameerabbasi

cc @perimosocordiae @rgommers @mrocklin for thoughts, opinions and feedback.

Additionally, @perimosocordiae, if you could ping other people involved in scipy.sparse that'd be great, as this decision is a rather big and impactful one.

May 04 '18 11:05 hameerabbasi

In the current scipy.sparse module, operations have a default implementation that converts to a basic type and calls its implementation. The return value is not converted back to the original type.

I agree with the original design but can see both sides have merit. If the user wants a particular type back, he can convert it after the operation, which might be a little annoying, but if he doesn't care what type he gets back or prefers the basic type, he cannot undo the (possibly expensive) expensive conversion operation. Often, operations are chained (transpose then reshape then matrix multiply), so that many steps will burn CPU cycles simply doing and undoing conversions.

May 04 '18 12:05 drhagen

I completely agree, thanks for the perspective!

While it's certainly possible to write "native" operations for every single format, it's wasteful and usually more inefficient than performing things in the "basic" format for that operation.

It still remains to decide what to do when writing to a sparse array. Only the DOK and LIL formats are mutable, and it'll be rather impossible to change the type of an array inside __setitem__. I propose requiring to convert to DOK in this case.

May 04 '18 13:05 hameerabbasi

Also cc @pv, @ev-br.

I agree that forcing sparse methods to return the same format is ultimately going to be a performance hit, while the workarounds for users who want to keep a certain format are pretty simple.

One issue we've run into with scipy.sparse is that when sparse methods are chained, it can incur surprising conversion costs. For example, if you start with a DOK matrix a, and want to compute a.T.dot(a).nonzero():

a.T is implemented by dok_matrix, and returns DOK format
a.dot(...) requires a conversion to CSR format, and returns CSR format
- a.tocsr() requires conversion from DOK -> COO -> CSR
.dot(a) requires a conversion to CSR as well, again with the COO intermediate
.nonzero() requires a CSR -> COO conversion

In total, that makes 5 format conversions! I may be mis-remembering some internal details, but these situations do arise frequently. An expert user (who knows the implementation of the library) would see this case and optimize it fairly easily:

tmp = a.tocsr()
tmp.T.dot(tmp).nonzero()

But an ideal sparse library would make this sort of trap harder to fall into. In scipy.sparse, I've tried to add more "native" implementations of common methods, especially for COO format. Even if the COO implementation is slower than the CSR version, it's still a win when the combined a.tocsr().method() is slower than a.method().

As in my example, things get even slower when you don't have native conversion methods between all sparse formats. If you think about each format as a node in an undirected graph with an edge to each format it supports direct conversion to, it's much easier to visualize the cost of various operations.

One thing I've thought about adding to scipy.sparse is a debugging tool that logs every format conversion, to help find particularly bad cases of conversion hell. In a perfect world, you'd have lazy semantics and optimize an entire expression to eliminate unnecessary conversions completely.

May 04 '18 13:05 perimosocordiae

One thing we could do to prevent this sort of "conversion hell" is probably the following:

We define a AmbidextrousSparseArray class that is just a stub around all the other implementations. It would convert the "internal representation" of an array into a "native-ish" class before any operations.

This won't help in your case though, and I have serious doubts about how good it will be practically, but it's an idea.

May 04 '18 14:05 hameerabbasi