Support Everything that XArray Expects
https://github.com/bolt-project/bolt/issues/58
- [x] single argument ufuncs (sin, exp, etc.) and ufunc-like functions (pd.isnull, notnull, astype, around, isclose)
- [x] broadcasting binary operations (e.g., ndarray arithmetic)
- [x] three argument version of where (preferably with broadcasting)
- [x] aggregation methods
  - [x] max/min/sum/prod
  - [x] argmax/argmin/std/var
  - [x] nan-skipping aggregations (e.g., nanmin, nanmax, nanprod, nansum)
- [x] indexing with integer arrays, booleans, slices, None
- [x] transpose
- [ ] indexing:
  - [x] basic indexing (`int`/`slice`)
  - [x] outer indexing for a single array
  - [x] outer indexing (`int`, `slice`, and 1d integer arrays separately applied to each axis)
  - [ ] vectorized indexing (integer arrays with broadcasting, like NumPy)
- [x] broadcast_to (NumPy 1.10)
- [x] concatenate and stack (NumPy 1.10)
I updated the check-list (apparently I'm an "Owner" for pydata) to drop "insert/fancy indexing" and add a three-item checklist for indexing instead.
Do we have partial XArray compatibility or does XArray fail with sparse arrays outright?
@hameerabbasi Not yet, see https://github.com/pydata/xarray/issues/1375. But sparse is close enough that it might be worth trying soon.
From a usability perspective, the most helpful additional feature would be nan-skipping aggregations. I know sparse will probably rarely be used with NaNs, but xarray's aggregation methods default to skipna=True so this would be valuable for consistency.
In addition to the direct uses, xarray uses where and outer/vectorized indexing internally primarily for alignment with join='outer' (e.g., in concat or merge). This sort of alignment would require making sparse arrays dense, unless they have a fill value of NaN rather than 0. So these operations are less essential than they might otherwise seem.
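To illustrate the alignment being described (standard xarray behavior with dense arrays; nothing sparse-specific here):

import xarray as xr

a = xr.DataArray([1, 2], dims='x', coords={'x': [0, 1]})
b = xr.DataArray([3, 4], dims='x', coords={'x': [1, 2]})
# join='outer' is the default: the result has x = [0, 1, 2],
# with NaN filled in wherever an input lacks that coordinate.
print(xr.concat([a, b], dim='y'))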
I just tested with your example code: NaN-skipping aggregations use ufuncs with `reduce`, so they should already be supported in theory.
>>> class A(np.ndarray):
...     def __array_ufunc__(self, *args, **kwargs):
...         print('ufunc', args, kwargs)
...
>>> np.nansum(np.arange(5).view(A))
ufunc (<ufunc 'add'>, 'reduce', A([0, 1, 2, 3, 4])) {'axis': None, 'dtype': None, 'keepdims': False}
Although I have absolutely no idea how to discern that this was NaN-skipping from the args/kwargs.
Ah, it was a dtype issue.
>>> np.nansum(np.arange(5, dtype=np.float).view(A))
ufunc (<ufunc 'isnan'>, '__call__', A([0., 1., 2., 3., 4.])) {}
ufunc (<ufunc 'add'>, 'reduce', A([0., 1., 2., 3., 4.])) {'axis': None, 'dtype': None, 'keepdims': False}
So we're good to go on those already. :-)
Edit: Concrete example:
>>> ar = sparse.DOK((2, 2))
>>> ar[1, 1] = np.nan
>>> s = sparse.COO(ar)
>>> np.nansum(s)
0.0
FWIW this is a great demonstration of the value of NumPy's use of protocols. It would be great if we could get an alternative ndarray implementation, sparse, to work with a complex downstream user of NumPy code, XArray, without either library having to explicitly know about the other. cc @njsmith
I was also considering sparse arrays with different fill values. However, some operations (such as `dot`) blow up in complexity, while others are fairly easy to support.
NumPy's nan-aggregations work, but only by converting sparse arrays into numpy arrays, inside `np.lib.nanfunctions._replace_nan`:
import numpy as np
import sparse
class NoncoercibleCOO(sparse.COO):
    def __array__(self, *args, **kwargs):
        raise Exception('cannot convert to numpy')
ar = sparse.DOK((2, 2))
ar[1, 1] = np.nan
s = NoncoercibleCOO(ar)
np.nansum(s)
This results in:
---------------------------------------------------------------------------
Exception Traceback (most recent call last)
<ipython-input-13-481c3bc38f6d> in <module>()
9 ar[1, 1] = np.nan
10 s = NoncoercibleCOO(ar)
---> 11 np.nansum(s)
/usr/local/lib/python3.6/dist-packages/numpy/lib/nanfunctions.py in nansum(a, axis, dtype, out, keepdims)
579
580 """
--> 581 a, mask = _replace_nan(a, 0)
582 return np.sum(a, axis=axis, dtype=dtype, out=out, keepdims=keepdims)
583
/usr/local/lib/python3.6/dist-packages/numpy/lib/nanfunctions.py in _replace_nan(a, val)
62
63 """
---> 64 a = np.array(a, subok=True, copy=True)
65
66 if a.dtype == np.object_:
<ipython-input-13-481c3bc38f6d> in __array__(self, *args, **kwargs)
4 class NoncoercibleCOO(sparse.COO):
5 def __array__(self, *args, **kwargs):
----> 6 raise Exception('cannot convert to numpy')
7
8 ar = sparse.DOK((2, 2))
Exception: cannot convert to numpy
So this really isn't a good solution yet :).
You might actually view this as the flip side of implementing protocols (like `__array__`): you get default implementations, but they aren't always good ones and it's not always obvious when that's the case.
Is this already reported upstream in NumPy?
Hmm. So for this, what we really want is something like: implement `nansum` using `where=`, and make reductions support `where=`?
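For reference, `ufunc.reduce` did later grow a `where=` mask argument (NumPy 1.17+), which captures this idea. A minimal sketch:

import numpy as np

a = np.array([1.0, np.nan, 2.0])
# Skip NaNs by masking them out of the reduction (NumPy >= 1.17);
# initial= supplies the value used where everything is masked.
total = np.add.reduce(a, where=~np.isnan(a), initial=0.0)
print(total)  # 3.0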
> I was also considering sparse arrays with different fill values. However, some operations (such as `dot`) blow up in complexity, while others are fairly easy to support.
I think this could be quite interesting indeed. NaN (which pandas's SparseArray uses) is the most obvious other default value to support. Even `dot` could be made to work if you make a nan-skipping version (`nandot`, `nanmatmul`?).
> Is this already reported upstream in NumPy?
I don't know if this really qualifies as a bug, so much as an unintended consequence. I think the easiest fix would be to replace np.copyto() in _replace_nan with a protocol dispatching version of the three argument variant of np.where().
A simpler option would be to add skipnan= in ufunc reduce and accumulate methods.
Edit: And reduceat.
Edit: Or maybe in ufuncs themselves. ufunc(NaN, NaN) = NaN, otherwise, return the non-NaN item.
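A minimal sketch of the suggested fix, assuming a three-argument `where` that dispatches through the array protocols (this is not NumPy's actual implementation; `_replace_nan` is internal API):

import numpy as np

def _replace_nan_via_where(a, val):
    # Dispatches through ufuncs/protocols instead of np.copyto(),
    # so array-likes such as COO could intercept it.
    mask = np.isnan(a)  # a ufunc, overridable via __array_ufunc__
    return np.where(mask, val, a), mask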
> I think this could be quite interesting indeed. NaN (which pandas's SparseArray uses) is the most obvious other default value to support. Even `dot` could be made to work if you make a nan-skipping version (`nandot`, `nanmatmul`?).
I meant supporting arbitrary fill values. This would make many dense operations automatically possible, e.g. y = x + 5, y would have a fill value of 5.
You might, for example, do x[x == 0] = np.nan (please note, multidimensional boolean indexing not currently supported) and it would work.
> I meant supporting arbitrary fill values. This would make many dense operations automatically possible, e.g. y = x + 5, y would have a fill value of 5.
Yes, I see the elegance in that.
NaN has a nice property (like zero) that NaN * anything -> NaN. This could make some operations easier to implement if the general fill value is not viable.
> NaN has a nice property (like zero) that NaN * anything -> NaN. This could make some operations easier to implement if the general fill value is not viable.
The fill value is viable, and pretty easy to implement. I would just need to make a decorator that could be used to filter out operations that require a zero fill value, and could be used like this:
@zero_fill_value
def dot(a, b):
    ...
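A rough sketch of what such a decorator might look like; the name `zero_fill_value` comes from the comment above, but the body is an assumption:

def zero_fill_value(func):
    # Reject sparse arguments whose fill value isn't zero, since the
    # wrapped operation's algorithm assumes zeros outside the stored data.
    def wrapper(*args, **kwargs):
        for arg in args:
            if getattr(arg, 'fill_value', 0) != 0:
                raise ValueError(
                    f'{func.__name__} requires a zero fill value')
        return func(*args, **kwargs)
    return wrapper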
I recently raised an issue of nan-aggregation in numpy/numpy#10456, but it looks a little hard to implement nan-aggregations without copy.
The main issue I see with the three-argument version of where is three-way broadcasting of sparse arrays. You have to match 7 combinations of "Is it in this one's dimensions but not the other one's?"
If someone can come up with an easier way to do N-way broadcasting of sparse arrays rather than just repeating them over and over... I'd be all for it. Broadcasts in Numpy are views... Here they're separate objects (repeats), with all the memory and performance losses that come with that.
Edit: Also, the performance speedups when an operation returns zero whenever one side is zero (like `and` and `multiply`) are significant.
Edit 2: Currently, I've optimised _elemwise_binary so that it doesn't have to repeat them. That's hard to do for N-way broadcasts, which is the main issue here.
I just realized there is a complicated (yet, hiding in plain sight) way to recursively reduce an N-way broadcast down into a 2-way/1-way (without losing out on any performance benefits). It will also simplify the code hugely. I'll try it today and tomorrow, maybe I'll come up with something genius.
> The main issue I see with the three-argument version of where is three-way broadcasting of sparse arrays. You have to match 7 combinations of "Is it in this one's dimensions but not the other one's?"
In practice broadcasting for where is often trivial, with one or both of the second and third arguments as scalars, e.g., `where(x < 0, np.nan, x)`. But it's nice to support broadcasting for completeness, even if it will be slower.
The broadcasting is turning out to be more complex than I anticipated (it took most of last weekend).
- On the plus side:
  - I've got an algorithm down for general N-ary broadcasting.
  - It's no worse in big-O for any case (multiplication for `N` arrays will sidestep most of the computation).
  - The code is generally more elegant.
- On the down side:
  - It has to do an exponential number of matches (exponential in the number of inputs, concretely `2^N - 1` for `N` inputs), but that's unavoidable for a general elemwise function, with or without broadcasting.
    - For this reason, it's better to do `x * y * z` rather than `sparse.elemwise(lambda x, y, z: x*y*z, x, y, z)`.
    - `where` will require 7 matches when all 3 inputs are `COO`, but none in the trivial case. However, it'll benefit from skips. In the case you described, it'll act exactly as a binary function, requiring only 3 matches. (Scalars are not counted in the total inputs and are `functools.partial`-d away.)
  - I had to give up some performance on mixed `ndarray`-`COO` combinations. I'm working on it; it might be possible to put this back in.
    - Is it worth supporting this? I'm considering dropping it entirely and requiring a conversion to `COO`. It will also put roadblocks in the way of fill values.
  - A bit of caching is still required to get this optimal.
@hameerabbasi what exactly is involved in the "match" you're proposing?
My initial thought is that an algorithm for this could simultaneously iterate over all input arguments exactly once each. Assuming sorted coordinates, this looks something like:
# warning: untested!
iter_coords = [iter(inp.coords) for inp in inputs]
iter_datas = [iter(inp.data) for inp in inputs]
indices = [next(it, None) for it in iter_coords]
result_coords = []
result_data = []
num_finished = 0
while num_finished < len(inputs):
    # None marks an exhausted input, so skip those when picking the minimum
    current_index = min(i for i in indices if i is not None)
    current_data = [next(it_data) if current_index == index else inp.fill_value
                    for it_data, index, inp in zip(iter_datas, indices, inputs)]
    result_coords.append(current_index)
    # in reality, we probably save an explicit indexer array in each input's data,
    # using -1 for fill-values, and then compute the result data once by applying
    # it to 1D vectors
    result_data.append(func(*current_data))
    next_indices = []
    for it, index in zip(iter_coords, indices):
        if current_index == index:
            index = next(it, None)
            if index is None:
                num_finished += 1
        next_indices.append(index)
    indices = next_indices
This would run in O(NK) time, where `N = sum(input.coords.size for input in inputs)` and `K = len(inputs)`. Of course, it would need to be implemented in something fast (e.g., Cython, Numba, C++) to be viable.
Since my last comment was a mishmash of incoherent half-formed thoughts, I'm re-commenting.
This is much simpler than what I had in mind, and can be easily extended to a broadcasting scenario. The main problem I see with something like this is the following:
Consider two arrays a and b. Here, a.shape = (10**5,), and b.shape = (10**5, 10**5). Also, a.nnz = b.nnz = 1. Now suppose I'm computing a * b. We currently optimize this by noticing that func(0, b.data) = 0 and func(a.data, 0) = 0. Therefore, we compute the "full matches", and don't compute the "partial matches". If you were to calculate a * b and a + b in this situation, you would see the speed difference.
If this optimization can somehow be ported to your method (I don't see how, unfortunately, while keeping broadcasting), I'd be perfectly willing to write up a Cython equivalent of it.
To answer your original question, a "match" is a calculation of where input 1 is nonzero, input 2 is zero, and so on... for every possible combination.
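To make the counting concrete, the `2^N - 1` zero/nonzero combinations for `N` inputs can be enumerated like this (illustration only, not library code):

from itertools import product

N = 3  # e.g. the three-argument where
combos = [c for c in product([False, True], repeat=N) if any(c)]
print(len(combos))  # 7 == 2**3 - 1
for combo in combos:
    # True marks inputs contributing nonzeros in this match; the
    # all-False case is skipped, since it stays at the fill value.
    print(combo)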
> Consider two arrays `a` and `b`. Here, `a.shape = (10**5,)`, and `b.shape = (10**5, 10**5)`. Also, `a.nnz = b.nnz = 1`. Now suppose I'm computing `a * b`. We currently optimize this by noticing that `func(0, b.data) = 0` and `func(a.data, 0) = 0`. Therefore, we compute the "full matches", and don't compute the "partial matches". If you were to calculate `a * b` and `a + b` in this situation, you would see the speed difference.
It seems like the solution we need here is a "sparse broadcast" function, which only repeats non-zero indices in the result. The algorithm would look something like:
- Find all indices that are non-zero for each dimension on one or more arrays.
- When broadcasting, only fill in these indices.

"Sparse" vs "regular" broadcasting could be chosen based on the nature of the operator being applied (e.g., `*` vs `+`); a rough sketch follows.
That's (roughly) what I'm doing with my matching setup. I do the following (in pseudocode):
iterate through all possible combinations of zero/nonzero inputs:
    match nonzero inputs' coordinates (find coordinates that match)
    find func(corresponding_data, zeros)
    filter zeros from func's output (and corresponding coordinates) (this does the optimization I describe)
    match the matched coordinates against every zero input and filter those out
    broadcast leftover coords/data
    add coords, data to a list
concatenate all coords and data lists
Edit: The matching setup is on a branch on my fork, it isn't in the main codebase, and it isn't in working condition yet. It needs more work.
Edit 2: Choosing it based on the operator is still troublesome for some edge cases involving np.inf or np.nan.
I don't think it's necessary to iterate through every possible combination of zero/nonzero inputs. Maybe something like:
- Based on the operator, pick a version of broadcasting to do ("dense" broadcasting like NumPy for `+`, or "sparse" broadcasting that only repeats non-zero elements for `*`).
- Figure out the target shape and sparsity structure across all the inputs.
- "Broadcast" each input to the target shape/sparsity. This requires copying data, but should only add a constant factor overhead for the operation.
- All broadcast inputs now have the same sparsity structure, so applying an element-wise NumPy function is as simple as applying it to the data on each array, e.g., `func(*[inp.data for inp in broadcast_inputs])`.
If you are concerned about the (constant factor) overhead for explicit broadcasting in cases like 2 * x, you could add a shortcut that converts these operators into "unary" operators that are only applied to a single sparse array.
What I'm more concerned about is:
- The maintenance overhead of this approach. For example, fill values will be affected by this (it won't generalize to arbitrary fill values).
- If there's a `nan` or `inf` in there, the values won't match NumPy (because `np.nan * 0 = np.nan` and `np.inf * 0 = np.nan`), so we need "dense" broadcasting in this case. To fix this, we would have to make the "switch" between the two approaches more complex.
- It is guaranteed not to kick in for user-defined functions, for example `sparse.elemwise(lambda x, y: x * y, x, y)`.
- It won't kick in for all cases, even for us, such as `np.where(x < 0, 0, x)`.
Some things that could be made independent of the implementation:
- Constants or 0-d arrays are already `functools.partial`-d away, by a custom wrapped function in my implementation. Even `np.where(x < 0, np.nan, x)` would be treated as a binary function equivalent to `lambda a, b: np.where(a, np.nan, b)` with `a = x < 0` and `b = x` in my implementation, and the cost would be the same as a binary function. Although, of course, we could further bring down the cost by doing `sparse.elemwise(lambda x: np.where(x < 0, np.nan, x), x)` (which would be treated as unary). Of course, this could be ported to any implementation, so it is not a downside/upside.
- `np.where(x < 0, np.nan, 5)` or similar would already be treated as a unary function (zero matches!).
To be perfectly clear, I would prefer your approach much more (it is undeniably more performant in the "dense" broadcasting case) if these edge-cases etc. could be nicely handled automatically.
> For example, fill values will be affected by this (it won't generalize to arbitrary fill values).
The property you need here is op(fill_value, x) == fill_value for all x (or op(x, fill_value) == fill_value). In general this only holds for particular combinations of op and fill_value, e.g.,
- 0 * x -> 0
- x * 0 -> 0
- 0 / x -> 0
- 1 ** x -> 1
- Almost any operation with NaN -> NaN
- Various other identities with inf and -inf
So sure, the way this works is fill-value specific, but you could store the list of these identities and operations in a table.
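Such a table might look like the following hedged sketch (the keys and helper are assumptions, not an existing API):

import numpy as np

# op(fill_value, x) == fill_value for all (finite) x.
LEFT_IDENTITIES = {
    (np.multiply, 0): 0,  # 0 * x -> 0
    (np.divide, 0): 0,    # 0 / x -> 0 (for x != 0)
    (np.power, 1): 1,     # 1 ** x -> 1
}

def output_fill(op, fill_value):
    # NaN annihilates nearly every arithmetic op.
    if isinstance(fill_value, float) and np.isnan(fill_value):
        return fill_value
    return LEFT_IDENTITIES.get((op, fill_value))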
> If there's a `nan` or `inf` in there, the values won't match NumPy (because `np.nan * 0 = np.nan` and `np.inf * 0 = np.nan`), so we need "dense" broadcasting in this case. To fix this, we would have to make the "switch" between the two approaches more complex.
This feels like a separate issue to me. How do you handle this in your current approach?
One option is to keep track of the finiteness of sparse arrays, e.g., by adding a (possibly cached) all_finite property.
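For instance, a cached `all_finite` flag could look like this sketch (the class and attribute names are assumed):

from functools import cached_property
import numpy as np

class COOSketch:
    def __init__(self, coords, data):
        self.coords, self.data = coords, data

    @cached_property
    def all_finite(self):
        # True when no stored value is nan/inf, so zero-preserving
        # shortcuts like 0 * x -> 0 are safe to use.
        return bool(np.isfinite(self.data).all())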
> It is guaranteed not to kick in for user-defined functions, for example `sparse.elemwise(lambda x, y: x * y, x, y)`.
I think I finally understand your approach: you are actually computing `func(inputs[0].data, 0)`, `func(0, inputs[1].data)`, etc., alongside non-zero elements in both arrays. This feels a little crazy to me if you can know it statically :).
I would suggest adding an Operation class for wrapping functions applied to sparse arrays that could add information like the types of identities it obeys. Builtin operations and NumPy ufuncs could be automatically be mapped to the appropriate Operation. You basically need this anyways if you want to let users write their own versions of functions like np.sin().
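A hedged sketch of what such an `Operation` wrapper might look like (the field names are assumptions):

from dataclasses import dataclass
from typing import Callable
import numpy as np

@dataclass(frozen=True)
class Operation:
    func: Callable
    zero_preserving: bool = False  # declares op(x, 0) == op(0, x) == 0

MUL = Operation(np.multiply, zero_preserving=True)
ADD = Operation(np.add)  # 0 + x != 0 in general, so no shortcut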
> This feels a little crazy to me if you can know it statically :).
The reasoning behind this was exactly that: It can be determined dynamically, instead of writing out identities like the ones you suggest for practically every ufunc (which would add a LOT of overhead). Sure, there's a performance cost there, but it's automated. :-)
Maintainability and performance have to be balanced. We would need to maintain a huge list of such identities, and still won't hit every case.
If we were to adopt this approach, we would need to list out a lot of identities and come up with ways to check them while deciding on sparse vs dense broadcasting. In this case, we would need to check for absence of nan and inf. I guess some sort of automatic checker could be made.
For users, it would also be harder to define operations. Rather than defining a simple callable, they would need to go through the process of implementing `Operation` and defining identities.
The extension to fill values is simple. You calculate something like `out_fill_value = func(all_fill_values)`, then check whether `func(some_fill_values, some_other_values) == out_fill_value`.
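Sketched concretely (the `fill_value` attribute is the one discussed in this thread; the helpers themselves are illustrative):

import numpy as np

def output_fill_value(func, inputs):
    # The result's fill value is just func applied to all input fills.
    return func(*[inp.fill_value for inp in inputs])

def partial_match_skippable(func, data, other_fill, out_fill):
    # A "partial match" can be skipped when combining stored values
    # with the other side's fill always lands back on out_fill.
    result = func(data, np.full_like(data, other_fill))
    return bool(np.all(result == out_fill))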
Implementing `Operation` feels a bit out of scope to me for this library, which just aims to mimic NumPy for sparse arrays.
To be clear: My approach is only exponential in the number of inputs, but linear in the number of output nonzeros (Well, n log n right now, but that can be fixed with Cython).
@hameerabbasi I can see the merits in dynamically discovering which operations preserve zeros. Exponential in the number of inputs should be perfectly fine, so I don't really have an objection here :).
Some of the edge cases in the algorithm you've proposed don't quite make sense to me yet, but maybe it will be clear when you write it in code. In particular, how would you handle a three argument function f(x, y, z) when only one argument is 0? To evaluate f(x, y, 0), you need to know already how to "broadcast" x and y against each other. I guess you already evaluated f(x, 0, 0) and f(0, y, 0), so you could build up this knowledge in a loop (like dynamic programming).
@shoyer In the case `f(x, y, 0)`, we do the following:
- Notice that `inputs[2]` is a scalar.
- Set `inputs = inputs[0:2]` (the non-scalar inputs).
- Set `f = lambda x, y: f(x, y, 0)`.
- Feed this as an input to the actual algorithm.
In general, it's more like this:
- Build a list of all the scalar inputs and their positions
- Set inputs to be the non-scalar inputs.
- Set function to be the "frozen" form of the function (see the sketch below).
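A hedged sketch of that scalar-freezing step (the real implementation is the branch code referenced in the next comment):

import numpy as np

def freeze_scalars(func, inputs):
    # Bind scalar inputs into func so the core broadcasting
    # algorithm only ever sees sparse arrays.
    scalars = {i: v for i, v in enumerate(inputs) if np.isscalar(v)}
    sparse_inputs = [v for i, v in enumerate(inputs) if i not in scalars]

    def frozen(*sparse_args):
        it = iter(sparse_args)
        args = [scalars[i] if i in scalars else next(it)
                for i in range(len(inputs))]
        return func(*args)

    return frozen, sparse_inputs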
You can see the actual code here, in particular this function which does the actual "freezing". I don't recommend diving into the other broadcasting code, as it's a bit of a mess and needs some spring cleaning and a bit of algorithmic fixing.
It's not final (I hope to make it so this weekend), but you'll get the gist of how I filter out scalars by "freezing" them into the function before running the actual algorithm.