Storing Sparse arrays to Zarr
Periodically we have had users request some way to store sparse arrays with Zarr. TBH this is already pretty doable today, as demonstrated by this comment. This strategy also works nicely with Sparse. Admittedly these examples show how it works with in-memory Zarr Arrays, though it would work just as well with any MutableMapping-derived store.
Given there seems to be a fair bit of interest in working with more general N-D sparse arrays and having a flexible way to store them, I am wondering whether it makes sense to provide some functionality in Sparse to store data in Zarr Arrays. Happy to answer any questions. Also would be interested to hear thoughts on this proposal. :)
Just so I understand: are you proposing making this library a back-end for Zarr, or making Zarr a back-end for this library?
The issue with the latter is that we use Numba to perform a number of operations (including arithmetic and indexing). Does Numba work with Zarr?
The other issue is that at this point in time, DOK is mutable but COO isn't.
It would be nice if you could enumerate what API or changes you would need in this library to make this possible -- I'm definitely willing to work with the Zarr team.
Short-term, having a way to load a Zarr Array into a Sparse array and to store a Sparse array into a Zarr Array would be pretty good. These could be similar to the from_numpy and todense methods. The latter case (storing) is pretty much solved; it would just benefit from having a convenience method. The former (loading) should be solvable any number of ways. As Zarr seems kind of similar to DOK, maybe that would be the easiest way to load it in.
Long-term, it would be interesting to have a Zarr-backed Sparse array. The main benefits would be working with larger-than-memory sparse arrays and/or working with other storage backends. However, this will take some more thought, as you have noted.
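As a rough illustration of that short-term round trip, the sketch below writes a COO-style triple (coordinates, values, shape) into a plain dict standing in for a MutableMapping store, then reads it back into a DOK-style dict. Everything here is a hypothetical stand-in: store_coo and load_dok are not part of Sparse or Zarr, and a real implementation would presumably write Zarr Arrays rather than JSON blobs.

```python
import json
from collections.abc import MutableMapping


def store_coo(store: MutableMapping, coords, data, shape) -> None:
    """Write a COO-style sparse array into a key/value store.

    coords is a list with one list of indices per axis, data is the list
    of nonzero values, and shape is the dense shape. Hypothetical helper.
    """
    if any(len(axis) != len(data) for axis in coords):
        raise ValueError("each coords axis must have one entry per value")
    store["coords"] = json.dumps(coords).encode()
    store["data"] = json.dumps(data).encode()
    store["shape"] = json.dumps(list(shape)).encode()


def load_dok(store: MutableMapping):
    """Read the layout written by store_coo back as (DOK dict, shape)."""
    coords = json.loads(store["coords"])
    data = json.loads(store["data"])
    shape = tuple(json.loads(store["shape"]))
    # zip(*coords) turns the per-axis lists into one index tuple per value.
    dok = {idx: value for idx, value in zip(zip(*coords), data)}
    return dok, shape


store = {}  # assumption: any MutableMapping would do here
store_coo(store, coords=[[0, 1, 2], [2, 0, 1]], data=[10, 20, 30], shape=(3, 3))
dok, shape = load_dok(store)
```

Only the nonzero entries and the shape are written, so the stored size tracks the number of nonzeros rather than the dense size.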
Okay, I just started thinking about this... Since the long-term goal of this project is to have SciPy depend on it, a dependency (even if optional) on Zarr wouldn't be so nice.
Of course, feel free to duck-patch COO on import so that sparse.COO.to_zarr and sparse.COO.from_zarr exist and do the right thing. 😄
As long as it doesn't rely on fringe functionality of sparse it should be fine.
That said, I'd recommend patching SparseArray.? rather than COO, and then doing .toformat(COO) or cls(coo_arr).
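A minimal sketch of what such a patch could look like, assuming stand-in classes rather than the real library: SparseArray and COO below are toy placeholders, the store is a plain dict, and no actual Zarr calls are made. Only the names to_zarr and from_zarr come from the suggestion above.

```python
class SparseArray:
    """Stand-in for the library's base class (assumption, not the real one)."""


class COO(SparseArray):
    """Stand-in COO holding per-axis coordinates, values, and shape."""

    def __init__(self, coords, data, shape):
        self.coords = coords
        self.data = data
        self.shape = tuple(shape)


def _to_zarr(self, store):
    """Save coords/data/shape into a MutableMapping-like store."""
    store["coords"] = self.coords
    store["data"] = self.data
    store["shape"] = self.shape


def _from_zarr(cls, store):
    """Rebuild an array of type cls from a store written by _to_zarr."""
    return cls(store["coords"], store["data"], store["shape"])


# The patch itself: attach the helpers to the base class so that every
# subclass, including COO, picks them up without being edited.
SparseArray.to_zarr = _to_zarr
SparseArray.from_zarr = classmethod(_from_zarr)

store = {}
COO([[0, 1], [1, 0]], [5, 7], (2, 2)).to_zarr(store)
restored = COO.from_zarr(store)
```

Patching the base class keeps the third-party code out of the library itself, which matches the concern about avoiding a Zarr dependency.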
I recently raised a Zarr issue on this: zarr-developers/zarr#424. I'm not sure what will come of it. Regardless, I like the idea of having the saving functionality live in Zarr.