[REVIEW] Support of alternative array classes
This PR implements support for alternative array types (other than NumPy arrays) by introducing a new argument, `meta_array`, that specifies the type/class of the underlying array.
The `meta_array` argument can be any class instance that can be used as the `like` argument in NumPy (see NEP 35).
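As a minimal sketch of the NEP 35 mechanism this relies on (requires NumPy >= 1.20; with a CuPy reference object the same call would return a CuPy array instead):

```python
import numpy as np

# NEP 35: creation functions accept a `like=` reference object and
# dispatch to its array type via __array_function__. With a NumPy
# reference this is a no-op; with e.g. cupy.empty(()) as the
# reference, the result would be a CuPy array.
ref = np.empty(())
a = np.asanyarray([1, 2, 3], like=ref)
print(type(a).__name__)  # ndarray
```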
Also, this PR implements support for CuPy arrays by introducing a CPU compressor:
```python
class CuPyCPUCompressor(Codec):
    """CPU compressor for CuPy arrays.

    This compressor moves CuPy arrays to host memory before compressing
    the arrays using `compressor`.

    Parameters
    ----------
    compressor : numcodecs.abc.Codec
        The codec to use for compression and decompression.
    """
```
Example

A common limitation in GPU programming is the amount of available GPU memory. In the following, we create a Zarr array of CuPy arrays whose chunks are saved to host memory, using the Zlib codec for compression/decompression:

```python
zarr.array(a, chunks=10, meta_array=cupy.empty(()), compressor=CuPyCPUCompressor(Zlib()))
```

If we don't want compression but still want to copy to/from host memory, we can do the following:

```python
zarr.array(a, chunks=10, meta_array=cupy.empty(()), compressor=CuPyCPUCompressor())
```

If the store handles CuPy arrays directly, we can disable compression entirely:

```python
zarr.array(a, chunks=10, meta_array=cupy.empty(()), compressor=None)
```
Notice
- This feature becomes more relevant once we have stores that support direct read/write of GPU memory. E.g. a store that uses GPUDirect Storage (GDS) would be able to bypass host memory when copying GPU memory to/from disk.
- This PR implements new versions of `ensure_ndarray()` and `ensure_contiguous_ndarray()` that accept CuPy arrays as well as NumPy arrays. Instead, we should properly implement more generic versions of them in numcodecs (see https://github.com/zarr-developers/numcodecs/pull/212).
- This PR depends on NumPy version 1.20 since it uses NEP 35. Is this an issue?
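For illustration, a generic version could duck-type on the CUDA array interface rather than importing CuPy directly. This is only a sketch (the function name is hypothetical); the numcodecs PR linked above is the proper home for this:

```python
import numpy as np

def ensure_contiguous_ndarray_like(buf):
    # Device arrays advertise __cuda_array_interface__ (CuPy, Numba,
    # CUDA PyTorch tensors, ...): leave them on the device instead of
    # force-casting to NumPy. Everything else is coerced to a
    # contiguous ndarray.
    if hasattr(buf, "__cuda_array_interface__"):
        return buf
    return np.ascontiguousarray(np.asanyarray(buf))

arr = ensure_contiguous_ndarray_like([1, 2, 3])
assert arr.flags["C_CONTIGUOUS"]
```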
TODO
- [ ] Add unit tests and/or doctests in docstrings
- [ ] Add docstrings and API docs for any new/modified user-facing classes and functions
- [ ] New/modified features documented in docs/tutorial.rst
- [ ] Changes documented in docs/release.rst
- [ ] GitHub Actions have all passed
- [ ] Test coverage is 100% (Codecov passes)
cc. @jakirkham
Hello @madsbk! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:
There are currently no PEP 8 issues detected in this Pull Request. Cheers! :beers:
Comment last updated at 2022-09-05 12:30:20 UTC
Codecov Report
Merging #934 (3ed7a7e) into main (f6698f6) will increase coverage by 0.00%. The diff coverage is 100.00%.

```diff
@@           Coverage Diff           @@
##             main     #934   +/-   ##
========================================
  Coverage   99.94%   99.95%
========================================
  Files          35       36     +1
  Lines       13965    14084   +119
========================================
+ Hits        13958    14077   +119
  Misses          7        7
```

| Impacted Files | Coverage Δ | |
|---|---|---|
| zarr/creation.py | 100.00% <ø> (ø) | |
| zarr/core.py | 100.00% <100.00%> (ø) | |
| zarr/hierarchy.py | 99.79% <100.00%> (+<0.01%) | :arrow_up: |
| zarr/storage.py | 100.00% <100.00%> (ø) | |
| zarr/tests/test_meta_array.py | 100.00% <100.00%> (ø) | |
| zarr/util.py | 100.00% <100.00%> (ø) | |
cc @d-v-b @joshmoore (as we discussed this last week, it would be good to get your takes as well 🙂)
cc @d70-t (in case you have thoughts here as well 🙂)
Knowing little about the requirements here, just looking at the example:

```python
zarr.array(a, chunks=10, meta_array=cupy.empty(()), compressor=CuPyCPUCompressor(Zlib()))
```
I wonder a couple of things:
- Could this be done higher-level? As @jakirkham mentions above, could array initialization get this from the store? Nothing against making it an option here, but I imagine you will typically want to ignore the specifics by the time you get to this point in your code. (We may need a higher-level configuration object to not need to pass every flag down to arrays.)
- Similarly, I wonder if a compressor wrapper is the best strategy. It may well be, but it would be good to have an idea of other related use cases. Are there any inspection methods on codecs to inform the wrappers of what they need to know? What other wrappers could we imagine? And as always, could we make it work without the user needing to remember to add the wrapper? (perhaps by making use of @data-apis style methods?)
> Could this be done higher-level? As @jakirkham mentions above, could array initialization get this from the store? Nothing against making it an option here, but I imagine you will typically want to ignore the specifics by the time you get to this point in your code. (We may need a higher-level configuration object to not need to pass every flag down to arrays.)
Agree, it would be a good idea to have `Store` define a default `meta_array`.
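A rough sketch of what a store-level default might look like. This API is entirely hypothetical: `Store` and `open_array` here are minimal stand-ins, not the real zarr classes.

```python
import numpy as np

class Store:
    # Hypothetical: a store advertises the array type it naturally
    # produces; a GPU-aware store would return e.g. cupy.empty(()).
    @property
    def meta_array(self):
        return np.empty(())

def open_array(store, meta_array=None):
    # Fall back to the store's default when the caller passes none.
    if meta_array is None:
        meta_array = store.meta_array
    return meta_array

print(type(open_array(Store())).__name__)  # ndarray
```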
I'm with @joshmoore that this example:

```python
z = zarr.array(a, chunks=10, meta_array=cupy.empty(()), compressor=CuPyCPUCompressor(Zlib()))
```

looks a little odd.
Sorry, I should have been more clear: the `CuPyCPUCompressor` here is meant as a way to use the existing stores and compressors together with CuPy without having to make them all GPU aware. This way, we can implement GPU aware stores and compressors iteratively.
The main goal of this PR is to make it possible to write GPU aware stores without having Zarr force-casting everything to NumPy arrays.
When thinking of what I'd want to express in this line, I'd say:

> I want to put my array `a` onto a store (which is implicitly `dict` in this case), in chunks of 10, and it should be zlib-compressed.
Agree: when writing arrays it should be easy enough to infer the array-like type, but when creating a Zarr array from scratch, we don't have that information. Now if I understand you correctly, you are saying that we could delay the array-like type inference until the user reads or writes the Zarr array. And if we have compatible compressors, we can even write using one array-like type and read using another. I think this is a great idea, and this PR actually gets us some of the way there.
> Sorry, I should have been more clear: the `CuPyCPUCompressor` here is meant as a way to use the existing stores and compressors together with CuPy without having to make them all GPU aware. This way, we can implement GPU aware stores and compressors iteratively.
True, a compressor-wrapper seems to be a relatively light way to implement such GPU aware stores. But I'm wondering if that's really the API we'll want to end up with. Probably we'll also want to end up having such compressor-wrappers as a simple way to make more compressor-implementations, but maybe we don't want to write them out at this function call. That's about what I meant with separating compressor-implementations and compressor-specifications. We likely want to create zlib-compressed data on the store (the specification), but might want to use the CuPyCPU(Zlib) implementation for the transformation at that very moment of writing the data.
> Agree, when writing arrays it should be easy enough to infer the array-like type but when creating a Zarr array from scratch, we don't have that information.
Do we need that information, when creating the array from scratch? The array will probably only contain the metadata keys and those should maybe be independent of the compressor-implementation.
> Now if I understand you correctly, you are saying that we could delay the array-like type inference until the user reads or writes the Zarr array. And if we have compatible compressors, we can even write using one array-like type and read using another array-like type.
Yes, that's about what I had in mind. It would be great to have arrays of any kind on one side and stores and compression specifications on the other side and some flexible, auto-optimizing routing in between. I believe that the specific algorithm of how to compress / decompress and move the data in between should neither belong to the store nor to the array-type.
In particular, it seems to be a totally valid approach to use e.g. a CPU to write data to some (e.g. S3-) store and read data back from the same store later on via GPU. If one might want to go totally crazy, one might even write some parts of an array using CPU and other parts using GPU (I know of some people in C++ communities who work on parallelizing loops such that parts of the same loop run on available CPUs and parts run on available GPUs in order to get the most out of a system).
I just stumbled again across the mentioned entrypoints... It's probably worth a thought if something roughly like:
```python
compressors = [
    {"array": "numpy", "algorithm": "zlib", "compressor": Zlib, "priority": 1},
    {"array": "cupy", "algorithm": "zlib", "compressor": CuPyCPUCompressor(Zlib), "priority": 2},
    {"array": "cupy", "algorithm": "zlib", "compressor": CuPyGDSZlib, "priority": 1},
    ...
]
```
could be established via entrypoints? That list would depend on which libraries are present on the system and would help find the right compressor implementation for a given combination of array and codec... Still, I haven't thought yet about how that might interfere with filters, which may also play a role...
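A minimal sketch of how such a registry lookup could work. The registry shape follows the snippet above, but strings stand in for real codec classes so the example is self-contained, and `find_compressor` is a hypothetical name:

```python
# Hypothetical registry entries mirroring the sketch above.
REGISTRY = [
    {"array": "numpy", "algorithm": "zlib", "compressor": "Zlib", "priority": 1},
    {"array": "cupy", "algorithm": "zlib", "compressor": "CuPyCPUCompressor(Zlib)", "priority": 2},
    {"array": "cupy", "algorithm": "zlib", "compressor": "CuPyGDSZlib", "priority": 1},
]

def find_compressor(registry, array, algorithm):
    # Pick the matching entry with the best (lowest) priority value.
    candidates = [e for e in registry
                  if e["array"] == array and e["algorithm"] == algorithm]
    if not candidates:
        raise LookupError(f"no compressor for {array}/{algorithm}")
    return min(candidates, key=lambda e: e["priority"])["compressor"]

print(find_compressor(REGISTRY, "cupy", "zlib"))  # CuPyGDSZlib
```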
@d70-t I agree with your vision but I think we should move in small steps.
I think we can all agree that we need some way to specify the desired array-like type of a getitem operation. For backward compatibility and for convenience, I suggest that we specify this through the `meta_array` argument on `zarr.Array` creation. But this is only the default array-like type; I agree with @d70-t that it should be possible to override the default type using something like:

```python
selection = z.get(slice(3, 7), meta_array=cupy.empty(()))
```

And as @joshmoore suggests, we might even make it configurable at a higher level (e.g. through a config file and/or context manager).
However, I think this is outside of the scope of this PR. The goal here is to avoid force-casting all data to NumPy arrays.
In any case, up to now, the `CuPyCPUCompressor` (which I consider a description of the path between array and store) would not be visible at any place, and it might also be different for different set/get calls of the same zarr array `z`. It might be given as an optional hint on creation of the array, if a specific path is desired, probably even with a way of overriding it on set or get. In any case, I believe that having any implementation (path) specific information stored in zarr metadata would be a bad thing. In particular, I'd argue that `cupy_cpu_compressor` should maybe not appear in the `.zarray` metadata.
Again, I agree. I can move the implementation of `CuPyCPUCompressor` to `test_cupy.py` to make it clear that it is just a way to use CuPy arrays with the current API and is not part of the official API.
:+1: for moving in small steps. And I also agree to the rest of the post @madsbk
The only vision-wise thing that I believe should be accounted for right now is that indications of the (in-memory) array type should not be part of the zarr metadata (and I'm still uncertain whether it would be ok to do so as part of a test; probably that's ok). But that's more a personal feeling; it should maybe be up for discussion, and others might have different opinions.
For more context: this is intended to be used with KvikIO, which handles data loading from disk to the GPU.
Rebased and fixed the conflicts. Currently seeing a number of these:

```pytb
    if not hasattr(value, 'shape'):
>       value = np.asanyarray(value, like=self._meta_array)
E       TypeError: asanyarray() got an unexpected keyword argument 'like'
```
@joshmoore do you know the minimum version of NumPy used in the tests? I believe NEP-35 was included in NumPy 1.20.
Ah, ok. There's a job that pins to 1.17 for testing the minimal version.
We use the following in dask to check:

```python
_numpy_120 = _np_version >= parse_version("1.20.0")

@pytest.mark.skipif(not _numpy_120, reason="NEP-35 is not available")
def test_foo():
    ...
```

Would that be ok here?
@madsbk can you restrict the test to only run when `numpy.__version__ >= 1.20.0`? I imagine this is the last piece here.
> @madsbk can you restrict the test to only run when `numpy.__version__ >= 1.20.0`? I imagine this is the last piece here.
Will do, but right now I am waiting for https://github.com/zarr-developers/numcodecs/pull/305 to be merged, which will simplify this PR quite a bit :)
New conflicts due to merge of the v3 branch are resolved in PR https://github.com/madsbk/zarr-python/pull/1
The plan going forward
- Wait for https://github.com/zarr-developers/zarr-python/pull/988 to be merged.
- Simplify this PR and reduce the scope to the original goal of avoiding force-casting all data to NumPy arrays.
- Move the `CuPyCPUCompressor` class to the test suite; it will not be part of the API.
- Merge this PR.
- Gain experience using Zarr with CuPy arrays.
- Introduce new API and semantics to ease the transition between NumPy- and CuPy-backed Zarr arrays.
This is ready for another round of reviews
cc. @jakirkham, @d70-t
@madsbk: I'm slowly working my way towards a re-review here, but in the meantime, any thoughts on the codecov front?
> @madsbk: I'm slowly working my way towards a re-review here, but in the meantime, any thoughts on the codecov front?
The codecov errors can be ignored. As far as I can see, all of them are false negatives.
> The codecov errors can be ignored. As far as I can see, all of them are false negatives.
Or is the issue that the cupy tests aren't being run and therefore the meta_array property is not accessed anywhere?
> Or is the issue that the cupy tests aren't being run and therefore the `meta_array` property is not accessed anywhere?
You are absolutely right! No idea how I could miss that :)
Adding tests of meta_array when CuPy isn't available.
Hi @joshmoore, anything you need from my side to get progress here?
Hi @madsbk. No, sorry. This just landed on the vacation :mountain:. I'll start reviewing for a 2.13 release ASAP.
Rolling into a 2.13 pre-release. If anyone has any lingering concerns about the API, there would still be a (small) window to address them in.
Thanks Mads for working on this and Josh for reviewing! 🙏