[REVIEW] Support of alternative array classes

Open madsbk opened this issue 4 years ago • 25 comments

This PR implements support for alternative array types (other than NumPy arrays) by introducing a new argument, meta_array, that specifies the type/class of the underlying array.

The meta_array argument can be any array-like instance that can be used as the like argument in NumPy's array-creation functions (see NEP 35).
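
As an illustration of that dispatch mechanism (a minimal sketch, assuming NumPy >= 1.20 and a CuPy installation):

import numpy as np
import cupy

# NEP 35: `like=` dispatches array creation to the array type of the
# reference object, so this returns a CuPy array rather than a NumPy one.
gpu_arr = np.asarray([1, 2, 3], like=cupy.empty(()))
print(type(gpu_arr))  # expected: <class 'cupy.ndarray'>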

Also, this PR implements support for CuPy arrays by introducing a CPU compressor:


class CuPyCPUCompressor(Codec):
    """CPU compressor for CuPy arrays

    This compressor moves CuPy arrays to host memory before compressing
    the arrays using `compressor`.

    Parameters
    ----------
    compressor : numcodecs.abc.Codec
        The codec to use for compression and decompression.
    """

Example
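
The snippets below rely on names the original text leaves implicit; a setup along these lines is assumed (the contents of a are arbitrary):

import cupy
import zarr
from numcodecs import Zlib

a = cupy.arange(100)  # any CuPy array to be stored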

A common limitation in GPU programming is available GPU memory. In the following, we create a Zarr array of CuPy arrays that are saved to host memory using the Zlib codec for compression/decompression:

zarr.array(a, chunks=10, meta_array=cupy.empty(()), compressor=CuPyCPUCompressor(Zlib()))

If we don't want compression but still want to copy to/from host memory, we can do the following:

zarr.array(a, chunks=10, meta_array=cupy.empty(()), compressor=CuPyCPUCompressor())

If the store handles CuPy arrays directly, we can disable compression:

zarr.array(a, chunks=10, meta_array=cupy.empty(()), compressor=None)

Notice

  • This feature becomes more relevant when we have stores that support direct read/write of GPU memory. For example, a store that uses GPUDirect Storage (GDS) would be able to bypass host memory when copying GPU memory to/from disk.
  • This PR implements new versions of ensure_ndarray() and ensure_contiguous_ndarray() that accept CuPy arrays as well as NumPy arrays. Longer term, we should implement more generic versions of them in numcodecs instead (see https://github.com/zarr-developers/numcodecs/pull/212); a rough sketch of the idea follows after this list.
  • This PR depends on NumPy version 1.20, since it uses NEP 35. Is this an issue?
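
A rough sketch of the idea behind such a generic helper (hypothetical code for illustration, not this PR's or numcodecs' implementation):

import numpy as np

def ensure_ndarray_like(buf):
    # Hypothetical: keep GPU buffers on their device instead of forcing a
    # host copy through np.asarray; fall back to NumPy for everything else.
    if hasattr(buf, "__cuda_array_interface__"):
        import cupy
        return cupy.asarray(buf)
    return np.asarray(buf)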

TODO

  • [ ] Add unit tests and/or doctests in docstrings
  • [ ] Add docstrings and API docs for any new/modified user-facing classes and functions
  • [ ] New/modified features documented in docs/tutorial.rst
  • [ ] Changes documented in docs/release.rst
  • [ ] GitHub Actions have all passed
  • [ ] Test coverage is 100% (Codecov passes)

cc. @jakirkham

madsbk avatar Jan 11 '22 13:01 madsbk

Hello @madsbk! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! :beers:

Comment last updated at 2022-09-05 12:30:20 UTC

pep8speaks avatar Jan 11 '22 17:01 pep8speaks

Codecov Report

Merging #934 (3ed7a7e) into main (f6698f6) will increase coverage by 0.00%. The diff coverage is 100.00%.

@@           Coverage Diff            @@
##             main     #934    +/-   ##
========================================
  Coverage   99.94%   99.95%            
========================================
  Files          35       36     +1     
  Lines       13965    14084   +119     
========================================
+ Hits        13958    14077   +119     
  Misses          7        7            
Impacted Files                   Coverage               Δ
zarr/creation.py                 100.00% <ø>            (ø)
zarr/core.py                     100.00% <100.00%>      (ø)
zarr/hierarchy.py                 99.79% <100.00%>      (+<0.01%) :arrow_up:
zarr/storage.py                  100.00% <100.00%>      (ø)
zarr/tests/test_meta_array.py    100.00% <100.00%>      (ø)
zarr/util.py                     100.00% <100.00%>      (ø)

codecov[bot] avatar Jan 11 '22 19:01 codecov[bot]

cc @d-v-b @joshmoore (as we discussed this last week, it would be good to get your takes as well 🙂)

jakirkham avatar Jan 21 '22 08:01 jakirkham

cc @d70-t (in case you have thoughts here as well 🙂)

jakirkham avatar Jan 21 '22 08:01 jakirkham

Knowing little about the requirements here, just looking at the example:

zarr.array(a, chunks=10, meta_array=cupy.empty(()), compressor=CuPyCPUCompressor(Zlib()))

I wonder a couple of things:

  • Could this be done at a higher level? As @jakirkham mentions above, could array initialization get this from the store? Nothing against making it an option here, but I imagine you will typically want to ignore the specifics by the time you get to this point in your code. (We may need a higher-level configuration object so we don't have to pass every flag down to arrays.)
  • Similarly, I wonder if a compressor wrapper is the best strategy. It may well be, but it would be good to have an idea of other related use cases. Are there any inspection methods on codecs to inform the wrappers of what they need to know? What other wrappers could we imagine? And as always, could we make it work without the user needing to remember to add the wrapper? (Perhaps by making use of @data-apis style methods?)

joshmoore avatar Jan 21 '22 08:01 joshmoore

  • Could this be done at a higher level? As @jakirkham mentions above, could array initialization get this from the store? Nothing against making it an option here, but I imagine you will typically want to ignore the specifics by the time you get to this point in your code. (We may need a higher-level configuration object so we don't have to pass every flag down to arrays.)

Agree, it would be a good idea to have Store define a default meta_array.
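
For instance, something along these lines (hypothetical, not part of this PR; GPUMemoryStore and its meta_array property are invented for illustration):

import cupy
import zarr.storage


class GPUMemoryStore(zarr.storage.MemoryStore):
    # A store could advertise the in-memory array type it prefers, and
    # zarr.Array creation could pick this up as the default meta_array.
    @property
    def meta_array(self):
        return cupy.empty(())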

I'm with @joshmoore that this example:

z = zarr.array(a, chunks=10, meta_array=cupy.empty(()), compressor=CuPyCPUCompressor(Zlib()))

looks a little odd.

Sorry, I should have been clearer: the CuPyCPUCompressor here is meant as a way to use the existing stores and compressors together with CuPy without having to make them all GPU aware. This way, we can implement GPU-aware stores and compressors iteratively. The main goal of this PR is to make it possible to write GPU-aware stores without having Zarr force-cast everything to NumPy arrays.

When thinking of what I'd want to express in this line, I'd say

I want to put my array a onto a store (which is implicitly dict in this case), in chunks of 10 and it should be zlib-compressed.

Agree, when writing arrays it should be easy enough to infer the array-like type but when creating a Zarr array from scratch, we don't have that information. Now if I understand you correctly, you are saying that we could delay the array-like type inference until the user reads or writes the Zarr array. And if we have compatible compressors, we can even write using one array-like type and read using another array-like type. I think this is a great idea and I think this PR actually gets us some of the way.

madsbk avatar Jan 21 '22 13:01 madsbk

Sorry, I should have been clearer: the CuPyCPUCompressor here is meant as a way to use the existing stores and compressors together with CuPy without having to make them all GPU aware. This way, we can implement GPU-aware stores and compressors iteratively.

True, a compressor wrapper seems to be a relatively lightweight way to implement such GPU-aware stores. But I'm wondering if that's really the API we'll want to end up with. We'll probably also want such compressor wrappers as a simple way to build more compressor implementations, but maybe we don't want to write them out at this function call. That's about what I meant by separating compressor implementations from compressor specifications. We likely want to create zlib-compressed data on the store (the specification), but might want to use the CuPyCPU(Zlib) implementation for the transformation at the very moment of writing the data.

Agree, when writing arrays it should be easy enough to infer the array-like type but when creating a Zarr array from scratch, we don't have that information.

Do we need that information when creating the array from scratch? The array will probably only contain the metadata keys, and those should maybe be independent of the compressor implementation.

Now if I understand you correctly, you are saying that we could delay the array-like type inference until the user reads or writes the Zarr array. And if we have compatible compressors, we can even write using one array-like type and read using another array-like type.

Yes, that's about what I had in mind. It would be great to have arrays of any kind on one side, stores and compression specifications on the other side, and some flexible, auto-optimizing routing in between. I believe that the specific algorithm for how to compress/decompress and move the data in between should belong neither to the store nor to the array type.

In particular, it seems to be a totally valid approach to use e.g. a CPU to write data to some (e.g. S3) store and read the data back from the same store later on via a GPU. If one wanted to go totally crazy, one might even write some parts of an array using the CPU and other parts using the GPU (I know of some people in C++ communities who work on parallelizing loops such that parts of the same loop run on available CPUs and parts run on available GPUs, in order to get the most out of a system).

d70-t avatar Jan 21 '22 15:01 d70-t

I just stumbled again across the mentioned entrypoints... It's probably worth a thought whether something roughly like:

compressors = [
    {"array": "numpy", "algorithm": "zlib", "compressor": Zlib, "priority": 1},
    {"array": "cupy", "algorithm": "zlib", "compressor": CuPyCPUCompressor(Zlib), "priority": 2},
    {"array": "cupy", "algorithm": "zlib", "compressor": CuPyGDSZlib, "priority": 1},
    ...
]

could be established via entrypoints? That list would depend on which libraries are present on the system and would help find the right compressor implementation for a given combination of array and codec. I still haven't thought about how that might interact with filters, which may also play a role...
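
To make the idea concrete, the lookup over such a registry might be as simple as this (a sketch against the hypothetical list above, ignoring its trailing ellipsis):

def pick_compressor(array_type, algorithm):
    # Pick the matching implementation with the best (lowest) priority.
    candidates = [c for c in compressors
                  if c["array"] == array_type and c["algorithm"] == algorithm]
    return min(candidates, key=lambda c: c["priority"])["compressor"]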

d70-t avatar Jan 21 '22 15:01 d70-t

@d70-t I agree with your vision but I think we should move in small steps.

I think we can all agree that we need some way to specify the desired array-like type of a getitem operation. For backward compatibility and convenience, I suggest that we specify this through the meta_array argument at zarr.Array creation. But this is only the default array-like type; I agree with @d70-t that it should be possible to override the default using something like:

selection = z.get(slice(3,7), meta_array=cupy.empty(())) 

And as @joshmoore suggests, we might even make it configurable at a higher level (e.g. through a config file and/or a context manager).

However, I think this is outside of the scope of this PR. The goal here is to avoid force-casting all data to NumPy arrays.

In any case, up to now, the CuPyCPUCompressor (which I consider a description of the path between array and store) would not be visible anywhere, and it might also be different for different set/get calls on the same zarr array z. It might be given as an optional hint on creation of the array, if a specific path is desired, probably even with a way of overriding it on set or get. In any case, I believe that having any implementation- (path-) specific information stored in zarray metadata would be a bad thing. In particular, I'd argue that cupy_cpu_compressor should maybe not appear in .zarray metadata.

Again, I agree. I can move the implementation of CuPyCPUCompressor to test_cupy.py to make it clear that it is just a way to use CuPy arrays with the current API and is not part of the official API.

madsbk avatar Jan 24 '22 08:01 madsbk

:+1: for moving in small steps. And I also agree with the rest of the post, @madsbk.

The only vision-wise thing which I believe should be accounted for right now is that indications of the (in-memory) array type should not be part of the zarr metadata (and I'm still uncertain about whether it would be ok to do so as part of a test; probably that's ok). But that's more a personal feeling, it should maybe be up for discussion, and others might have different opinions.

d70-t avatar Jan 24 '22 10:01 d70-t

For more context, this is intended to be used with KvikIO, which handles the data loading from disk to GPU.

jakirkham avatar Jan 28 '22 19:01 jakirkham

Rebased and fixed the conflicts. Currently seeing a number of these:

if not hasattr(value, 'shape'):
>               value = np.asanyarray(value, like=self._meta_array)
E               TypeError: asanyarray() got an unexpected keyword argument 'like'

joshmoore avatar Feb 08 '22 14:02 joshmoore

@joshmoore do you know the minimum version of numpy used in the tests? I believe NEP-35 was included in NumPy 1.20.

quasiben avatar Feb 08 '22 15:02 quasiben

Ah, ok. There's a job that pins to 1.17 for testing the minimal version.

joshmoore avatar Feb 08 '22 16:02 joshmoore

We use the following in dask to check:

_np_version = parse_version(np.__version__)  # parse_version from packaging.version
_numpy_120 = _np_version >= parse_version("1.20.0")

@pytest.mark.skipif(not _numpy_120, reason="NEP-35 is not available")
def test_foo():
    ...

Would that be ok here?

quasiben avatar Feb 08 '22 16:02 quasiben

@madsbk can you restrict the test to only run when numpy.__version__ >= 1.20.0? I imagine this is the last piece here.

quasiben avatar Feb 23 '22 19:02 quasiben

@madsbk can you restrict the test to only run when numpy.__version__ >= 1.20.0? I imagine this is the last piece here.

Will do, but right now I am waiting for https://github.com/zarr-developers/numcodecs/pull/305 to be merged, which will simplify this PR quite a bit :)

madsbk avatar Feb 24 '22 13:02 madsbk

New conflicts due to merge of the v3 branch are resolved in PR https://github.com/madsbk/zarr-python/pull/1

grlee77 avatar May 24 '22 00:05 grlee77

New conflicts due to merge of the v3 branch are resolved in PR madsbk#1

Thanks @grlee77 !

madsbk avatar May 30 '22 07:05 madsbk

The plan going forward

  1. Wait for https://github.com/zarr-developers/zarr-python/pull/988 to be merged.
  2. Simplify this PR and reduce the scope to the original goal of avoiding force-casting all data to NumPy arrays.
  3. Move the CuPyCPUCompressor class to the test suite; it will not be part of the API.
  4. Merge this PR.
  5. Gain experience using Zarr with CuPy arrays.
  6. Introduce new API and semantics to ease the transition between NumPy- and CuPy-backed Zarr arrays.

madsbk avatar May 30 '22 07:05 madsbk

This is ready for another round of reviews

cc. @jakirkham, @d70-t

madsbk avatar Aug 02 '22 11:08 madsbk

@madsbk: I'm slowly working my way towards a re-review here, but in the meantime, any thoughts on the codecov front?

joshmoore avatar Aug 06 '22 02:08 joshmoore

@madsbk: I'm slowly working my way towards a re-review here, but in the meantime, any thoughts on the codecov front?

The codecov errors can be ignored. As far as I can see, all of them are false negatives.

madsbk avatar Aug 08 '22 07:08 madsbk

The codecov errors can be ignored. As far as I can see, all of them are false negatives.

Or is the issue that the cupy tests aren't being run and therefore the meta_array property is not accessed anywhere?

joshmoore avatar Aug 11 '22 01:08 joshmoore

Or is the issue that the cupy tests aren't being run and therefore the meta_array property is not accessed anywhere?

You are absolutely right! No idea how I could miss that :)

Adding tests of meta_array when CuPy isn't available.
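
A CPU-only variant could look roughly like this (illustrative, not the PR's actual test; it exercises the meta_array path without requiring CuPy):

import numpy as np
import zarr


def test_meta_array_with_numpy():
    # meta_array=np.empty(()) keeps the default NumPy behaviour while still
    # routing through the new meta_array machinery, so no GPU is needed.
    z = zarr.array(np.arange(100), chunks=10, meta_array=np.empty(()))
    assert isinstance(z[:], np.ndarray)
    assert isinstance(z.meta_array, np.ndarray)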

madsbk avatar Aug 11 '22 07:08 madsbk

Hi @joshmoore, is there anything you need from my side to make progress here?

madsbk avatar Sep 05 '22 12:09 madsbk

Hi @madsbk. No, sorry, this just landed during the vacation :mountain:. I'll start reviewing for a 2.13 release ASAP.

joshmoore avatar Sep 06 '22 11:09 joshmoore

Rolling into a 2.13 pre-release. If anyone has any lingering concerns about the API, there would still be a (small) window in which to address them.

joshmoore avatar Sep 08 '22 07:09 joshmoore

Thanks Mads for working on this and Josh for reviewing! 🙏

jakirkham avatar Sep 23 '22 17:09 jakirkham