When opening a Zarr store, the `chunks='auto'` kwarg seems to be ignored
Hi,
I noticed a discrepancy between the behaviour of xarray's `open_zarr` and datatree's `open_datatree` with `engine='zarr'`.
I documented it in a pre-executed notebook available at https://github.com/etienneschalk/datatree-experimentation/blob/main/notebooks/bug-chunk-auto-not-considered.ipynb (the whole project can be cloned and executed locally if needed, it requires poetry)
To summarize:
Actual:
- xarray's `open_zarr`
  - No `chunks` kwarg: stored chunks are used.
  - With `chunks='auto'`: stored chunks are used.
- datatree's `open_datatree` with `engine='zarr'`
  - No `chunks` kwarg: no chunking is performed.
  - With `chunks='auto'`: a chunk identical to the shape of the data is used. This means chunking is useless, as there is only a single chunk representing the whole dataset.
Expected:
I expected a similar behaviour from datatree as the one from xarray. Since Zarr is a format that natively handles chunks, I would have expected that when opening a Zarr store with no `chunks` kwarg or with `chunks='auto'`, the stored chunks would be used.
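For reference, a minimal sketch of the comparison (the store path and variable name are placeholders, and the exact chunk values depend on how the store was written):

```python
import xarray as xr
from datatree import open_datatree

store = "example.zarr"  # placeholder: an existing Zarr store written with non-trivial chunks

# xarray's open_zarr: the chunks stored on disk are used (with or without chunks='auto')
ds = xr.open_zarr(store, chunks="auto")
print(ds["some_variable"].chunks)  # matches the on-disk chunking

# datatree's open_datatree with engine='zarr' and chunks='auto':
# as described above, each variable ends up as a single chunk spanning its whole shape
dt = open_datatree(store, engine="zarr", chunks="auto")
print(dt["some_variable"].chunks)
```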
Thanks!
Thanks for raising this @etienneschalk !
open_datatree internally calls xarray.open_dataset (not xarray.open_zarr) but there are currently some differences between xarray.open_dataset and xarray.open_zarr. Does the behaviour of open_datatree differ from xarray.open_dataset?
Regardless, the behaviour you describe does sound desirable, so we should fix that somewhere in the stack.
cc @jhamman
Hello @TomNicholas ,
> Does the behaviour of open_datatree differ from xarray.open_dataset?
After testing, the behaviour of open_datatree is indeed identical to xarray's open_dataset:
- xarray's `open_dataset`
  - No `chunks` kwarg: no chunking is performed. ~~Stored chunks are used.~~
  - With `chunks='auto'`: a chunk identical to the shape of the data is used. This means chunking is useless, as there is only a single chunk representing the whole dataset. ~~Stored chunks are used.~~
which is consistent with your statement:
> open_datatree internally calls xarray.open_dataset
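A sketch of the check I ran (same placeholder store path and variable name as in my first message):

```python
import xarray as xr

store = "example.zarr"  # placeholder path to a chunked Zarr store

# No chunks kwarg: no chunking is performed (variables are not dask-backed)
ds_default = xr.open_dataset(store, engine="zarr")
print(ds_default["some_variable"].chunks)  # None

# chunks='auto': a single chunk covering the whole variable
ds_auto = xr.open_dataset(store, engine="zarr", chunks="auto")
print(ds_auto["some_variable"].chunks)
```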
In that case, I would suggest to:
- Keep the behaviour of `open_datatree`, as it uses and is of the same family as `open_dataset` (`open_{xarray's data structure}` syntax)
- :arrow_forward: Add a new `datatree.open_zarr()` function, with the same behaviour as `xarray.open_zarr`, maybe using it internally too, and update the documentation of `open_datatree` to nudge users into using `open_zarr` instead if they want to use Zarr
Do you think this would be a good idea?
Thanks!
Thank you for testing!
Again, this is an upstream xarray issue. Datatree should follow whatever xarray's behaviour is.
> Add a new `datatree.open_zarr()` function, with the same behaviour as `xarray.open_zarr`, maybe using it internally too.
This might be a good idea, but xarray currently has both open_zarr and open_dataset, and there is an unresolved discussion about whether to get rid of one in favour of the other...
Hi @TomNicholas
> This might be a good idea, but xarray currently has both open_zarr and open_dataset, and there is an unresolved discussion about whether to get rid of one in favour of the other...
If I understand correctly, this means that while this discussion is not settled, implementing an `open_zarr` in datatree might be a waste of effort, in case `open_zarr` ends up being integrated into `open_dataset`. However, if that happens in upstream xarray, the correct behaviour of `open_zarr` should be kept, not that of the existing `open_dataset`, which does not handle stored chunks properly.
Do you have a link to this discussion, by any chance? I would be interested to learn more about this.
Thanks, have a nice day!
The difference between `open_zarr` and `open_dataset` is that for `open_zarr`, `"auto"` (the default) translates to `{}` if a chunk manager is available (like dask or cubed) or `None` otherwise, and the result is then forwarded to `open_dataset`. For `open_dataset`, the default is `None` (no chunks), while `{}` means the same as for `open_zarr` and `"auto"` is dask's auto-chunking (see `dask.array.Array.rechunk` for more details). So in summary, `open_zarr` is a wrapper around `open_dataset`, with a different default for `chunks` and a different meaning for `"auto"`.
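To make that mapping concrete, a sketch of the different spellings (placeholder store path; this assumes dask is installed so a chunk manager is available):

```python
import xarray as xr

store = "example.zarr"  # placeholder path to an existing chunked Zarr store

# open_dataset default (chunks=None): lazily loaded, but not chunked/dask-backed
ds_none = xr.open_dataset(store, engine="zarr")

# chunks={}: each variable becomes a dask array using the chunks stored on disk
ds_disk = xr.open_dataset(store, engine="zarr", chunks={})

# chunks="auto": dask's auto-chunking decides the chunk sizes
# (see dask.array.Array.rechunk)
ds_auto = xr.open_dataset(store, engine="zarr", chunks="auto")

# open_zarr defaults to chunks="auto", but translates it to {} (or None without
# a chunk manager), so you get the on-disk chunking by default
ds_zarr = xr.open_zarr(store)
```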
I believe the whole "`"auto"` actually means `{}` for `open_zarr`" situation is just confusing, so maybe we should aim to harmonize this in xarray (like, switch the default immediately and emit a deprecation warning if `"auto"` is passed).
Edit: this means that to get the on-disk chunking you can use `open_datatree(..., chunks={})`
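In datatree terms that would look something like this (placeholder store path):

```python
from datatree import open_datatree

# chunks={} forwards the on-disk chunking to every node's variables
dt = open_datatree("example.zarr", engine="zarr", chunks={})
```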
Thanks @keewis .
> I believe the whole "`"auto"` actually means `{}` for `open_zarr`" situation is just confusing, so maybe we should aim to harmonize this in xarray (like, switch the default immediately and emit a deprecation warning if `"auto"` is passed).
100%. This kind of thing really trips up users. Do we have an open issue for that in xarray or should we make one now?
I think the "deprecate open_zarr" issue should be fine to reuse for this: pydata/xarray#7495
Thanks for the `chunks={}` tip! This is indeed the behaviour I expected.
This is really important when opening large chunked Zarr data with datatree while keeping the original chunks.
I updated my test notebook: https://github.com/etienneschalk/datatree-experimentation/blob/main/notebooks/bug-chunk-auto-not-considered.ipynb section "With chunks={} kwarg :ok:"