Lack of resilience towards missing `_ARRAY_DIMENSIONS`, xarray's special Zarr attribute #280
What is your issue?
Original issue: https://github.com/xarray-contrib/datatree/issues/280
Note: this issue description was generated from a notebook. You can use it to reproduce the bug locally.
Lack of resilience towards missing _ARRAY_DIMENSIONS, xarray's special Zarr attribute
from pathlib import Path
import json
from typing import Any
import numpy as np
import xarray as xr
Utilities
This section only declares utility functions and does not contain any additional value for the reader.
# Set to True to get rich HTML representations in an interactive Notebook session
# Set to False to get textual representations ready to be converted to markdown for issue report
INTERACTIVE = False
# Convert to markdown with
# jupyter nbconvert --to markdown notebooks/datatree-zarr.ipynb
def show(obj: Any) -> Any:
if isinstance(obj, Path):
if INTERACTIVE:
return obj.resolve()
else:
print(obj)
else:
if INTERACTIVE:
return obj
else:
print(obj)
def load_json(path: Path) -> dict:
with open(path, encoding="utf-8") as fp:
return json.load(fp)
Data Creation
I create a dummy Dataset containing a single (label, z)-dimensional DataArray named my_xda.
xda = xr.DataArray(
np.arange(3 * 18).reshape(3, 18),
coords={"label": list("abc"), "z": list(range(18))},
)
xda = xda.chunk({"label": 2, "z": 4})
show(xda)
<xarray.DataArray (label: 3, z: 18)>
dask.array<xarray-<this-array>, shape=(3, 18), dtype=int64, chunksize=(2, 4), chunktype=numpy.ndarray>
Coordinates:
* label (label) <U1 'a' 'b' 'c'
* z (z) int64 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17
xds = xr.Dataset({"my_xda": xda})
show(xds)
<xarray.Dataset>
Dimensions: (label: 3, z: 18)
Coordinates:
* label (label) <U1 'a' 'b' 'c'
* z (z) int64 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17
Data variables:
my_xda (label, z) int64 dask.array<chunksize=(2, 4), meta=np.ndarray>
Data Writing
I persist the Dataset to Zarr
zarr_path = Path() / "../generated/zarrounet.zarr"
xds.to_zarr(zarr_path, mode="w")
show(zarr_path)
../generated/zarrounet.zarr
Initial Data Reading
I successfully read the Dataset back.
show(xr.open_zarr(zarr_path).my_xda)
<xarray.DataArray 'my_xda' (label: 3, z: 18)>
dask.array<open_dataset-my_xda, shape=(3, 18), dtype=int64, chunksize=(2, 4), chunktype=numpy.ndarray>
Coordinates:
* label (label) <U1 'a' 'b' 'c'
* z (z) int64 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17
Data Alteration
Then, I alter the Zarr store by successively removing the _ARRAY_DIMENSIONS attribute from each variable's .zattrs (z, label, my_xda) and trying to reopen the Zarr. In all cases the read succeeds. ✔️
# corrupt xarray's `_ARRAY_DIMENSIONS` attribute on each variable
for varname in ("z/.zattrs", "label/.zattrs", "my_xda/.zattrs"):
zattrs_path = zarr_path / varname
assert zattrs_path.is_file()
zattrs_path.write_text("{}")
# Note: this has no impact; only the root .zmetadata seems to be used
show(xr.open_zarr(zarr_path))
<xarray.Dataset>
Dimensions: (label: 3, z: 18)
Coordinates:
* label (label) <U1 'a' 'b' 'c'
* z (z) int64 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17
Data variables:
my_xda (label, z) int64 dask.array<chunksize=(2, 4), meta=np.ndarray>
However, the last alteration, removing the _ARRAY_DIMENSIONS key-value pair from one of the variables in the .zmetadata file present at the root of the Zarr, results in an exception when reading. The error message is explicit: KeyError: '_ARRAY_DIMENSIONS' ❌
This means xarray cannot open arbitrary Zarr stores, but only those that possess xarray's special private attribute, _ARRAY_DIMENSIONS.
Because of these choices, Xarray cannot read arbitrary array data, but only Zarr data with valid _ARRAY_DIMENSIONS
See https://docs.xarray.dev/en/latest/internals/zarr-encoding-spec.html
As a first step, the error message could probably be more explicit (better than a low-level KeyError), explaining that xarray cannot yet open arbitrary Zarr data.
zmetadata_path = zarr_path / ".zmetadata"
assert zmetadata_path.is_file()
zmetadata = load_json(zmetadata_path)
zmetadata["metadata"]["z/.zattrs"] = {}
zmetadata_path.write_text(json.dumps(zmetadata, indent=4))
1925
show(xr.open_zarr(zarr_path))
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
File ~/.cache/pypoetry/virtualenvs/datatree-experimentation-Sa4oWCLA-py3.10/lib/python3.10/site-packages/xarray/backends/zarr.py:212, in _get_zarr_dims_and_attrs(zarr_obj, dimension_key, try_nczarr)
210 try:
211 # Xarray-Zarr
--> 212 dimensions = zarr_obj.attrs[dimension_key]
213 except KeyError as e:
File ~/.cache/pypoetry/virtualenvs/datatree-experimentation-Sa4oWCLA-py3.10/lib/python3.10/site-packages/zarr/attrs.py:73, in Attributes.__getitem__(self, item)
72 def __getitem__(self, item):
---> 73 return self.asdict()[item]
KeyError: '_ARRAY_DIMENSIONS'
During handling of the above exception, another exception occurred:
TypeError Traceback (most recent call last)
Cell In[11], line 1
----> 1 show(xr.open_zarr(zarr_path))
File ~/.cache/pypoetry/virtualenvs/datatree-experimentation-Sa4oWCLA-py3.10/lib/python3.10/site-packages/xarray/backends/zarr.py:900, in open_zarr(store, group, synchronizer, chunks, decode_cf, mask_and_scale, decode_times, concat_characters, decode_coords, drop_variables, consolidated, overwrite_encoded_chunks, chunk_store, storage_options, decode_timedelta, use_cftime, zarr_version, chunked_array_type, from_array_kwargs, **kwargs)
886 raise TypeError(
887 "open_zarr() got unexpected keyword arguments " + ",".join(kwargs.keys())
888 )
890 backend_kwargs = {
891 "synchronizer": synchronizer,
892 "consolidated": consolidated,
(...)
897 "zarr_version": zarr_version,
898 }
--> 900 ds = open_dataset(
901 filename_or_obj=store,
902 group=group,
903 decode_cf=decode_cf,
904 mask_and_scale=mask_and_scale,
905 decode_times=decode_times,
906 concat_characters=concat_characters,
907 decode_coords=decode_coords,
908 engine="zarr",
909 chunks=chunks,
910 drop_variables=drop_variables,
911 chunked_array_type=chunked_array_type,
912 from_array_kwargs=from_array_kwargs,
913 backend_kwargs=backend_kwargs,
914 decode_timedelta=decode_timedelta,
915 use_cftime=use_cftime,
916 zarr_version=zarr_version,
917 )
918 return ds
File ~/.cache/pypoetry/virtualenvs/datatree-experimentation-Sa4oWCLA-py3.10/lib/python3.10/site-packages/xarray/backends/api.py:573, in open_dataset(filename_or_obj, engine, chunks, cache, decode_cf, mask_and_scale, decode_times, decode_timedelta, use_cftime, concat_characters, decode_coords, drop_variables, inline_array, chunked_array_type, from_array_kwargs, backend_kwargs, **kwargs)
561 decoders = _resolve_decoders_kwargs(
562 decode_cf,
563 open_backend_dataset_parameters=backend.open_dataset_parameters,
(...)
569 decode_coords=decode_coords,
570 )
572 overwrite_encoded_chunks = kwargs.pop("overwrite_encoded_chunks", None)
--> 573 backend_ds = backend.open_dataset(
574 filename_or_obj,
575 drop_variables=drop_variables,
576 **decoders,
577 **kwargs,
578 )
579 ds = _dataset_from_backend_dataset(
580 backend_ds,
581 filename_or_obj,
(...)
591 **kwargs,
592 )
593 return ds
File ~/.cache/pypoetry/virtualenvs/datatree-experimentation-Sa4oWCLA-py3.10/lib/python3.10/site-packages/xarray/backends/zarr.py:982, in ZarrBackendEntrypoint.open_dataset(self, filename_or_obj, mask_and_scale, decode_times, concat_characters, decode_coords, drop_variables, use_cftime, decode_timedelta, group, mode, synchronizer, consolidated, chunk_store, storage_options, stacklevel, zarr_version)
980 store_entrypoint = StoreBackendEntrypoint()
981 with close_on_error(store):
--> 982 ds = store_entrypoint.open_dataset(
983 store,
984 mask_and_scale=mask_and_scale,
985 decode_times=decode_times,
986 concat_characters=concat_characters,
987 decode_coords=decode_coords,
988 drop_variables=drop_variables,
989 use_cftime=use_cftime,
990 decode_timedelta=decode_timedelta,
991 )
992 return ds
File ~/.cache/pypoetry/virtualenvs/datatree-experimentation-Sa4oWCLA-py3.10/lib/python3.10/site-packages/xarray/backends/store.py:43, in StoreBackendEntrypoint.open_dataset(self, filename_or_obj, mask_and_scale, decode_times, concat_characters, decode_coords, drop_variables, use_cftime, decode_timedelta)
29 def open_dataset( # type: ignore[override] # allow LSP violation, not supporting **kwargs
30 self,
31 filename_or_obj: str | os.PathLike[Any] | BufferedIOBase | AbstractDataStore,
(...)
39 decode_timedelta=None,
40 ) -> Dataset:
41 assert isinstance(filename_or_obj, AbstractDataStore)
---> 43 vars, attrs = filename_or_obj.load()
44 encoding = filename_or_obj.get_encoding()
46 vars, attrs, coord_names = conventions.decode_cf_variables(
47 vars,
48 attrs,
(...)
55 decode_timedelta=decode_timedelta,
56 )
File ~/.cache/pypoetry/virtualenvs/datatree-experimentation-Sa4oWCLA-py3.10/lib/python3.10/site-packages/xarray/backends/common.py:210, in AbstractDataStore.load(self)
188 def load(self):
189 """
190 This loads the variables and attributes simultaneously.
191 A centralized loading function makes it easier to create
(...)
207 are requested, so care should be taken to make sure its fast.
208 """
209 variables = FrozenDict(
--> 210 (_decode_variable_name(k), v) for k, v in self.get_variables().items()
211 )
212 attributes = FrozenDict(self.get_attrs())
213 return variables, attributes
File ~/.cache/pypoetry/virtualenvs/datatree-experimentation-Sa4oWCLA-py3.10/lib/python3.10/site-packages/xarray/backends/zarr.py:519, in ZarrStore.get_variables(self)
518 def get_variables(self):
--> 519 return FrozenDict(
520 (k, self.open_store_variable(k, v)) for k, v in self.zarr_group.arrays()
521 )
File ~/.cache/pypoetry/virtualenvs/datatree-experimentation-Sa4oWCLA-py3.10/lib/python3.10/site-packages/xarray/core/utils.py:471, in FrozenDict(*args, **kwargs)
470 def FrozenDict(*args, **kwargs) -> Frozen:
--> 471 return Frozen(dict(*args, **kwargs))
File ~/.cache/pypoetry/virtualenvs/datatree-experimentation-Sa4oWCLA-py3.10/lib/python3.10/site-packages/xarray/backends/zarr.py:520, in <genexpr>(.0)
518 def get_variables(self):
519 return FrozenDict(
--> 520 (k, self.open_store_variable(k, v)) for k, v in self.zarr_group.arrays()
521 )
File ~/.cache/pypoetry/virtualenvs/datatree-experimentation-Sa4oWCLA-py3.10/lib/python3.10/site-packages/xarray/backends/zarr.py:496, in ZarrStore.open_store_variable(self, name, zarr_array)
494 data = indexing.LazilyIndexedArray(ZarrArrayWrapper(name, self))
495 try_nczarr = self._mode == "r"
--> 496 dimensions, attributes = _get_zarr_dims_and_attrs(
497 zarr_array, DIMENSION_KEY, try_nczarr
498 )
499 attributes = dict(attributes)
501 # TODO: this should not be needed once
502 # https://github.com/zarr-developers/zarr-python/issues/1269 is resolved.
File ~/.cache/pypoetry/virtualenvs/datatree-experimentation-Sa4oWCLA-py3.10/lib/python3.10/site-packages/xarray/backends/zarr.py:222, in _get_zarr_dims_and_attrs(zarr_obj, dimension_key, try_nczarr)
220 # NCZarr defines dimensions through metadata in .zarray
221 zarray_path = os.path.join(zarr_obj.path, ".zarray")
--> 222 zarray = json.loads(zarr_obj.store[zarray_path])
223 try:
224 # NCZarr uses Fully Qualified Names
225 dimensions = [
226 os.path.basename(dim) for dim in zarray["_NCZARR_ARRAY"]["dimrefs"]
227 ]
File ~/.pyenv/versions/3.10.12/lib/python3.10/json/__init__.py:339, in loads(s, cls, object_hook, parse_float, parse_int, parse_constant, object_pairs_hook, **kw)
337 else:
338 if not isinstance(s, (bytes, bytearray)):
--> 339 raise TypeError(f'the JSON object must be str, bytes or bytearray, '
340 f'not {s.__class__.__name__}')
341 s = s.decode(detect_encoding(s), 'surrogatepass')
343 if (cls is None and object_hook is None and
344 parse_int is None and parse_float is None and
345 parse_constant is None and object_pairs_hook is None and not kw):
TypeError: the JSON object must be str, bytes or bytearray, not dict
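For reference, the same missing-attribute failure can be reproduced from scratch, without corrupting a store written by xarray: a store written with plain zarr-python (v2, as installed here) never gets _ARRAY_DIMENSIONS. A minimal sketch (the store path plain.zarr is arbitrary):

```python
import numpy as np
import xarray as xr
import zarr

# Write a group with plain zarr-python; no _ARRAY_DIMENSIONS is ever set.
group = zarr.open_group("plain.zarr", mode="w")
group.create_dataset("data", data=np.arange(6).reshape(2, 3))

# Opening it with xarray then fails for the same underlying reason:
# xr.open_zarr("plain.zarr")  # KeyError mentioning '_ARRAY_DIMENSIONS'
```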
xr.show_versions()
import warnings
warnings.filterwarnings("ignore")
xr.show_versions()
INSTALLED VERSIONS
------------------
commit: None
python: 3.10.12 (main, Aug 15 2023, 11:50:32) [GCC 9.4.0]
python-bits: 64
OS: Linux
OS-release: 5.15.0-92-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: ('en_US', 'UTF-8')
libhdf5: None
libnetcdf: None
xarray: 2023.10.1
pandas: 2.1.3
numpy: 1.25.2
scipy: None
netCDF4: None
pydap: None
h5netcdf: None
h5py: None
Nio: None
zarr: 2.16.1
cftime: None
nc_time_axis: None
PseudoNetCDF: None
iris: None
bottleneck: None
dask: 2023.11.0
distributed: None
matplotlib: None
cartopy: None
seaborn: None
numbagg: None
fsspec: 2023.10.0
cupy: None
pint: None
sparse: None
flox: None
numpy_groupies: None
setuptools: 67.8.0
pip: 23.1.2
conda: None
pytest: None
mypy: None
IPython: 8.17.2
sphinx: None
Thanks for (re-)raising this @etienneschalk.
This means xarray cannot open arbitrary Zarr stores, but only those that possess xarray's special private attribute, _ARRAY_DIMENSIONS.
Note that in zarr v3 this was formalized as `dimension_names`. But those are optional, so we still have the same problem.
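For illustration, per the Zarr v3 core spec, `dimension_names` is an optional top-level field of the array metadata document. An abridged sketch (illustrative only; not produced by the zarr 2.16.1 installed here):

```python
# Abridged Zarr v3 array metadata, per the v3 core spec (illustrative):
v3_array_metadata = {
    "zarr_format": 3,
    "node_type": "array",
    "shape": [3, 18],
    "data_type": "int64",
    "dimension_names": ["label", "z"],  # optional: may be omitted entirely
    # ... chunk_grid, chunk_key_encoding, codecs, fill_value, etc.
}
```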
As a first step, the error message could probably be more explicit (better than a low-level KeyError), explaining that xarray cannot yet open arbitrary Zarr data.
A better error message would be good.
The other option would be to automatically create default dimension names, like we do in the constructors:
In [7]: da = xr.DataArray(np.arange(12).reshape(3, 2, 2))
In [8]: da.dims
Out[8]: ('dim_0', 'dim_1', 'dim_2')
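A hypothetical sketch of that fallback (the helper name is made up; this is not xarray's actual code):

```python
def _dims_with_fallback(zarr_array) -> tuple[str, ...]:
    # Prefer xarray's special attribute when present...
    try:
        return tuple(zarr_array.attrs["_ARRAY_DIMENSIONS"])
    except KeyError:
        # ...otherwise synthesize dim_0..dim_{n-1}, mirroring the
        # DataArray constructor defaults shown above.
        return tuple(f"dim_{i}" for i in range(zarr_array.ndim))
```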
I think there is actually a similar but separate issue with another type of Zarr store that cannot be opened with xarray. My understanding is that Zarr doesn't require that arrays in the same group with shared dimension names actually have the same size along that dimension. This would break the model of xarray.Dataset for that group.
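For concreteness, such a store could be created with plain zarr-python (hypothetical sketch; the path and sizes are made up):

```python
import numpy as np
import zarr

# Plain Zarr happily stores two arrays in one group that both call
# their dimension "x" but disagree on its size, something a single
# xarray.Dataset cannot represent.
group = zarr.open_group("conflicting.zarr", mode="w")
a = group.create_dataset("a", data=np.arange(3))
a.attrs["_ARRAY_DIMENSIONS"] = ["x"]  # "x" has size 3 here...
b = group.create_dataset("b", data=np.arange(5))
b.attrs["_ARRAY_DIMENSIONS"] = ["x"]  # ...but size 5 here
```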
The other option would be to automatically create default dimension names, like we do in the constructors
Here are some examples:
(1) A Zarr group with two 1-dimensional variables with anonymous dimensions but the same positional dimension sizes should be openable with xarray, just like we can create this Dataset:
xr.Dataset({"xda_1": xr.DataArray([1]), "xda_2": xr.DataArray([2])})
<xarray.Dataset>
Dimensions: (dim_0: 1)
Dimensions without coordinates: dim_0
Data variables:
xda_1 (dim_0) int64 1
xda_2 (dim_0) int64 2
(2) A Zarr group with two 1-dimensional variables with anonymous dimensions and differing positional dimension sizes would not be openable with xarray. The error would be similar to trying to create this Dataset:
xr.Dataset({"xda_1": xr.DataArray([1]), "xda_2": xr.DataArray([1, 2])})
ValueError: cannot reindex or align along dimension 'dim_0' because of conflicting dimension sizes: {1, 2}
This error would be better than the KeyError: xarray would make its best effort to open the Zarr, but would not break its own data model.
To generalize:
Xarray could read Zarr groups by following the same rules used when building a Dataset from DataArrays with no explicit dimension names. Error handling would be delegated to the same logic that raises the ValueError above (removing logic duplication rather than re-implementing existing error handling); a sketch follows this list.
- If the dimensions of the Zarr group are anonymous, then xarray auto-assigns `dim_0`...`dim_n`. The only way for the Zarr to be readable is then to have ordered dimensions of the same size from left to right: the k-th dim must have the same size across all variables of the group. This is the same requirement as when passing multiple DataArrays without explicit dimension names to the Dataset constructor;
- If the dimensions of the Zarr group are named (with `_ARRAY_DIMENSIONS` or Zarr 3's `dimension_names`), then the Zarr store gets more freedom for ordering the dimensions (named rather than positional, to re-use the terminology from Indexing and Selecting data). However, all dimensions sharing the same name must be of the same size (xarray's constraint).
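Here is the sketch announced above: a hypothetical open_zarr_with_fallback (made-up name, not part of xarray) combining both rules:

```python
import xarray as xr
import zarr

def open_zarr_with_fallback(path: str) -> xr.Dataset:
    # Hypothetical reader: named dimensions when available, positional
    # dim_0..dim_{n-1} otherwise; size checks are delegated to Dataset.
    group = zarr.open_group(path, mode="r")
    variables = {}
    for name, arr in group.arrays():
        dims = arr.attrs.get(
            "_ARRAY_DIMENSIONS",  # Zarr v3 would use dimension_names
            [f"dim_{i}" for i in range(arr.ndim)],
        )
        variables[name] = xr.DataArray(arr[...], dims=dims)
    # Conflicting sizes surface as the same ValueError as in (2).
    return xr.Dataset(variables)
```

Applied to a store like (1), this would yield the two-variable Dataset; a store like (2) would fail with the ValueError above instead of the low-level KeyError.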
Do you think this makes sense?
I think there is actually a similar but separate issue with another type of zarr store that cannot be opened with xarray.
Do you have a link to the aforementioned issue? This would then be an expected behaviour according to the existing Dataset rules: this kind of Zarr would not be readable by xarray, as dimensions within a group are shared between its variables and must be of the same size.
To conclude, there is not a perfect match between the dimensions concept of xarray and that of Zarr. However, recycling the rules of the Dataset constructor to open Zarr groups might enlarge the space of "xarray-compatible" Zarrs, while preserving xarray's data model.