to_zarr removes global attributes in destination dataset
What happened?
Adding new variables to a zarr dataset with to_zarr() always removes the existing global attributes. New global attributes in the source dataset are not always added to the destination dataset depending on how to_zarr() is called.
What did you expect to happen?
I would expect that existing global attributes would always be preserved. If there are new global attributes I would expect them to be added to the existing global attributes instead of replacing all existing global attributes.
Minimal Complete Verifiable Example
import xarray as xr
from pyproj import CRS
local_zarr = 'sample.zarr'
ds_sample = xr.tutorial.load_dataset("air_temperature")
# Make a local copy
ds_sample.to_zarr(local_zarr, mode='w')
ds_sample = xr.open_dataset(local_zarr, engine='zarr',
backend_kwargs={'consolidated':True}, chunks={},
decode_coords=True)
# Create CRS metadata
crs_meta = CRS.from_epsg(4326).to_cf()
ds_new = xr.Dataset(data_vars={"crs": ([], 1, crs_meta)})
ds_new.attrs['note'] = 'please add this'
# Add all variables from ds_new to the zarr
# NOTE: This adds the new global attribute but also removes
# all existing global attributes
ds_new.to_zarr(local_zarr, mode='a')
# Add selected variable(s) to zarr dataset
# NOTE: This does not copy new global attributes
# and removes all existing global attributes
# ds_new['crs'].to_zarr(local_zarr, mode='a')
# Re-open local zarr store
ds_sample = xr.open_dataset(local_zarr, engine='zarr',
backend_kwargs={'consolidated':True}, chunks={},
decode_coords=True)
ds_sample
MVCE confirmation
- [X] Minimal example — the example is as focused as reasonably possible to demonstrate the underlying issue in xarray.
- [X] Complete example — the example is self-contained, including all data and the text of any traceback.
- [X] Verifiable example — the example copy & pastes into an IPython prompt or Binder notebook, returning the result.
- [X] New issue — a search of GitHub Issues suggests this is not a duplicate.
- [X] Recent environment — the issue occurs with the latest version of xarray and its dependencies.
Relevant log output
No response
Anything else we need to know?
No response
Environment
INSTALLED VERSIONS
commit: None python: 3.11.0 | packaged by conda-forge | (main, Jan 14 2023, 12:26:40) [Clang 14.0.6 ] python-bits: 64 OS: Darwin OS-release: 22.6.0 machine: arm64 processor: arm byteorder: little LC_ALL: None LANG: en_US.UTF-8 LOCALE: ('en_US', 'UTF-8') libhdf5: 1.12.2 libnetcdf: 4.8.1
xarray: 2024.1.1 pandas: 2.2.0 numpy: 1.26.4 scipy: 1.12.0 netCDF4: 1.6.0 pydap: installed h5netcdf: 1.3.0 h5py: 3.8.0 Nio: None zarr: 2.17.0 cftime: 1.6.3 nc_time_axis: None iris: None bottleneck: 1.3.7 dask: 2024.2.0 distributed: 2024.2.0 matplotlib: 3.8.2 cartopy: 0.22.0 seaborn: None numbagg: None fsspec: 2023.12.2 cupy: None pint: 0.23 sparse: None flox: None numpy_groupies: None setuptools: 69.0.3 pip: 24.0 conda: None pytest: 8.0.0 mypy: None IPython: 8.21.0 sphinx: None
Thanks for opening your first issue here at xarray! Be sure to follow the issue template! If you have an idea for a solution, we would really welcome a Pull Request with proposed changes. See the Contributing Guide for more. It may take us a while to respond here, but we really value your contribution. Contributors like you help make xarray better. Thank you!
I'm not a maintainer, but ... not sure if this is a bug:
https://docs.xarray.dev/en/stable/getting-started-guide/faq.html#what-is-your-approach-to-metadata
An implication of this choice is that we do not propagate attrs through most operations unless explicitly flagged (some methods have a keep_attrs option, and there is a global flag, accessible with xarray.set_options(), for setting this to be always True or False). Similarly, xarray does not check for conflicts between attrs when combining arrays and datasets, unless explicitly requested with the option compat='identical'. The guiding principle is that metadata should not be allowed to get in the way.
Sorry if I'm adding noise to the issue, just commenting because I'm looking now at using to_zarr with a DataArray that will have attrs
I believe this is a bug. The Modifying existing Zarr stores section of the documentation outlines ways to modify existing zarr stores and to limit what data is modified using mode, append_dim, and region. And yet dataset attrs is always overwritten which is unintuitive and is in fact propagating the local attrs to the saved file in an undesirable fashion. With mode='a', and I'm guessing also with mode='a-':
-
xarray.Dataset.to_zarr()overwrites the saved dataset attrs with its own, even when it has none. -
xarray.DataArray.to_zarr()overwrites the saved dataset attrs with its own, which is{}since it's aDataArrayand doesn't haveDatasetattrs.
What's the desired behaviour here? Never update zarr.Group.attrs when appending?
I'm not knowledgeable of the zarr API, but my desired behavior when appending is to not update the root (Dataset) attrs. @pnorton-usgs's example shows a use case where it seems they would like to append to and overwrite conflicts of root attrs, though this would require more processing and might not be desirable for use cases with large metadata.
However this is separate from dealing with variable (DataArray) attrs. When appending a new variable using mode='a' it is desired to have the added variables' attrs get added, as is the current case. Whereas when appending to or modifying an existing variable with mode='a-' and some combination of append_dim and region it is likely desirable to not touch the variables' attrs, though I'm not familiar with the status quo here.
Looking briefly over the code it seems like we'd just need to change https://github.com/pydata/xarray/blob/8a23e249d44b6e677600244efed4491a2749db73/xarray/backends/zarr.py#L650-L651
to include mode not in ["a", "a-"].
Variable attrs seems to be written here, which we can leave untouched: https://github.com/pydata/xarray/blob/8a23e249d44b6e677600244efed4491a2749db73/xarray/backends/zarr.py#L779
I think not updating Group attrs when appending sounds right. It is consistent with the Array writing behaviour (it seems like).
Handling conflicts seems a bit too much complexity.
cc @rabernat @slevang @jhamman for opinions
With mode="a", I'm not sure this does match the array behavior. If you don't set append_dim, then array values are happily overwritten. So it seems to me like overwriting dataset attrs is at least sort of consistent? With mode="a" / append_dim, then the array op is more like an outer join, so in that case you might expect some conflict handling.
Personally I find mode="a" pretty confusing overall and try to avoid it whenever possible.
@slevang the difference is with mode="a" only included arrays are overwritten, while xarray.Dataset attrs (i.e. root attrs) are always overwritten, even when empty.
Personally I find mode="a" pretty confusing
For an xarray.Dataset with multiple variables/arrays mode="a" is very useful. For instance when you have an analysis pipeline producing different independent variables/arrays at different times, or if you want to modify one step in the pipeline and then recompute and save just that output, it allows you to append/modify whole arrays easily.
When the file only has one variable/array, such as after performing xarray.DataArray.to_zarr(), mode="a" is essentially the same as mode="w".
Yeah I think that's a reasonable distinction.
Totally agree mode="a" is very useful, my point is just that we try to push a lot of different (and potentially surprising) functionality into it. You can append along a dimension, append new variables, overwrite existing variables, etc. You can use it with append_dim or region. Then this issue raises the question of attribute handling, and array vs dataset attrs. This is all a lot of stuff packed into one option for new (or experienced!) users to figure out.
Agree 💯 @slevang.
Seems like what is needed is a more nuanced Zarr insert / update / upsert / merge utility with a richer syntax.
Your mention of merge gets me thinking in terms of the standard xarray ops every knows well. mode="a" basically covers what xr.merge does, and between compat, join, and combine_attrs you have great power to enumerate all the potential combination options. mode="r+" is like direct array assignment ds[x][:10, :] = .... The trick with moving the API in this direction is that a lot of these type of comparison ops have significant overhead when the data is stored on disk so we want to preserve "fast path" optimizations more conducive to i/o.