xarray icon indicating copy to clipboard operation
xarray copied to clipboard

to_zarr removes global attributes in destination dataset

Open pnorton-usgs opened this issue 1 year ago • 12 comments

What happened?

Adding new variables to a zarr dataset with to_zarr() always removes the existing global attributes. New global attributes in the source dataset are not always added to the destination dataset depending on how to_zarr() is called.

What did you expect to happen?

I would expect that existing global attributes would always be preserved. If there are new global attributes I would expect them to be added to the existing global attributes instead of replacing all existing global attributes.

Minimal Complete Verifiable Example

import xarray as xr
from pyproj import CRS

local_zarr = 'sample.zarr'

ds_sample = xr.tutorial.load_dataset("air_temperature")

# Make a local copy
ds_sample.to_zarr(local_zarr, mode='w')
ds_sample = xr.open_dataset(local_zarr, engine='zarr', 
                            backend_kwargs={'consolidated':True}, chunks={}, 
                            decode_coords=True)

# Create CRS metadata
crs_meta = CRS.from_epsg(4326).to_cf()

ds_new = xr.Dataset(data_vars={"crs": ([], 1, crs_meta)})
ds_new.attrs['note'] = 'please add this'

# Add all variables from ds_new to the zarr
# NOTE: This adds the new global attribute but also removes
#       all existing global attributes
ds_new.to_zarr(local_zarr, mode='a')

# Add selected variable(s) to zarr dataset
# NOTE: This does not copy new global attributes 
#       and removes all existing global attributes
# ds_new['crs'].to_zarr(local_zarr, mode='a')

# Re-open local zarr store
ds_sample = xr.open_dataset(local_zarr, engine='zarr', 
                            backend_kwargs={'consolidated':True}, chunks={}, 
                            decode_coords=True)

ds_sample

MVCE confirmation

  • [X] Minimal example — the example is as focused as reasonably possible to demonstrate the underlying issue in xarray.
  • [X] Complete example — the example is self-contained, including all data and the text of any traceback.
  • [X] Verifiable example — the example copy & pastes into an IPython prompt or Binder notebook, returning the result.
  • [X] New issue — a search of GitHub Issues suggests this is not a duplicate.
  • [X] Recent environment — the issue occurs with the latest version of xarray and its dependencies.

Relevant log output

No response

Anything else we need to know?

No response

Environment

INSTALLED VERSIONS

commit: None python: 3.11.0 | packaged by conda-forge | (main, Jan 14 2023, 12:26:40) [Clang 14.0.6 ] python-bits: 64 OS: Darwin OS-release: 22.6.0 machine: arm64 processor: arm byteorder: little LC_ALL: None LANG: en_US.UTF-8 LOCALE: ('en_US', 'UTF-8') libhdf5: 1.12.2 libnetcdf: 4.8.1

xarray: 2024.1.1 pandas: 2.2.0 numpy: 1.26.4 scipy: 1.12.0 netCDF4: 1.6.0 pydap: installed h5netcdf: 1.3.0 h5py: 3.8.0 Nio: None zarr: 2.17.0 cftime: 1.6.3 nc_time_axis: None iris: None bottleneck: 1.3.7 dask: 2024.2.0 distributed: 2024.2.0 matplotlib: 3.8.2 cartopy: 0.22.0 seaborn: None numbagg: None fsspec: 2023.12.2 cupy: None pint: 0.23 sparse: None flox: None numpy_groupies: None setuptools: 69.0.3 pip: 24.0 conda: None pytest: 8.0.0 mypy: None IPython: 8.21.0 sphinx: None

pnorton-usgs avatar Feb 15 '24 15:02 pnorton-usgs

Thanks for opening your first issue here at xarray! Be sure to follow the issue template! If you have an idea for a solution, we would really welcome a Pull Request with proposed changes. See the Contributing Guide for more. It may take us a while to respond here, but we really value your contribution. Contributors like you help make xarray better. Thank you!

welcome[bot] avatar Feb 15 '24 15:02 welcome[bot]

I'm not a maintainer, but ... not sure if this is a bug:
https://docs.xarray.dev/en/stable/getting-started-guide/faq.html#what-is-your-approach-to-metadata

An implication of this choice is that we do not propagate attrs through most operations unless explicitly flagged (some methods have a keep_attrs option, and there is a global flag, accessible with xarray.set_options(), for setting this to be always True or False). Similarly, xarray does not check for conflicts between attrs when combining arrays and datasets, unless explicitly requested with the option compat='identical'. The guiding principle is that metadata should not be allowed to get in the way.

Sorry if I'm adding noise to the issue, just commenting because I'm looking now at using to_zarr with a DataArray that will have attrs

NickleDave avatar Feb 24 '24 14:02 NickleDave

I believe this is a bug. The Modifying existing Zarr stores section of the documentation outlines ways to modify existing zarr stores and to limit what data is modified using mode, append_dim, and region. And yet dataset attrs is always overwritten which is unintuitive and is in fact propagating the local attrs to the saved file in an undesirable fashion. With mode='a', and I'm guessing also with mode='a-':

  • xarray.Dataset.to_zarr() overwrites the saved dataset attrs with its own, even when it has none.
  • xarray.DataArray.to_zarr() overwrites the saved dataset attrs with its own, which is {} since it's a DataArray and doesn't have Dataset attrs.

elyall avatar Apr 24 '24 19:04 elyall

What's the desired behaviour here? Never update zarr.Group.attrs when appending?

dcherian avatar Apr 24 '24 19:04 dcherian

I'm not knowledgeable of the zarr API, but my desired behavior when appending is to not update the root (Dataset) attrs. @pnorton-usgs's example shows a use case where it seems they would like to append to and overwrite conflicts of root attrs, though this would require more processing and might not be desirable for use cases with large metadata.

However this is separate from dealing with variable (DataArray) attrs. When appending a new variable using mode='a' it is desired to have the added variables' attrs get added, as is the current case. Whereas when appending to or modifying an existing variable with mode='a-' and some combination of append_dim and region it is likely desirable to not touch the variables' attrs, though I'm not familiar with the status quo here.

elyall avatar Apr 24 '24 20:04 elyall

Looking briefly over the code it seems like we'd just need to change https://github.com/pydata/xarray/blob/8a23e249d44b6e677600244efed4491a2749db73/xarray/backends/zarr.py#L650-L651 to include mode not in ["a", "a-"].

Variable attrs seems to be written here, which we can leave untouched: https://github.com/pydata/xarray/blob/8a23e249d44b6e677600244efed4491a2749db73/xarray/backends/zarr.py#L779

elyall avatar Apr 25 '24 18:04 elyall

I think not updating Group attrs when appending sounds right. It is consistent with the Array writing behaviour (it seems like).

Handling conflicts seems a bit too much complexity.

cc @rabernat @slevang @jhamman for opinions

dcherian avatar Apr 25 '24 21:04 dcherian

With mode="a", I'm not sure this does match the array behavior. If you don't set append_dim, then array values are happily overwritten. So it seems to me like overwriting dataset attrs is at least sort of consistent? With mode="a" / append_dim, then the array op is more like an outer join, so in that case you might expect some conflict handling.

Personally I find mode="a" pretty confusing overall and try to avoid it whenever possible.

slevang avatar Apr 27 '24 18:04 slevang

@slevang the difference is with mode="a" only included arrays are overwritten, while xarray.Dataset attrs (i.e. root attrs) are always overwritten, even when empty.

Personally I find mode="a" pretty confusing

For an xarray.Dataset with multiple variables/arrays mode="a" is very useful. For instance when you have an analysis pipeline producing different independent variables/arrays at different times, or if you want to modify one step in the pipeline and then recompute and save just that output, it allows you to append/modify whole arrays easily.

When the file only has one variable/array, such as after performing xarray.DataArray.to_zarr(), mode="a" is essentially the same as mode="w".

elyall avatar Apr 29 '24 16:04 elyall

Yeah I think that's a reasonable distinction.

Totally agree mode="a" is very useful, my point is just that we try to push a lot of different (and potentially surprising) functionality into it. You can append along a dimension, append new variables, overwrite existing variables, etc. You can use it with append_dim or region. Then this issue raises the question of attribute handling, and array vs dataset attrs. This is all a lot of stuff packed into one option for new (or experienced!) users to figure out.

slevang avatar Apr 29 '24 17:04 slevang

Agree 💯 @slevang.

Seems like what is needed is a more nuanced Zarr insert / update / upsert / merge utility with a richer syntax.

rabernat avatar Apr 29 '24 17:04 rabernat

Your mention of merge gets me thinking in terms of the standard xarray ops every knows well. mode="a" basically covers what xr.merge does, and between compat, join, and combine_attrs you have great power to enumerate all the potential combination options. mode="r+" is like direct array assignment ds[x][:10, :] = .... The trick with moving the API in this direction is that a lot of these type of comparison ops have significant overhead when the data is stored on disk so we want to preserve "fast path" optimizations more conducive to i/o.

slevang avatar Apr 29 '24 19:04 slevang