xarray icon indicating copy to clipboard operation
xarray copied to clipboard

assign_coords' behavior depends on input DataArray's

Open yichechang opened this issue 2 years ago • 7 comments

What happened?

I'm trying to compute masks (from DataArray's data itself) and assign them as coordinates, but it appears that depending on the combination of coords/dims of the computed masks, sometimes .assign_coords will fail.

It seems like

  • it fails when all the mask DataArray's (each is a mask computed, but it probably doesn't matter) to be assigned as coordinates, share a dimension common to the target DataArray, and the dimension contains only a singular value (across all mask DataArray's)
  • it doesn't fail when the shared dimension contains more than one value.

It's a bit hard to describe as I don't know the xarray internal itself, but my self-contained minimal example below should demonstrate the issue much clearer.

What did you expect to happen?

No response

Minimal Complete Verifiable Example

import xarray as xr

data = xr.DataArray(
    data=[
        [0, 1, 2], 
        [0, 1, 2]
    ], 
    coords={
        'd1': ['m', 'n'], 
        'd2': ['a', 'b', 'c']
    }
)


# this will fail:
data.assign_coords({'mask_d1_m': data.sel(d1='m')==0})
# ValueError: dimension 'd1' already exists as a scalar variable

# this will fail too:
data.assign_coords({'mask_d1_n': data.sel(d1='n')==0})
# ValueError: dimension 'd1' already exists as a scalar variable

# but this will work:
data.assign_coords(
    {
        'mask_d1_m': data.sel(d1='m')==0,
        'mask_d1_n': data.sel(d1='n')==0
    }
)
# <xarray.DataArray (d1: 2, d2: 3)>
# array([[0, 1, 2],
#        [0, 1, 2]])
# Coordinates:
#   * d1         (d1) <U1 'm' 'n'
#   * d2         (d2) <U1 'a' 'b' 'c'
#     mask_d1_m  (d2) bool True False False
#     mask_d1_n  (d2) bool True False False

MVCE confirmation

  • [X] Minimal example — the example is as focused as reasonably possible to demonstrate the underlying issue in xarray.
  • [X] Complete example — the example is self-contained, including all data and the text of any traceback.
  • [X] Verifiable example — the example copy & pastes into an IPython prompt or Binder notebook, returning the result.
  • [X] New issue — a search of GitHub Issues suggests this is not a duplicate.

Relevant log output

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[27], line 1
----> 1 data.assign_coords({'mask_d1_n': data.sel(d1='n')==0})

File ~/mambaforge/envs/quickquant/lib/python3.8/site-packages/xarray/core/common.py:615, in DataWithCoords.assign_coords(self, coords, **coords_kwargs)
    613 data = self.copy(deep=False)
    614 results: dict[Hashable, Any] = self._calc_assign_results(coords_combined)
--> 615 data.coords.update(results)
    616 return data

File ~/mambaforge/envs/quickquant/lib/python3.8/site-packages/xarray/core/coordinates.py:177, in Coordinates.update(self, other)
    173 self._maybe_drop_multiindex_coords(set(other_vars))
    174 coords, indexes = merge_coords(
    175     [self.variables, other_vars], priority_arg=1, indexes=self.xindexes
    176 )
--> 177 self._update_coords(coords, indexes)

File ~/mambaforge/envs/quickquant/lib/python3.8/site-packages/xarray/core/coordinates.py:393, in DataArrayCoordinates._update_coords(self, coords, indexes)
    391 coords_plus_data = coords.copy()
    392 coords_plus_data[_THIS_ARRAY] = self._data.variable
--> 393 dims = calculate_dimensions(coords_plus_data)
    394 if not set(dims) <= set(self.dims):
    395     raise ValueError(
    396         "cannot add coordinates with new dimensions to a DataArray"
    397     )

File ~/mambaforge/envs/quickquant/lib/python3.8/site-packages/xarray/core/variable.py:3209, in calculate_dimensions(variables)
   3207 for dim, size in zip(var.dims, var.shape):
   3208     if dim in scalar_vars:
-> 3209         raise ValueError(
   3210             f"dimension {dim!r} already exists as a scalar variable"
   3211         )
   3212     if dim not in dims:
   3213         dims[dim] = size

ValueError: dimension 'd1' already exists as a scalar variable

Anything else we need to know?

No response

Environment

~/mambaforge/envs/quickquant/lib/python3.8/site-packages/_distutils_hack/__init__.py:33: UserWarning: Setuptools is replacing distutils. warnings.warn("Setuptools is replacing distutils.")

INSTALLED VERSIONS

commit: None python: 3.8.17 | packaged by conda-forge | (default, Jun 16 2023, 07:11:32) [Clang 14.0.6 ] python-bits: 64 OS: Darwin OS-release: 22.6.0 machine: arm64 processor: arm byteorder: little LC_ALL: en_US.UTF-8 LANG: None LOCALE: ('en_US', 'UTF-8') libhdf5: None libnetcdf: None

xarray: 2023.1.0 pandas: 1.5.3 numpy: 1.24.0 scipy: 1.10.1 netCDF4: None pydap: None h5netcdf: None h5py: None Nio: None zarr: 2.15.0 cftime: None nc_time_axis: None PseudoNetCDF: None rasterio: None cfgrib: None iris: None bottleneck: None dask: 2023.5.0 distributed: 2023.5.0 matplotlib: 3.7.2 cartopy: None seaborn: 0.12.2 numbagg: None fsspec: 2023.9.0 cupy: None pint: 0.21 sparse: None flox: None numpy_groupies: None setuptools: 68.1.2 pip: 23.2.1 conda: 23.7.3 pytest: 7.4.1 mypy: None IPython: 8.12.2 sphinx: 4.5.0

yichechang avatar Sep 13 '23 21:09 yichechang

Thanks for opening your first issue here at xarray! Be sure to follow the issue template! If you have an idea for a solution, we would really welcome a Pull Request with proposed changes. See the Contributing Guide for more. It may take us a while to respond here, but we really value your contribution. Contributors like you help make xarray better. Thank you!

welcome[bot] avatar Sep 13 '23 21:09 welcome[bot]

Thanks for the issue. What would you expect the output to be?

It does seem surprising that passing two arguments succeeds while passing each of them alone succeeds...

A partial look — adding .drop_vars('d1') makes it succeed, because the argument has a d1 dimension.

data.sel(d1='m')==0

<xarray.DataArray (d2: 3)>
array([ True, False, False])
Coordinates:
    d1       <U1 'm'
  * d2       (d2) <U1 'a' 'b' 'c'
[nav] In [36]: data.assign_coords(
          ...:     {
          ...:         'mask_d1_m': (data.sel(d1='n')==0).drop_vars('d1')
          ...: #        'mask_d1_n': data.sel(d1='n')==0
          ...:     }
          ...: )

...though I'm not sure why it succeeds when there are two arguments.

max-sixty avatar Sep 14 '23 03:09 max-sixty

Hi - Thanks for the reply and testing!

I guess I would expect that all my examples would all fail, or all succeed.

Sorry I didn't include what I would expect... because I didn't really know what to expect since I haven't thought about why it failed. (But it seems like it should fail, as now it became clearer to me that my mask_d1_m should not have dimension d1?). From the error message, I would guess that doing .drop_vars('d1') would fix the problem (which is true as you demonstrated above!), but I couldn't wrap my head around why a DataArray of dim ('d1', 'd2') cannot be passed as coordinates for DataArray of dim ('d1', 'd2')?

Edit: Also just wanted to point out that it's not just having two arguments that prevents it from failing, it has to be that d1 dim has both d1='m' and d1='n'.

import xarray as xr

data = xr.DataArray(
    data=[
        [0, 1, 2], 
        [0, 1, 2]
    ], 
    coords={
        'd1': ['m', 'n'], 
        'd2': ['a', 'b', 'c']
    }
)


# so this will FAIL
data.assign_coords(
    {
        'mask_d1_m': data.sel(d1='m')==0,
#                                ^^^
        'mask_d1_n': data.sel(d1='m')==1,
#                                ^^^
    }
)

# but this will SUCCEED
data.assign_coords(
    {
        'mask_d1_m': data.sel(d1='m')==0,
#                                ^^^
        'mask_d1_n': data.sel(d1='n')==1,
#                                ^^^
    }
)

yichechang avatar Sep 14 '23 04:09 yichechang

Just wanted to give an update: For my own application (computing masks for a DataArray using its own data, and then attach resulting masks to the DataArray as its new coords), I should make sure my masks contain only labels for its dimensions, in the first place. So, something like

# compute mask based on my data
# mask = data.sel(d1=...) > whatever

# remove labels for extra dimensions
mask = mask.drop_vars([v for v in mask.coords if v not in mask.dims]

# assign mask as coords for the original DataArray
data = data.assign_coords({'my_mask': mask})

But the inconsistent behaviors still seem like a bug, or there's some magic happening during the process of combining multiple coords assignments, that are not explicitly documented. It is not too much of an issue, as

  1. the error messages kind of give enough hint on how to do things correctly, and
  2. it's more so just a surprise that I can quickly tell myself that "okay, I'm doing something wrong, but there was some magic that hide this when I assign coords in a certain way. fix it and move on."

Therefore, I'll leave this issue open but please feel free to close it if this isn't something to be fixed. From my point of view, the inconsistency is surprising, but not a major issue. Thank you for taking the time testing!

yichechang avatar Sep 16 '23 20:09 yichechang

as far as I can tell, the cause for the surprising/inconsistent behavior is that xr.core.coordinates.create_coords_with_default_indexes (or xr.core.merge.merge_coordinates_without_align, I didn't step through create_coords_with_default_indexes) drops scalar coordinates with conflicting values. cc @benbovy

In your case, I think the easiest way to work around this is to use .sel(..., drop=True) which will not keep the scalar coordinate.

keewis avatar Sep 17 '23 18:09 keewis

Hi @keewis - Thank you for the explanation!

Yeah, I was digging into the codebase a little bit, but unfortunately -- as it is probably evident that I'm not that familiar with xarray's internals -- I was a bit lost. Not that I completely understand now, but I am grateful for an explanation so I know I'm not crazy 🤣

Also thank you for the recommended route with the drop kwarg in DataArray.sel method. This is more succinct and can convey my intention more explicitly. Good to learn how to do things correctly... I think I need to sit down and re-learn xarray. I've been using it for a long time, but usually I can get it to do what I want while knowing there must be something I'm still missing.

Really appreciate all of your help : )

yichechang avatar Sep 17 '23 18:09 yichechang

Since we are relaxing the constraints that are related to dimension coordinates (e.g., #7989), I'm wondering if we couldn't also relax the case where a scalar coordinate has the same name as a dimension.

I don't think that this would help much here, though. Using .assign_coords with a dictionary of DataArray objects extracts all their coordinates and tries to merge them with the current Dataset or DataArray. In many cases this is good but sometimes this could have unwanted side effects. Using drop may help, or alternatively we can ignore all the coordinates (and keep the dimension names) by extracting the variable from the DataArray:

data.assign_coords({'mask_d1_m': (data.sel(d1='m')==0).variable})

benbovy avatar Sep 18 '23 00:09 benbovy