Rework PandasMultiIndex.sel internals
- [x] Closes #6838
- [ ] Tests added
- [ ] User visible changes (including notable bug fixes) are documented in whats-new.rst
This PR hopefully improves how labels provided for multi-index level coordinates are handled in .sel().
More specifically, slices are handled in a cleaner way and it is now allowed to provide array-like labels.
PandasMultiIndex.sel() relies on the underlying pandas.MultiIndex methods like this:
- use get_loc when all levels are provided with each a scalar label (no slice, no array)
  - always drops the index and returns scalar coordinates for each multi-index level
- use get_loc_level when only a subset of levels are provided with scalar labels only
  - may collapse one or more levels of the multi-index (dropped levels result in scalar coordinates)
  - if only one level remains: renames the dimension and the corresponding dimension coordinate
- use get_locs for all other cases.
  - always keeps the multi-index and its coordinates (even if only one item or one level is selected)
This yields a predictable behavior: as soon as one of the provided labels is a slice or array-like, the multi-index and all its level coordinates are kept in the result.
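The dispatch rules above can be sketched with plain pandas. This is only an illustration, not the code in this PR: `pick_method` is a hypothetical helper, and using `np.isscalar` to tell scalar labels from slices and arrays is an assumption made for this sketch.

```python
import numpy as np
import pandas as pd

midx = pd.MultiIndex.from_product([list("abc"), range(4)], names=("one", "two"))


def pick_method(labels, level_names):
    """Hypothetical helper: name the pandas method the rules above would select.

    `labels` maps level names to the labels passed to .sel().
    """
    # Assumption: anything that is not a NumPy scalar (slice, list, array)
    # counts as a "non-scalar" label.
    if not all(np.isscalar(v) for v in labels.values()):
        return "get_locs"
    if set(labels) == set(level_names):
        return "get_loc"
    return "get_loc_level"


# All levels given as scalars: a single integer position.
loc_all = midx.get_loc(("a", 0))

# Only one level given as a scalar: an indexer plus the remaining level(s).
indexer, remaining = midx.get_loc_level("a", level="one")
# Normalize the indexer (slice or boolean mask) to integer positions.
positions_one = np.arange(len(midx))[indexer]

# At least one slice: an array of integer positions, all levels kept.
locs = midx.get_locs([slice("a", "b"), slice(None)])
```

With the `midx` used in this description, `pick_method({"one": "a", "two": slice(1, 1)}, midx.names)` returns `"get_locs"`, which matches the case where the multi-index is kept even though a single item is selected.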
Some cases are illustrated below (I compare this PR with an older release because of the errors reported in #6838):
import xarray as xr
import pandas as pd
midx = pd.MultiIndex.from_product([list("abc"), range(4)], names=("one", "two"))
ds = xr.Dataset(coords={"x": midx})
# <xarray.Dataset>
# Dimensions: (x: 12)
# Coordinates:
# * x (x) object MultiIndex
# * one (x) object 'a' 'a' 'a' 'a' 'b' 'b' 'b' 'b' 'c' 'c' 'c' 'c'
# * two (x) int64 0 1 2 3 0 1 2 3 0 1 2 3
# Data variables:
# *empty*
ds.sel(one="a", two=0)
# this PR
#
# <xarray.Dataset>
# Dimensions: ()
# Coordinates:
# x object ('a', 0)
# one <U1 'a'
# two int64 0
# Data variables:
# *empty*
#
# v2022.3.0
#
# <xarray.Dataset>
# Dimensions: ()
# Coordinates:
# x object ('a', 0)
# Data variables:
# *empty*
#
ds.sel(one="a")
# this PR:
#
# <xarray.Dataset>
# Dimensions: (two: 4)
# Coordinates:
# * two (two) int64 0 1 2 3
# one <U1 'a'
# Data variables:
# *empty*
#
# v2022.3.0
#
# <xarray.Dataset>
# Dimensions: (two: 4)
# Coordinates:
# * two (two) int64 0 1 2 3
# Data variables:
# *empty*
#
ds.sel(one=slice("a", "b"))
# this PR
#
# <xarray.Dataset>
# Dimensions: (x: 8)
# Coordinates:
# * x (x) object MultiIndex
# * one (x) object 'a' 'a' 'a' 'a' 'b' 'b' 'b' 'b'
# * two (x) int64 0 1 2 3 0 1 2 3
# Data variables:
# *empty*
#
# v2022.3.0
#
# <xarray.Dataset>
# Dimensions: (two: 8)
# Coordinates:
# * two (two) int64 0 1 2 3 0 1 2 3
# Data variables:
# *empty*
#
ds.sel(one="a", two=slice(1, 1))
# this PR
#
# <xarray.Dataset>
# Dimensions: (x: 1)
# Coordinates:
# * x (x) object MultiIndex
# * one (x) object 'a'
# * two (x) int64 1
# Data variables:
# *empty*
#
# v2022.3.0
#
# <xarray.Dataset>
# Dimensions: (x: 1)
# Coordinates:
# * x (x) MultiIndex
# - one (x) object 'a'
# - two (x) int64 1
# Data variables:
# *empty*
#
ds.sel(one=["b", "c"], two=[0, 2])
# this PR
#
# <xarray.Dataset>
# Dimensions: (x: 4)
# Coordinates:
# * x (x) object MultiIndex
# * one (x) object 'b' 'b' 'c' 'c'
# * two (x) int64 0 2 0 2
# Data variables:
# *empty*
#
# v2022.3.0
#
# ValueError: Vectorized selection is not available along coordinate 'one' (multi-index level)
#
> it is now allowed to provide array-like labels.
Hmm, not sure if it's a good idea... I find get_locs() a bit confusing, as in the example below, where a 4-label array for level "one" returns a 3-item integer location array:
import numpy as np
# is the 3rd label ("b") ignored?
midx.get_locs((np.array(["b", "a", "b", "c"]), 0))
# array([4, 0, 8])
That differs too much from the vectorized selection based on single pandas indexes...
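To make the contrast concrete, here is a small sketch that puts the two behaviors side by side. It assumes that `pd.Index.get_indexer` is a fair stand-in for label lookup on a single pandas index; the variable names are only illustrative.

```python
import numpy as np
import pandas as pd

labels = np.array(["b", "a", "b", "c"])

# Single index: one position per requested label, duplicates preserved.
single = pd.Index(list("abc"), name="one")
positions = single.get_indexer(labels)  # array([1, 0, 1, 2])

# Multi-index: the repeated "b" contributes only one location, so the result
# has 3 items for 4 labels (their order may depend on the pandas version).
midx = pd.MultiIndex.from_product([list("abc"), range(4)], names=("one", "two"))
locs = midx.get_locs((labels, 0))

print(len(labels), len(positions), len(locs))
```

So the result of get_locs() cannot be mapped back one-to-one onto the input label array, which is what vectorized selection would need.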
Fancy indexing with n-d label arrays doesn't work either:
midx.get_locs((np.array([["a", "a"], ["a", "a"]]), 0))
# InvalidIndexError: [['a' 'a']
# ['a' 'a']]
And providing Variable or DataArray objects as labels would make things even harder, unless we ignore their dimension names and coordinates (but then it wouldn't be consistent with vectorized selection based on single pandas indexes).
Probably not worth it then?
It would be nice to be able to preserve the MultiIndex with sel (e.g. ds.sel(one=["a"])), but if it makes the behavior inconsistent, that is no good either...