Rework PandasMultiIndex.sel internals

Open benbovy opened this issue 3 years ago • 2 comments

  • [x] Closes #6838
  • [ ] Tests added
  • [ ] User visible changes (including notable bug fixes) are documented in whats-new.rst

This PR hopefully improves how labels provided for multi-index level coordinates are handled in .sel().

More specifically, slices are handled in a cleaner way and it is now allowed to provide array-like labels.

PandasMultiIndex.sel() relies on the underlying pandas.MultiIndex methods as follows:

  • use get_loc when every level is given a scalar label (no slice, no array)
    • always drops the index and returns scalar coordinates for each multi-index level
  • use get_loc_level when only a subset of the levels is given labels, all of them scalar
    • may collapse one or more levels of the multi-index (dropped levels result in scalar coordinates)
    • if only one level remains: renames the dimension and the corresponding dimension coordinate
  • use get_locs for all other cases.
    • always keeps the multi-index and its coordinates (even if only one item or one level is selected)
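
The three pandas lookups named above can be tried directly on a small index. This is a minimal sketch at the pandas level only, not the code of this PR; it reuses the same product index as the examples further down. The form of the indexer returned by get_loc_level (slice or boolean mask) may depend on the index and the pandas version, so it is normalized to integer positions here.

```python
import numpy as np
import pandas as pd

midx = pd.MultiIndex.from_product([list("abc"), range(4)], names=("one", "two"))

# 1. Every level gets a scalar label: get_loc returns a single position.
loc = midx.get_loc(("a", 0))

# 2. Only some levels get scalar labels: get_loc_level returns an indexer
#    plus a new index from which the selected level has been dropped.
indexer, remaining = midx.get_loc_level("a", level="one")
positions = np.arange(len(midx))[indexer]  # works for a slice or a boolean mask

# 3. Any slice or list-like label: get_locs returns integer positions,
#    which can be used to take a subset that is still a MultiIndex.
locs = midx.get_locs((slice("a", "b"), [0, 2]))
subset = midx[locs]

print(loc)                 # position of ("a", 0)
print(list(remaining))     # remaining "two" labels
print(list(subset.names))  # both level names are kept
```

Only the third call hands back something that still carries both levels, which is what lets .sel() keep the multi-index in that case.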

This yields predictable behavior: as soon as one of the provided labels is a slice or array-like, the multi-index and all its level coordinates are kept in the result.
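
The rule can be written down as a small decision function. The sketch below is purely illustrative: pick_lookup and is_scalar are hypothetical names, not part of xarray or of this PR, and the scalar test is an assumption (strings count as scalars, while any other iterable, including tuples, counts as array-like).

```python
def pick_lookup(labels, level_names):
    """Return the name of the pandas.MultiIndex method to use (hypothetical helper).

    labels: dict mapping level name -> label (scalar, slice or list-like)
    level_names: all level names of the multi-index
    """
    def is_scalar(label):
        if isinstance(label, slice):
            return False
        # strings and bytes are iterable but count as scalar labels here
        if isinstance(label, (str, bytes)):
            return True
        return not hasattr(label, "__iter__")

    if not all(is_scalar(v) for v in labels.values()):
        return "get_locs"       # multi-index and all level coordinates kept
    if set(labels) == set(level_names):
        return "get_loc"        # index dropped, scalar coordinates only
    return "get_loc_level"      # selected levels collapse to scalars


levels = ("one", "two")
print(pick_lookup({"one": "a", "two": 0}, levels))              # get_loc
print(pick_lookup({"one": "a"}, levels))                        # get_loc_level
print(pick_lookup({"one": "a", "two": slice(1, 1)}, levels))    # get_locs
print(pick_lookup({"one": ["b", "c"], "two": [0, 2]}, levels))  # get_locs
```

Checking for non-scalar labels first is what makes the outcome predictable: a single slice or list anywhere in the selection is enough to route to get_locs.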

Some cases are illustrated below (I compare this PR with an older release because of the errors reported in #6838):

import xarray as xr
import pandas as pd

midx = pd.MultiIndex.from_product([list("abc"), range(4)], names=("one", "two"))
ds = xr.Dataset(coords={"x": midx})    
# <xarray.Dataset>
# Dimensions:  (x: 12)
# Coordinates:
#   * x        (x) object MultiIndex
#   * one      (x) object 'a' 'a' 'a' 'a' 'b' 'b' 'b' 'b' 'c' 'c' 'c' 'c'
#   * two      (x) int64 0 1 2 3 0 1 2 3 0 1 2 3
# Data variables:
#     *empty*
ds.sel(one="a", two=0)

# this PR
#
# <xarray.Dataset>
# Dimensions:  ()
# Coordinates:
#     x        object ('a', 0)
#     one      <U1 'a'
#     two      int64 0
# Data variables:
#     *empty*
# 

# v2022.3.0
# 
# <xarray.Dataset>
# Dimensions:  ()
# Coordinates:
#     x        object ('a', 0)
# Data variables:
#     *empty*
# 
ds.sel(one="a")

# this PR:
#
# <xarray.Dataset>
# Dimensions:  (two: 4)
# Coordinates:
#  * two      (two) int64 0 1 2 3
#    one      <U1 'a'
# Data variables:
#    *empty*
#

# v2022.3.0
# 
# <xarray.Dataset>
# Dimensions:  (two: 4)
# Coordinates:
#   * two      (two) int64 0 1 2 3
# Data variables:
#     *empty*
# 
ds.sel(one=slice("a", "b"))

# this PR
# 
# <xarray.Dataset>
# Dimensions:  (x: 8)
# Coordinates:
#   * x        (x) object MultiIndex
#   * one      (x) object 'a' 'a' 'a' 'a' 'b' 'b' 'b' 'b'
#   * two      (x) int64 0 1 2 3 0 1 2 3
# Data variables:
#     *empty*
# 

# v2022.3.0
# 
# <xarray.Dataset>
# Dimensions:  (two: 8)
# Coordinates:
#   * two      (two) int64 0 1 2 3 0 1 2 3
# Data variables:
#     *empty*
# 
ds.sel(one="a", two=slice(1, 1))

# this PR
# 
# <xarray.Dataset>
# Dimensions:  (x: 1)
# Coordinates:
#   * x        (x) object MultiIndex
#   * one      (x) object 'a'
#   * two      (x) int64 1
# Data variables:
#     *empty*
# 

# v2022.3.0
# 
# <xarray.Dataset>
# Dimensions:  (x: 1)
# Coordinates:
#   * x        (x) MultiIndex
#   - one      (x) object 'a'
#   - two      (x) int64 1
# Data variables:
#     *empty*
# 
ds.sel(one=["b", "c"], two=[0, 2])

# this PR
# 
# <xarray.Dataset>
# Dimensions:  (x: 4)
# Coordinates:
#   * x        (x) object MultiIndex
#   * one      (x) object 'b' 'b' 'c' 'c'
#   * two      (x) int64 0 2 0 2
# Data variables:
#     *empty*
# 

# v2022.3.0
# 
# ValueError: Vectorized selection is not available along coordinate 'one' (multi-index level)
# 

benbovy • Sep 07 '22 14:09

> it is now allowed to provide array-like labels.

Hmm, not sure if it's a good idea... I find get_locs() a bit confusing, as in the example below, where a 4-label array for level "one" returns a 3-item integer location array:

import numpy as np

# is the 3rd label ("b") ignored?
midx.get_locs((np.array(["b", "a", "b", "c"]), 0))
# array([4, 0, 8])

That differs too much from the vectorized selection based on single pandas indexes...
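
To make the difference concrete, here is a small pandas-only comparison. It is a sketch for illustration, not part of the PR: the single index is a made-up three-label pd.Index, and Index.get_indexer stands in for the lookup behind vectorized selection on a single pandas index.

```python
import numpy as np
import pandas as pd

labels = np.array(["b", "a", "b", "c"])

# Single index: get_indexer returns one position per requested label,
# so the repeated "b" shows up twice in the result.
single = pd.Index(["a", "b", "c"])
single_pos = single.get_indexer(labels)

# Multi-index: get_locs returns each matching location once,
# so the repeated "b" does not produce a second entry.
midx = pd.MultiIndex.from_product([list("abc"), range(4)], names=("one", "two"))
multi_pos = midx.get_locs((labels, 0))

print(len(labels), len(single_pos), len(multi_pos))
```

The order of the positions returned by get_locs is deliberately not checked here, since it may vary between pandas versions; only the number of results is compared.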

Fancy indexing with n-d label arrays doesn't work either:

midx.get_locs((np.array([["a", "a"], ["a", "a"]]), 0))
# InvalidIndexError: [['a' 'a']
#  ['a' 'a']]

And providing Variable or DataArray objects as labels would make things even harder, unless we ignore their dimension names and coordinates (but then it wouldn't be consistent with vectorized selection based on single pandas indexes).

Probably not worth it then?

benbovy • Sep 08 '22 09:09

It would be nice to be able to preserve the MultiIndex with sel (e.g. ds.sel(one=["a"])), but if it makes the behavior inconsistent it is no good either...

mathause • Sep 22 '22 20:09