Add "unique()" method, mimicking pandas

Open ahuang11 opened this issue 6 years ago • 6 comments

Would it be good to add a unique() method that mimics pandas?

import pandas as pd
import xarray as xr
pd.Series([0, 1, 1, 2]).unique()
xr.DataArray([0, 1, 1, 2]).unique()  # not implemented

Output:

array([0, 1, 2])
AttributeError: 'DataArray' object has no attribute 'unique'

ahuang11, Feb 28 '19 18:02

What would .unique() return on xarray.DataArray? For consistency with pandas, I guess it would return a 1D numpy or dask array?

I don't see a lot of value in adding this to xarray, given that all the xarray metadata gets lost by the unique() operation. You might as well just write np.unique(my_data_array.data).
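As a minimal sketch of that suggestion, assuming a small made-up 2D array (the names `da` and `unique_values` are illustrative only), the result is a plain sorted 1D numpy array without any of the dimension names or coordinates:

```python
import numpy as np
import xarray as xr

# Illustrative 2D array with named dimensions and a coordinate.
da = xr.DataArray(
    [[0, 1], [1, 2]],
    dims=("x", "y"),
    coords={"x": [10, 20]},
    name="counts",
)

# np.unique flattens its input and returns a sorted 1D numpy array;
# the dims, coords and name of `da` are not carried over.
unique_values = np.unique(da.data)
print(unique_values)        # [0 1 2]
print(type(unique_values))  # <class 'numpy.ndarray'>
```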

shoyer, Mar 04 '19 07:03

Right, it would return a 1D numpy or dask array.

I suppose I'm used to simply typing pd.Series().unique() rather than np.unique(pd.Series()).

I use it primarily in for loops:

for season in da['time.season'].unique():

versus

for season in np.unique(da['time.season'].data):
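For reference, the np.unique form of that loop might look like the following sketch. The monthly series and the `seasonal_means` dictionary are made up for illustration; only the loop header comes from the comment above.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Made-up monthly series covering one year (values 0..11 for Jan..Dec).
time = pd.date_range("2019-01-01", periods=12, freq="MS")
da = xr.DataArray(np.arange(12.0), dims="time", coords={"time": time})

seasonal_means = {}
# np.unique returns the season labels sorted alphabetically.
for season in np.unique(da["time.season"].data):
    in_season = da["time.season"] == season
    # Keep only the months belonging to this season, then average them.
    seasonal_means[str(season)] = float(da.where(in_season, drop=True).mean())

print(seasonal_means)
```

Note that the labels come back in sorted order (DJF, JJA, MAM, SON) rather than calendar order, because np.unique sorts its output.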

ahuang11, Mar 05 '19 00:03

Hi, I also vote for this function. My typical use case:

There is some structure in 3D space and I need to "flatten" it to 2D. Let us say it is axially symmetric, so I assign R and Z coordinates to the points (or r and theta in polar coordinates). I want to simplify this using interp; however, it requires unique coordinates.

I have some solution here: https://stackoverflow.com/questions/51058379/drop-duplicate-times-in-xarray

and adapted this into an actual function:

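import numpy as np

def distribure_uniform(ds, N_points=512):
    # Sort by theta and make it the dimension in place of "idx".
    ds_theta = ds.sortby("theta").swap_dims({"idx": "theta"})
    # Index of the first occurrence of each distinct theta value.
    _, index = np.unique(ds_theta['theta'], return_index=True)

    # Drop the duplicates, since interp requires unique coordinates.
    ds_theta = ds_theta.isel(theta=index)

    # Resample onto N_points evenly spaced theta values.
    ds_theta = ds_theta.interp(
        theta=np.linspace(ds.theta.min(), ds.theta.max(), N_points))

    # Swap the dimension back to "idx".
    ds_theta = ds_theta.swap_dims({"theta": "idx"})
    return ds_theta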
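A toy run of the de-duplication step used above may make the behavior concrete. The dataset below is made up for illustration (an "idx" dimension with a "theta" coordinate that contains one repeated value), and the interpolation step is left out to keep the sketch short:

```python
import numpy as np
import xarray as xr

# Toy dataset (illustrative only): theta = 1.0 appears twice.
ds = xr.Dataset(
    {"value": ("idx", [10.0, 20.0, 21.0, 40.0])},
    coords={"theta": ("idx", [0.0, 1.0, 1.0, 3.0])},
)

# Sort by theta and make it the dimension in place of "idx".
ds_theta = ds.sortby("theta").swap_dims({"idx": "theta"})

# return_index=True gives the position of the first occurrence
# of each distinct theta value.
_, index = np.unique(ds_theta["theta"].values, return_index=True)

# Selecting those positions keeps one entry per theta value.
deduped = ds_theta.isel(theta=index)

print(deduped["theta"].values)  # [0. 1. 3.]
print(deduped["value"].values)  # [10. 20. 40.]
```

Which of the duplicated entries survives is a choice made by this approach: here the first one after sorting is kept and the later one (21.0) is dropped.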

In an ideal case I would like to write something like this:

def distribure_uniform(ds, N_points=512):

    ds_theta = ds.unique("theta", sorted=False, sort=True)

    ds_theta = ds_theta.swap_dims({"idx": "theta"})
    ds_theta = ds_theta.interp(
        theta=np.linspace(ds.theta.min(), ds.theta.max(), N_points))
    ds_theta = ds_theta.swap_dims({"theta": "idx"})
    return ds_theta

kripnerl, Oct 16 '20 11:10

A case I ran into where supporting .unique() in the pandas sense would be helpful is when an object dtype is used to support nullable strings:

>>> ar = xr.DataArray(np.array(['foo', np.nan], dtype='object'), coords={'bar': range(2)}, name='foo')
>>> np.unique(ar.data)
TypeError: '<' not supported between instances of 'float' and 'str'
>>> ar.to_dataframe().foo.unique()
array(['foo', nan], dtype=object)

aaronsarna, Jan 08 '24 16:01

Actually, pd.unique(ar) also works fine here, so maybe there's no need to add it to xarray.

aaronsarna, Jan 08 '24 16:01

I guess the limitation on using pd.unique() is that it requires 1D data. pd.unique(ar.data.flatten()) isn't so painful, but that feels like the kind of thing xarray should do for you.
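A short sketch of that workaround on 2D data, assuming a made-up object-dtype array (the variable names and the "x"/"y" dimension names are illustrative only):

```python
import numpy as np
import pandas as pd
import xarray as xr

# Illustrative 2D object-dtype array mixing strings and a missing value.
ar = xr.DataArray(
    np.array([["foo", np.nan], ["bar", "foo"]], dtype="object"),
    dims=("x", "y"),
    name="foo",
)

# pd.unique needs 1D input, so flatten the underlying numpy array first.
# It does not sort, so str and float values are never compared with "<",
# and the result keeps the order of first appearance.
unique_values = pd.unique(ar.data.flatten())
print(unique_values)
```

Unlike np.unique, the result here is in order of first appearance ('foo', then nan, then 'bar') rather than sorted.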

aaronsarna, Jan 08 '24 17:01