Implicit alignment in operations

Open TomAugspurger opened this issue 5 years ago • 10 comments

In https://github.com/pydata-apis/dataframe-api/issues/2 there seems to be some agreement that row-labels are an important component of a dataframe. Pandas takes this a step further by using them for alignment in many operations involving multiple dataframes.

In [10]: a = pd.DataFrame({"A": [1, 2, 3]}, index=['a', 'b', 'c'])

In [11]: b = pd.DataFrame({"A": [2, 3, 1]}, index=['b', 'c', 'a'])

In [12]: a
Out[12]:
   A
a  1
b  2
c  3

In [13]: b
Out[13]:
   A
b  2
c  3
a  1

In [14]: a + b
Out[14]:
   A
a  2
b  4
c  6

In the background there's an implicit a.align(b), which reindexes the dataframes to a common index. The resulting index will be the union of the two indices.
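
To make that implicit step visible, here is a minimal sketch that calls align by hand before adding. It assumes a recent pandas release; the names a_aligned, b_aligned, explicit and implicit are only illustrative.

```python
import pandas as pd

a = pd.DataFrame({"A": [1, 2, 3]}, index=["a", "b", "c"])
b = pd.DataFrame({"A": [2, 3, 1]}, index=["b", "c", "a"])

# Reindex both frames to the union of their row and column labels.
a_aligned, b_aligned = a.align(b, join="outer")

# After alignment both operands carry the same labels in the same order,
# so the addition is plain element-wise arithmetic.
explicit = a_aligned + b_aligned

# The operator performs the same alignment behind the scenes.
implicit = a + b
```

If explicit and implicit compare equal, the operator did nothing beyond the align call followed by an element-wise add.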

A few other places where this occurs:

  • Indexing a DataFrame / Series with an integer or boolean series
  • pd.concat
  • DataFrame constructor
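
A small sketch of those three cases, assuming a recent pandas release. All variable names are illustrative; the labels are deliberately given in a different order on each operand so that label-based and positional matching would produce different results.

```python
import pandas as pd

s = pd.Series([10, 20, 30], index=["a", "b", "c"])

# Boolean Series indexer: matched to `s` by label, not by position.
mask = pd.Series([False, True, True], index=["c", "b", "a"])
selected = s[mask]  # keeps labels "a" and "b"

left = pd.Series([1, 2], index=["a", "b"], name="left")
right = pd.Series([20, 10], index=["b", "a"], name="right")

# pd.concat along the column axis lines the rows up by label.
wide = pd.concat([left, right], axis=1)

# The DataFrame constructor does the same for a dict of Series.
built = pd.DataFrame({"left": left, "right": right})
```

In each case the value 10 from right ends up next to label "a", even though it sits in second position in the input.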

Do we want to adopt this behavior for the standard?

TomAugspurger avatar Jun 05 '20 17:06 TomAugspurger

> In #2 there seems to be some agreement that row-labels are an important component of a dataframe.

Eh, just to make sure, can you summarize that agreement? As far as I can see you suggested that including row labels was inappropriate, and @devin-petersohn was in favor but also noted that columnar dataframes like Vaex and R tidyverse do not support row labels. Hence my impression was that row labels should be optional or a "level 1" feature.

rgommers avatar Jun 05 '20 20:06 rgommers

@rgommers I do not believe that the presence of row labels affects this conversation directly because a + b could be reasonably done with the position instead of row labels, and in that case the position is the row's label. Side note: let's figure out this "levels" thing. It is hard to have meaningful conversations without the concrete levels.

@TomAugspurger This brings up an interesting discussion about joining/manipulating the row labels (or order) and how that interacts with the data in a dataframe.

If we dissect the a + b operation, we are effectively doing a join along both axes, and adding (or another binary operation) on label collisions in both axes. This is a bit unusual from the database perspective, but it can be done (though it is tedious).
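
The "join along both axes, then operate on collisions" reading can be sketched in pandas as follows. This is an illustration that assumes unique labels on both axes, not a description of how pandas implements the operator internally; rows, cols, manual and implicit are illustrative names.

```python
import pandas as pd

a = pd.DataFrame({"A": [1, 2], "B": [3, 4]}, index=["x", "y"])
b = pd.DataFrame({"B": [10, 20], "C": [30, 40]}, index=["y", "z"])

# Outer join of the labels along each axis.
rows = a.index.union(b.index)
cols = a.columns.union(b.columns)

# Reindex both operands onto the joined labels; cells that exist on only
# one side become NaN, so only true label collisions produce a value.
manual = a.reindex(index=rows, columns=cols) + b.reindex(index=rows, columns=cols)

# The operator form of the same computation.
implicit = a + b
```

Here the only collision on both axes is row "y", column "B", so that is the only cell holding a number (4 + 10); every other cell is NaN.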

So there is this ability to treat labels as data that can be joined on, or to manipulate the order of the rows with an align-style join. It's a very nice property for visualization.

devin-petersohn avatar Jun 06 '20 14:06 devin-petersohn

Re-reading #2, it does seem that I overstated the level of support for row labels.

As Devin notes, there's a positional version of label alignment:

In [9]: a = pd.Series([1.0, 2.0])

In [10]: b = pd.Series([1.0, 2.0, 3.0])

In [11]: a
Out[11]:
0    1.0
1    2.0
dtype: float64

In [12]: b
Out[12]:
0    1.0
1    2.0
2    3.0
dtype: float64

In [13]: a + b
Out[13]:
0    2.0
1    4.0
2    NaN
dtype: float64

So do we expect that operation to raise (different shapes) or align (by position)? My recommendation would be to align.

TomAugspurger avatar Jun 08 '20 13:06 TomAugspurger

This might be a silly question, but I guess we agree/assume that there's alignment for column names, right? There's also the question of whether to raise on misalignment, drop the mismatched columns, or create NaN columns...

amueller avatar Jun 08 '20 15:06 amueller

We'll want to explicitly state the expected behavior for column names. I'd expect it to match the behavior for row labels.

TomAugspurger avatar Jun 08 '20 16:06 TomAugspurger

For reference, it seems like vaex raises when the lengths of the "columns" (expressions) don't match:

In [77]: df1 = vaex.from_dict({"A": [1, 2, 3]})

In [78]: df1.A[:2] + df1.A
---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
<ipython-input-78-df2975e4c3c9> in <module>
----> 1 df1.A[:2] + df1.A

~/miniconda3/envs/vaex/lib/python3.8/site-packages/vaex/expression.py in f(a, b)
    111                     else:
    112                         if isinstance(b, Expression):
--> 113                             assert b.ds == a.ds
    114                             b = b.expression
    115                         elif isinstance(b, (np.timedelta64)):

AssertionError:

> This might be a silly question but I guess we agree/assume that there's alignment for column names, right

It seems like this doesn't come up for vaex, which AFAICT doesn't implement binary operators (can you confirm that @maartenbreddels?)

Do other dataframe implementors want to weigh in on what's desired / feasible here?

From Dask's perspective, alignment is doable. We partition by divisions on the index. When those divisions aren't available a full shuffle is needed to do the operation.

TomAugspurger avatar Jun 09 '20 21:06 TomAugspurger

To structure the discussion a bit, this is how I see the different components of what is being discussed here (with an example):

>>> import pandas
>>> df = pandas.DataFrame({'country': ['France', 'USA', 'UK'],
...                        'capital': ['Paris', 'DC', 'London']})
>>> df
  country capital
0  France   Paris
1     USA      DC
2      UK  London

Basic case, same size, same index, in the same order (I guess whatever we do, it will work):

>>> df['country'] + ' - ' + df['capital']
0    France - Paris
1          USA - DC
2       UK - London
dtype: object

Same size and index, but index in different order. With row labels and automatic alignment, what we have is:

>>> df['country'] + ' - ' + df['capital'].sort_values()
0    France - Paris
1          USA - DC
2       UK - London
dtype: object

Without row labels (or without automatic alignment), I guess we would operate by row id, and rely on sorting for the alignment, e.g. df.sort_values('country_id'), where country_id stands for some key column not shown in the example.

When the size of the dataframes is different, with automatic alignment, pandas fills with NA after aligning, and then operates:

>>> df['country'] + ' - ' + df[df.capital.str.len() > 3]['capital']
0    France - Paris
1               NaN
2       UK - London
dtype: object

Without row labels, I guess the best solution would probably be to fail if the sizes differ, and rely on a join / reindex to force the user to make the alignment explicit, with something like df1 + df1.join(on='country_id', how='left').
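
A rough sketch of that explicit style, written with pandas only for familiarity. The country_id key column, the two example frames and the strict_lengths helper are all hypothetical and are not part of any proposed API.

```python
import pandas as pd

countries = pd.DataFrame(
    {"country_id": [1, 2, 3], "country": ["France", "USA", "UK"]}
)
capitals = pd.DataFrame(
    {"country_id": [3, 1], "capital": ["London", "Paris"]}
)


def strict_lengths(left, right):
    """Hypothetical check: refuse to combine columns of different length."""
    if len(left) != len(right):
        raise ValueError(f"length mismatch: {len(left)} != {len(right)}")


# The user states the alignment explicitly: a left join on the key column.
joined = countries.merge(capitals, on="country_id", how="left")

# Both columns now have the same length and order, so a purely positional
# operation is safe. Rows without a match carry a missing capital.
strict_lengths(joined["country"], joined["capital"])
labels = joined["country"] + " - " + joined["capital"]
```

The missing value for USA is still there, but it now comes from a join the user asked for rather than from alignment hidden inside the + operator.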

So, correct me if I'm wrong, but I think the decisions that need to be made regarding alignment are:

  • Do we want row labels?
  • Do we want automatic alignment?
  • Do we want to automatically create NA rows if the index values don't match?

datapythonista avatar Jun 19 '20 14:06 datapythonista

Thanks for the summary Marc. I think your three bullets perfectly capture the three levels to this issue.

I suppose there might be one more question: Do we leave binary operations between DataFrame objects out of the spec entirely? That sidesteps the issue of row labels & alignment. And if we do allow binary operations between

  • DataFrame & scalars
  • DataFrame & arrays (where an array is an unlabeled column of a dataframe. Only requirement is that the shape is compatible.)

then perhaps there isn't much of a loss in functionality?
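
For concreteness, a sketch of what those two remaining cases look like in today's pandas, assuming a recent pandas and NumPy. The variable names are illustrative, and the mismatch case shows current pandas behavior rather than anything the standard has decided.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]})

# DataFrame & scalar: no labels are involved at all.
doubled = df * 2

# Column & unlabeled array: elements are matched purely by position.
offsets = np.array([10, 20, 30])
shifted = df["A"] + offsets

# An array with an incompatible shape is rejected instead of being
# padded with NaN, since there are no labels to align on.
try:
    df["A"] + np.array([10, 20])
    mismatch_raised = False
except ValueError:
    mismatch_raised = True
```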

TomAugspurger avatar Jun 22 '20 14:06 TomAugspurger

> For reference, it seems like vaex raises when the lengths of the "column" (expression) don't match

> It seems like this doesn't come up for vaex, which AFAICT doesn't implement binary operators (can you confirm that @maartenbreddels?)

Yes, basically vaex does not have row labels, so both operations do not make sense in the current state. There is a branch which lets the dataframe behave like a 2d array (nep13/nep18), meaning implicit row labels that are row numbers. In the case of a binary operator it will ignore the column labels, and only use the column index, similar to a 2d array.
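
To illustrate the difference between the two behaviors, here is a sketch written with pandas and NumPy rather than vaex, since the branch's API is not shown in this thread. Dropping to raw arrays stands in for the "behave like a 2d array" mode; all names are illustrative.

```python
import pandas as pd

left = pd.DataFrame({"A": [1, 2], "B": [3, 4]})
right = pd.DataFrame({"X": [10, 20], "Y": [30, 40]})

# 2d-array style: column labels are ignored and columns are paired by
# position (first with first, second with second).
positional = pd.DataFrame(
    left.to_numpy() + right.to_numpy(), columns=left.columns
)

# Label-aligned style (pandas default): no column names collide, so the
# result has the union of the columns and every cell is NaN.
by_label = left + right
```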

maartenbreddels avatar Jun 25 '20 16:06 maartenbreddels