Implicit alignment in operations
In https://github.com/pydata-apis/dataframe-api/issues/2 there seems to be some agreement that row-labels are an important component of a dataframe. Pandas takes this a step further by using them for alignment in many operations involving multiple dataframes.
In [10]: a = pd.DataFrame({"A": [1, 2, 3]}, index=['a', 'b', 'c'])
In [11]: b = pd.DataFrame({"A": [2, 3, 1]}, index=['b', 'c', 'a'])
In [12]: a
Out[12]:
   A
a  1
b  2
c  3

In [13]: b
Out[13]:
   A
b  2
c  3
a  1

In [14]: a + b
Out[14]:
   A
a  2
b  4
c  6
In the background there's an implicit a.align(b), which reindexes the dataframes to a common index. The resulting index will be the union of the two indices.
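To make the implicit step visible, here is a small sketch that calls align explicitly and checks that the result matches a + b. It reuses the a and b frames from above; the names a_aligned, b_aligned, explicit and implicit are only for illustration.

```python
import pandas as pd

a = pd.DataFrame({"A": [1, 2, 3]}, index=["a", "b", "c"])
b = pd.DataFrame({"A": [2, 3, 1]}, index=["b", "c", "a"])

# DataFrame.align returns both inputs reindexed to a shared set of labels
# (an outer join, i.e. the union of the labels, by default).
a_aligned, b_aligned = a.align(b)

# After alignment both frames carry identical row labels ...
same_index = a_aligned.index.equals(b_aligned.index)

# ... so the addition itself is a plain element-wise operation.
explicit = a_aligned + b_aligned
implicit = a + b
print(explicit.equals(implicit))
```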
A few other places this occurs:
- Indexing a DataFrame / Series with an integer or boolean series
- pd.concat
- DataFrame constructor
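For two of these cases, a short sketch of what the alignment looks like in pandas. The series x and y and the result names are made up for the example.

```python
import pandas as pd

x = pd.Series([1, 2], index=["a", "b"], name="x")
y = pd.Series([10, 20], index=["b", "c"], name="y")

# The DataFrame constructor aligns the Series on the union of their labels;
# positions with no value are filled with NaN.
from_constructor = pd.DataFrame({"x": x, "y": y})

# pd.concat along the columns (axis=1) aligns the same way with its
# default outer join.
from_concat = pd.concat([x, y], axis=1)

print(from_constructor)
print(from_concat)
```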
Do we want to adopt this behavior for the standard?
In #2 there seems to be some agreement that row-labels are an important component of a dataframe.
Eh, just to make sure, can you summarize that agreement? As far as I can see you suggested that including row labels was inappropriate, and @devin-petersohn was in favor but also noted that columnar dataframes like Vaex and R tidyverse do not support row labels. Hence my impression was that row labels should be optional or a "level 1" feature.
@rgommers I do not believe that the presence of row labels affects this conversation directly because a + b could be reasonably done with the position instead of row labels, and in that case the position is the row's label. Side note: let's figure out this "levels" thing. It is hard to have meaningful conversations without the concrete levels.
@TomAugspurger This brings up an interesting discussion about joining/manipulating the row labels (or order) and how that interacts with the data in a dataframe.
If we dissect the a + b operation, we are effectively doing a join along both axes, and adding (or another binary operation) on label collisions in both axes. This is a bit unusual from the database perspective, but it can be done (though it is tedious).
So there is this ability to treat labels as data that can be joined on, or to manipulate the order of the rows with an align-style join. It's a very nice property for visualization.
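As a rough sketch of that decomposition, a + b can be spelled out as an outer join on the row labels followed by an addition on the colliding column labels. The helper add_via_join is hypothetical (it is not a pandas API), assumes string column names, and only handles the simple case of unique labels.

```python
import pandas as pd

def add_via_join(left, right):
    """Hypothetical helper: emulate `left + right` with an explicit join."""
    # Join along the rows: outer join on the row labels.
    joined = left.join(right, how="outer", lsuffix="_left", rsuffix="_right")
    out = {}
    # Join along the columns: walk the union of the column labels.
    for col in left.columns.union(right.columns):
        if col in left.columns and col in right.columns:
            # Label collision: apply the binary operation.
            out[col] = joined[f"{col}_left"] + joined[f"{col}_right"]
        else:
            # Column present on one side only: the result is all NaN.
            out[col] = pd.Series(float("nan"), index=joined.index)
    return pd.DataFrame(out)

left = pd.DataFrame({"A": [1, 2, 3], "B": [1, 1, 1]}, index=["a", "b", "c"])
right = pd.DataFrame({"A": [10, 20, 30], "C": [0, 0, 0]}, index=["b", "c", "d"])
result = add_via_join(left, right)
print(result)
```

The amount of code needed for a single + gives a feel for the "tedious" part.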
Re-reading #2, it does seem that I overstated the level of support for row labels.
As Devin notes, there's a positional version of label alignment:
In [9]: a = pd.Series([1.0, 2.0])
In [10]: b = pd.Series([1.0, 2.0, 3.0])
In [11]: a
Out[11]:
0    1.0
1    2.0
dtype: float64

In [12]: b
Out[12]:
0    1.0
1    2.0
2    3.0
dtype: float64

In [13]: a + b
Out[13]:
0    2.0
1    4.0
2    NaN
dtype: float64
So do we expect that operation to raise (different shapes) or align (by position)? My recommendation would be to align.
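To make the two options concrete, here is a minimal pure-Python sketch that models unlabeled columns as lists. The function add_by_position and its on_mismatch parameter are hypothetical names used only to contrast the policies; nothing here is proposed API.

```python
import math

def add_by_position(left, right, on_mismatch="align"):
    """Add two unlabeled columns (plain lists) position by position.

    on_mismatch="raise": refuse columns of different length.
    on_mismatch="align": pad the shorter column with NaN, as pandas does above.
    """
    if on_mismatch not in ("align", "raise"):
        raise ValueError(f"unknown policy: {on_mismatch!r}")
    if on_mismatch == "raise" and len(left) != len(right):
        raise ValueError(f"length mismatch: {len(left)} != {len(right)}")
    n = max(len(left), len(right))
    padded_left = list(left) + [math.nan] * (n - len(left))
    padded_right = list(right) + [math.nan] * (n - len(right))
    return [x + y for x, y in zip(padded_left, padded_right)]

aligned = add_by_position([1.0, 2.0], [1.0, 2.0, 3.0])
print(aligned)  # the last element is NaN
```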
This might be a silly question, but I guess we agree/assume that there's alignment for column names, right? There's also the question of whether, on misalignment, to raise, drop the columns, or create NaN columns.
We'll want to explicitly state the expected behavior for column names. I'd expect it to match the behavior for row labels.
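For reference, this is what pandas currently does for column names. The frames left and right are made up for the example.

```python
import pandas as pd

left = pd.DataFrame({"A": [1, 2], "B": [3, 4]})
right = pd.DataFrame({"B": [10, 20], "C": [30, 40]})

# Columns are aligned by name: the result carries the union of the column
# labels, and a column present in only one operand comes out as all NaN.
total = left + right
print(total)
```

So of the three options (raise, drop, NaN columns), pandas takes the last one, which mirrors what it does for row labels.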
On Mon, Jun 8, 2020 at 10:58 AM Andreas Mueller [email protected] wrote:
This might be a silly question, but I guess we agree/assume that there's alignment for column names, right? There's also the question of whether, on misalignment, to raise, drop the columns, or create NaN columns.
For reference, it seems that vaex raises when the lengths of the "columns" (expressions) don't match:
In [77]: df1 = vaex.from_dict({"A": [1, 2, 3]})
In [78]: df1.A[:2] + df1.A
---------------------------------------------------------------------------
AssertionError Traceback (most recent call last)
<ipython-input-78-df2975e4c3c9> in <module>
----> 1 df1.A[:2] + df1.A
~/miniconda3/envs/vaex/lib/python3.8/site-packages/vaex/expression.py in f(a, b)
111 else:
112 if isinstance(b, Expression):
--> 113 assert b.ds == a.ds
114 b = b.expression
115 elif isinstance(b, (np.timedelta64)):
AssertionError:
This might be a silly question, but I guess we agree/assume that there's alignment for column names, right?
It seems like this doesn't come up for vaex, which AFAICT doesn't implement binary operators (can you confirm that @maartenbreddels?)
Do other dataframe implementors want to weigh in on what's desired / feasible here?
From Dask's perspective, alignment is doable. We partition by divisions on the index. When those divisions aren't available a full shuffle is needed to do the operation.
Trying to structure the discussion a bit, this is how I see the different components of what is being discussed here (with an example):
>>> import pandas
>>> df = pandas.DataFrame({'country': ['France', 'USA', 'UK'],
...                        'capital': ['Paris', 'DC', 'London']})
>>> df
  country capital
0  France   Paris
1     USA      DC
2      UK  London
Basic case, same size, same index, in the same order (I guess whatever we do, it will work):
>>> df['country'] + ' - ' + df['capital']
0    France - Paris
1          USA - DC
2       UK - London
dtype: object
Same size and index, but index in different order. With row labels and automatic alignment, what we have is:
>>> df['country'] + ' - ' + df['capital'].sort_values()
0    France - Paris
1          USA - DC
2       UK - London
dtype: object
Without row labels (or without automatic alignment), I guess we would operate by row id, and rely on sorting for the alignment, e.g. df.sort_values('country_id').
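To show what operating by row id would mean for the same data, here is a sketch that contrasts the label-based result with a purely positional one. Dropping the labels with reset_index(drop=True) is only a way to emulate "no row labels" inside pandas.

```python
import pandas as pd

df = pd.DataFrame({"country": ["France", "USA", "UK"],
                   "capital": ["Paris", "DC", "London"]})

# Sorting reorders the values, and the labels (1, 2, 0) travel with them.
sorted_capitals = df["capital"].sort_values()

# Label-based: pandas realigns on the index, so the sort has no effect.
by_label = df["country"] + " - " + sorted_capitals

# Position-based: with the labels dropped, values are paired by row position.
by_position = df["country"] + " - " + sorted_capitals.reset_index(drop=True)

print(by_label.tolist())
print(by_position.tolist())
```

In the positional version the pairing is wrong unless both sides were sorted consistently beforehand, which is the burden that moves to the user.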
When the size of the dataframes is different, with automatic alignment, pandas fills with NA after aligning, and then operates:
>>> df['country'] + ' - ' + df[df.capital.str.len() > 3]['capital']
0    France - Paris
1               NaN
2       UK - London
dtype: object
Without row labels, I guess the best solution would probably be to fail if the sizes differ, and rely on a join / reindex to force the user to make the alignment explicit, e.g. df1 + df1.join(on='country_id', how='left').
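A sketch of what "fail, and make the user align explicitly" could look like, emulated in pandas. The function strict_add is hypothetical and stands in for a stricter + operator; the explicit step here uses reindex rather than a join.

```python
import pandas as pd

def strict_add(left, right):
    """Hypothetical strict variant of `+`: refuse to align implicitly."""
    if not left.index.equals(right.index):
        raise ValueError("row labels differ; align explicitly first")
    return left + right

df = pd.DataFrame({"country": ["France", "USA", "UK"],
                   "capital": ["Paris", "DC", "London"]})
long_capitals = df.loc[df["capital"].str.len() > 3, "capital"]

# strict_add(df["country"] + " - ", long_capitals) raises, because the
# filtered column only has rows 0 and 2.
# The caller states the alignment (and the NA fill) explicitly instead:
explicit = strict_add(df["country"] + " - ", long_capitals.reindex(df.index))
print(explicit.tolist())
```

The result is the same as with automatic alignment, but the NA row is something the user asked for rather than something that appeared silently.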
So, correct me if I'm wrong, but I think the decisions that need to be made regarding alignment are:
- Do we want row labels?
- Do we want automatic alignment?
- Do we want to automatically create NA rows if the index values don't match?
Thanks for the summary Marc. I think your three bullets perfectly capture the three levels to this issue.
I suppose there might be one more question: Do we leave binary operations between DataFrame objects out of the spec entirely? That sidesteps the issue of row labels & alignment. And if we do allow binary operations between
- DataFrame & scalars
- DataFrame & arrays (where an array is an unlabeled column of a dataframe. Only requirement is that the shape is compatible.)
then perhaps there isn't much of a loss in functionality?
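For comparison, this is how those two cases look in pandas today; neither involves any label alignment. The frame and the array values are made up for the example.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"x": [1, 2, 3], "y": [10, 20, 30]})

# DataFrame & scalar: the scalar is broadcast to every element.
plus_scalar = df + 1

# DataFrame & unlabeled array of the same shape: values are combined purely
# by position, and the result keeps the labels of the DataFrame.
plus_array = df + np.array([[1, 1], [2, 2], [3, 3]])

print(plus_scalar)
print(plus_array)
```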
On Fri, Jun 19, 2020 at 9:55 AM Marc Garcia [email protected] wrote:
Trying to structure the discussion a bit, this is how I see the different components of what is being discussed here (with an example): [...]
For reference, it seems like vaex raises when the lengths of the "column" (expression) don't match
It seems like this doesn't come up for vaex, which AFAICT doesn't implement binary operators (can you confirm that @maartenbreddels?)
Yes, basically vaex does not have row labels, so both operations do not make sense in the current state. There is a branch which lets the dataframe behave like a 2d array (nep13/nep18), meaning implicit row labels that are row numbers. In the case of a binary operator it will ignore the column labels, and only use the column index, similar to a 2d array.
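To illustrate the 2d-array semantics described here, a small sketch using pandas and NumPy. This is not vaex code and does not use the branch mentioned above; to_numpy() is only a stand-in for "ignore the column labels and use the column position".

```python
import pandas as pd

left = pd.DataFrame({"A": [1, 2], "B": [3, 4]})
right = pd.DataFrame({"X": [10, 20], "Y": [30, 40]})

# 2-D array semantics: column labels are ignored, and the first column is
# paired with the first column, the second with the second, and so on.
positional = left.to_numpy() + right.to_numpy()

# pandas label alignment, for contrast: no column names are shared, so
# every column of the result is NaN.
labelled = left + right

print(positional)
print(labelled)
```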