`na.rm` or something similar for aggregations
Not sure if there is an existing ticket, so creating this. Adding a na.rm(similar to R) or skipna (pandas flavour) parameter in aggregations might be useful
I observed this in functions like first:
from datatable import dt, f, by, sort
from numpy import nan
DT = dt.Frame([{'A': 1, 'B': 'a', 'C': 5.0},
{'A': 1, 'B': 'a', 'C': nan},
{'A': 2, 'B': None, 'C': 4.0},
{'A': 2, 'B': 'b', 'C': 4.0},
{'A': 2, 'B': 'b', 'C': nan},
{'A': 3, 'B': 'c', 'C': 9.0},
{'A': 3, 'B': None, 'C': nan},
{'A': 3, 'B': 'd', 'C': nan},
{'A': 3, 'B': None, 'C': 9.0},
{'A': 4, 'B': 'e', 'C': 8.0},
{'A': 4, 'B': 'e', 'C': nan},
{'A': 5, 'B': None, 'C': 2.0},
{'A': 6, 'B': 'h', 'C': nan},
{'A': 6, 'B': 'h', 'C': nan}])
DT
| A B C
| int64 str32 float64
-- + ----- ----- -------
0 | 1 a 5
1 | 1 a NA
2 | 2 NA 4
3 | 2 b 4
4 | 2 b NA
5 | 3 c 9
6 | 3 NA NA
7 | 3 d NA
8 | 3 NA 9
9 | 4 e 8
10 | 4 e NA
11 | 5 NA 2
12 | 6 h NA
13 | 6 h NA
[14 rows x 3 columns]
# returns the exact first value, even if it is a null
DT[:, f.B.first(), 'A']
| A B
| int64 str32
-- + ----- -----
0 | 1 a
1 | 2 NA
2 | 3 c
3 | 4 e
4 | 5 NA
5 | 6 h
[6 rows x 2 columns]
# a workaround is with sort
# this might be limiting in scenarios where you do not need the sort
# or you have other columns and the sorting affects your output/analysis
DT[:, f.B.first(), 'A', sort(f.B, na_position='last')]
| A B
| int64 str32
-- + ----- -----
0 | 1 a
1 | 2 b
2 | 3 c
3 | 4 e
4 | 5 NA
5 | 6 h
[6 rows x 2 columns]
# a cleaner, more readable option, and hopefully performant would be sth similar:
DT[:, f.B.first(skipna=True), 'A']
| A B
| int64 str32
-- + ----- -----
0 | 1 a
1 | 2 b
2 | 3 c
3 | 4 e
4 | 5 NA
5 | 6 h
[6 rows x 2 columns]
Another option might be to change the semantics for first/last to show non-null values if they exist.
If you want to get the first non-missing value you can just do
DT[f.B != None, first(f.B)]
See https://datatable.readthedocs.io/en/latest/api/dt/first.html for more details.
Actually, it seems that first() and rowfirst() are not really consistent. Because the former returns the first value, but the latter — the first non-missing one.
DT[f.B != None, first(f.B)] works great for non grouped operations, but not for group by operations - I noticed that when writing the docs for the functions. ( I thought I raised an issue back then, but sadly I didn't)
Do you mean this error?
NotImplementedError: FExpr_Func::evaluate_iby() not implemented yet
yes, @oleksiyskononenko , but that error is a known one and we can work around it. my point is if we are combining first, last, with other aggregation functions, in a groupby context, we won't necessarily get the first non null value, and there are scenarios where sorting before picking the first function might not be helpful (sorting in datatable is powerful, I do feel though that for picking first non-null value, it is probably overkill)
Well, if we implement evaluate_iby() we could filter out missings in the presence of group by. But even now, can you just filter them out before you do a group by? Something like this DT1 = DT[f.B != None, :], then do first/groupby on DT1. Won't it work as you need?
@oleksiyskononenko , that would fail if there are multiple aggregations:
DT[:, [f.B.first(), f.C.sum()], 'A']
Out[284]:
| A B C
| int32 str32 float64
-- + ----- ----- -------
0 | 1 a 5
1 | 2 NA 8
2 | 3 c 18
3 | 4 e 8
4 | 5 NA 2
5 | 6 h 0
[6 rows x 3 columns]
In [285]: DT[f.B != None, :][:, [f.B.first(), f.C.sum()], 'A']
Out[285]:
| A B C
| int32 str32 float64
-- + ----- ----- -------
0 | 1 a 5
1 | 2 b 4
2 | 3 c 9
3 | 4 e 8
4 | 6 h 0
[5 rows x 3 columns]
The only workaround at the moment is to pre-sort :
In [286]: DT[:, [f.B.first(),f.C.sum()], 'A', sort(f.B, na_position='last')]
Out[286]:
| A B C
| int32 str32 float64
-- + ----- ----- -------
0 | 1 a 5
1 | 2 b 8
2 | 3 c 18
3 | 4 e 8
4 | 5 NA 2
5 | 6 h 0
[6 rows x 3 columns]
Having a skipna could allow us having to not sort (sorting is fast in datatable, it is still an expensive operation and avoiding it if possible would be great right?) and still get the correct output
I see, I thought that by "multiple aggregations" you mean something like agg1(agg2(f[:])). Yes, if you do aggregations for multiple columns then you can't easily filter missings out.
Actually, I have some doubts over the sorting approach. Look at this example:
DT = dt.Frame(A=[1,1,2,1,2], B=[None,4,5,2,3])
print(DT[:, f[:], by("A")])
gives you
| A B
| int32 int32
-- + ----- -----
0 | 1 NA
1 | 1 4
2 | 1 2
3 | 2 5
4 | 2 3
[5 rows x 2 columns]
So I assume that requesting the first non-missing by group should give you
| A B
| int32 int32
-- + ----- -----
0 | 1 4
1 | 2 5
[2 rows x 2 columns]
But with the sorting approach
print(DT[:, first(f[:]), by("A"), sort("B", na_position="last")])
it is
| A B
| int32 int32
-- + ----- -----
0 | 1 2
1 | 2 3
[2 rows x 2 columns]
My feeling is that our documentation here is kind of wrong and it is just a coincidence it gives the correct results for the grouped frame mentioned in the example.
Great catch @oleksiyskononenko ! When you can, can you have a look at #3310
We had it on the roadmap some time ago, but it was never completed. It is good that we have this issue now.