datatable icon indicating copy to clipboard operation
datatable copied to clipboard

`na.rm` or something similar for aggregations

Open samukweku opened this issue 3 years ago • 9 comments

Not sure if there is an existing ticket, so creating this. Adding a na.rm(similar to R) or skipna (pandas flavour) parameter in aggregations might be useful

I observed this in functions like first:

from datatable import dt, f, by, sort
from numpy import nan

DT = dt.Frame([{'A': 1, 'B': 'a', 'C': 5.0},
 {'A': 1, 'B': 'a', 'C': nan},
 {'A': 2, 'B': None, 'C': 4.0},
 {'A': 2, 'B': 'b', 'C': 4.0},
 {'A': 2, 'B': 'b', 'C': nan},
 {'A': 3, 'B': 'c', 'C': 9.0},
 {'A': 3, 'B': None, 'C': nan},
 {'A': 3, 'B': 'd', 'C': nan},
 {'A': 3, 'B': None, 'C': 9.0},
 {'A': 4, 'B': 'e', 'C': 8.0},
 {'A': 4, 'B': 'e', 'C': nan},
 {'A': 5, 'B': None, 'C': 2.0},
 {'A': 6, 'B': 'h', 'C': nan},
 {'A': 6, 'B': 'h', 'C': nan}])

DT
   |     A  B            C
   | int64  str32  float64
-- + -----  -----  -------
 0 |     1  a            5
 1 |     1  a           NA
 2 |     2  NA           4
 3 |     2  b            4
 4 |     2  b           NA
 5 |     3  c            9
 6 |     3  NA          NA
 7 |     3  d           NA
 8 |     3  NA           9
 9 |     4  e            8
10 |     4  e           NA
11 |     5  NA           2
12 |     6  h           NA
13 |     6  h           NA
[14 rows x 3 columns]

# returns the exact first value, even if it is a null
DT[:, f.B.first(), 'A']

   |     A  B
   | int64  str32
-- + -----  -----
 0 |     1  a
 1 |     2  NA
 2 |     3  c
 3 |     4  e
 4 |     5  NA
 5 |     6  h
[6 rows x 2 columns]

# a workaround is with sort
# this might be limiting in scenarios where you do not need the sort
# or you have other columns and the sorting affects your output/analysis
DT[:, f.B.first(), 'A', sort(f.B, na_position='last')]
   |     A  B
   | int64  str32
-- + -----  -----
 0 |     1  a
 1 |     2  b
 2 |     3  c
 3 |     4  e
 4 |     5  NA
 5 |     6  h
[6 rows x 2 columns]

# a cleaner, more readable option, and hopefully performant would be sth similar:
DT[:, f.B.first(skipna=True), 'A']

   |     A  B
   | int64  str32
-- + -----  -----
 0 |     1  a
 1 |     2  b
 2 |     3  c
 3 |     4  e
 4 |     5  NA
 5 |     6  h
[6 rows x 2 columns]

Another option might be to change the semantics for first/last to show non-null values if they exist.

samukweku avatar Jul 06 '22 11:07 samukweku

If you want to get the first non-missing value you can just do

DT[f.B != None, first(f.B)]

See https://datatable.readthedocs.io/en/latest/api/dt/first.html for more details.

Actually, it seems that first() and rowfirst() are not really consistent. Because the former returns the first value, but the latter — the first non-missing one.

oleksiyskononenko avatar Jul 06 '22 20:07 oleksiyskononenko

DT[f.B != None, first(f.B)] works great for non grouped operations, but not for group by operations - I noticed that when writing the docs for the functions. ( I thought I raised an issue back then, but sadly I didn't)

samukweku avatar Jul 06 '22 23:07 samukweku

Do you mean this error?

NotImplementedError: FExpr_Func::evaluate_iby() not implemented yet

oleksiyskononenko avatar Jul 07 '22 00:07 oleksiyskononenko

yes, @oleksiyskononenko , but that error is a known one and we can work around it. my point is if we are combining first, last, with other aggregation functions, in a groupby context, we won't necessarily get the first non null value, and there are scenarios where sorting before picking the first function might not be helpful (sorting in datatable is powerful, I do feel though that for picking first non-null value, it is probably overkill)

samukweku avatar Jul 07 '22 04:07 samukweku

Well, if we implement evaluate_iby() we could filter out missings in the presence of group by. But even now, can you just filter them out before you do a group by? Something like this DT1 = DT[f.B != None, :], then do first/groupby on DT1. Won't it work as you need?

oleksiyskononenko avatar Jul 07 '22 18:07 oleksiyskononenko

@oleksiyskononenko , that would fail if there are multiple aggregations:

DT[:, [f.B.first(), f.C.sum()], 'A']
Out[284]:
   |     A  B            C
   | int32  str32  float64
-- + -----  -----  -------
 0 |     1  a            5
 1 |     2  NA           8
 2 |     3  c           18
 3 |     4  e            8
 4 |     5  NA           2
 5 |     6  h            0
[6 rows x 3 columns]

In [285]: DT[f.B != None, :][:, [f.B.first(), f.C.sum()], 'A']
Out[285]:
   |     A  B            C
   | int32  str32  float64
-- + -----  -----  -------
 0 |     1  a            5
 1 |     2  b            4
 2 |     3  c            9
 3 |     4  e            8
 4 |     6  h            0
[5 rows x 3 columns]

The only workaround at the moment is to pre-sort :

In [286]: DT[:, [f.B.first(),f.C.sum()], 'A', sort(f.B, na_position='last')]
Out[286]:
   |     A  B            C
   | int32  str32  float64
-- + -----  -----  -------
 0 |     1  a            5
 1 |     2  b            8
 2 |     3  c           18
 3 |     4  e            8
 4 |     5  NA           2
 5 |     6  h            0
[6 rows x 3 columns]

Having a skipna could allow us having to not sort (sorting is fast in datatable, it is still an expensive operation and avoiding it if possible would be great right?) and still get the correct output

samukweku avatar Jul 08 '22 02:07 samukweku

I see, I thought that by "multiple aggregations" you mean something like agg1(agg2(f[:])). Yes, if you do aggregations for multiple columns then you can't easily filter missings out.

Actually, I have some doubts over the sorting approach. Look at this example:

DT = dt.Frame(A=[1,1,2,1,2], B=[None,4,5,2,3])
print(DT[:, f[:], by("A")])

gives you

   |     A      B
   | int32  int32
-- + -----  -----
 0 |     1     NA
 1 |     1      4
 2 |     1      2
 3 |     2      5
 4 |     2      3
[5 rows x 2 columns]

So I assume that requesting the first non-missing by group should give you

   |     A      B
   | int32  int32
-- + -----  -----
 0 |     1      4
 1 |     2      5
[2 rows x 2 columns]

But with the sorting approach

print(DT[:, first(f[:]), by("A"), sort("B", na_position="last")])

it is

   |     A      B
   | int32  int32
-- + -----  -----
 0 |     1      2
 1 |     2      3
[2 rows x 2 columns]

My feeling is that our documentation here is kind of wrong and it is just a coincidence it gives the correct results for the grouped frame mentioned in the example.

oleksiyskononenko avatar Jul 08 '22 20:07 oleksiyskononenko

Great catch @oleksiyskononenko ! When you can, can you have a look at #3310

samukweku avatar Jul 09 '22 00:07 samukweku

We had it on the roadmap some time ago, but it was never completed. It is good that we have this issue now.

oleksiyskononenko avatar Sep 29 '22 22:09 oleksiyskononenko