Implement cumulative functions
- [x] cumsum() https://github.com/h2oai/datatable/pull/3257
- [x] cumprod() https://github.com/h2oai/datatable/pull/3304
- [x] cummax() https://github.com/h2oai/datatable/pull/3288
- [x] cummin() https://github.com/h2oai/datatable/pull/3288
- [x] cumcount() https://github.com/h2oai/datatable/pull/3310
- [x] ngroup() - not strictly cumulative https://github.com/h2oai/datatable/pull/3310
- [x] forward fill https://github.com/h2oai/datatable/pull/3311
- [x] backward fill https://github.com/h2oai/datatable/pull/3311
- [ ] rank
- [ ] rolling aggregations
I think these are very useful functions. It would be great to see them soon!
These would be great and are long overdue.
Is there an ETA for when these would be available (next release)?
@vopani I'm currently working on them. I can't specify an ETA, though, and contributions are welcome. You can test the cumsum function by downloading the latest dev version.
@vopani Kindly create a minimal example of cumsum for both datatable and pandas; it would be easier to grok.
@samukweku Yeah, apologies; let me run some more tests and share better examples/feedback.
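In the meantime, here is a minimal sketch of the kind of side-by-side comparison requested (assuming the latest dev build, where `cumsum()` is available as an f-expression method):

```python
import pandas as pd
from datatable import dt, f

data = [3, 1, 4, 1, 5]

# pandas: cumulative sum of a Series
pd.Series(data).cumsum().tolist()   # [3, 4, 8, 9, 14]

# datatable: cumulative sum of a column via an f-expression
DT = dt.Frame(A=data)
DT[:, f.A.cumsum()]                 # the same running totals, as a one-column frame
```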
Thanks again for these cumulative functions, they are super useful and help in bridging the gap to pandas.
Thanks also to @oleksiyskononenko and @st-pasha for their guidance through my C++ journey.
A test comparison with np.cumsum, where numpy is twice as fast and appears to run on a single core:
```python
import numpy as np
from datatable import dt, f

In [12]: a = np.arange(10_000_000)

In [13]: %timeit a.cumsum()
22.2 ms ± 196 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [14]: DT = dt.Frame(a)

In [15]: %timeit DT[:, f.C0.cumsum()]
46.7 ms ± 1.82 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
```
@samukweku we need to address https://github.com/h2oai/datatable/issues/3081 to improve performance in the case when there is no group-by context. For grouped frames we're fully parallel now.
For cumulative functions we actually don't have a separate issue, and I'm not sure we can call cumsum() etc. reducers. In any case, our logic is now parallel in terms of the groups, but not in terms of the actual data. We probably need to improve that.
There is still the question of why numpy is 2x faster on a single core. We need to do some profiling to see if there are any bottlenecks in our code.
Thanks @oleksiyskononenko. Maybe you can explain more what you mean by parallelisation in terms of the actual data. How is profiling done in C++? By the way, is there any way to do interactive work in C++?
@oleksiyskononenko I was reading up on cumulative sums and found a possible performance option with a Fenwick tree. What are your thoughts on it? Is it worth the effort?
As an aside, when we call get_edit_datatable, does it return a vector? Where do I trace this to see its implementation?
I'm also thinking of building on your example of templates and converting the cumsum implementation to support cumprod, since they are essentially the same, just with a change of operator.
@samukweku
> maybe you can explain more what you mean by parallelisation in terms of the actual data.
It means that we parallelize loops that go over the frame rows; currently, our parallel loops for cumulative functions/reducers go over the groups.
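For instance, a cumulative sum can be parallelized over the data itself: scan each chunk independently, then add per-chunk offsets in a second pass. Here is a conceptual Python sketch (not datatable's actual C++ code; the per-chunk scans are what would run on separate threads):

```python
import numpy as np

def chunked_cumsum(a, n_chunks=4):
    """Two-pass cumulative sum: pass 1 scans each chunk independently
    (parallelizable across threads); pass 2 adds the running total of
    all preceding chunks as a constant offset."""
    chunks = np.array_split(a, n_chunks)
    local = [np.cumsum(c) for c in chunks]                   # pass 1: independent per chunk
    offsets = np.cumsum([0] + [l[-1] for l in local[:-1]])   # totals of preceding chunks
    return np.concatenate([l + off for l, off in zip(local, offsets)])

a = np.arange(10)
assert np.array_equal(chunked_cumsum(a), np.cumsum(a))
```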
> How is profiling done in C++?
To start with, you can read a couple of threads on SO, e.g. https://stackoverflow.com/questions/375913/how-can-i-profile-c-code-running-on-linux
> by the way, is there any way to do interactive work in C++?
Probably yes, but I don't have experience with that.
> I was reading up on cumulative sums and found a possible performance option with a Fenwick tree. What are your thoughts on it? Is it worth the effort?
First, let's look through the code to see if there are any obvious bottlenecks.
Second, we should decide whether the current cumsum() performance is acceptable. Is it really 2x slower than numpy?
Then we could discuss ways to improve performance if needed.
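For reference, a Fenwick (binary indexed) tree gives O(log n) prefix-sum queries while supporting O(log n) point updates. For a one-shot cumsum over static data, a single linear pass is already optimal, so it is not an obvious win here. A minimal Python sketch of the structure:

```python
class FenwickTree:
    """Binary indexed tree: O(log n) point updates and prefix-sum queries."""

    def __init__(self, n):
        self.tree = [0] * (n + 1)  # 1-based internal indexing

    def add(self, i, delta):
        """Add `delta` to element i (0-based)."""
        i += 1
        while i < len(self.tree):
            self.tree[i] += delta
            i += i & -i  # jump to the next node covering index i

    def prefix_sum(self, i):
        """Return the sum of elements [0, i]."""
        i += 1
        total = 0
        while i > 0:
            total += self.tree[i]
            i -= i & -i  # drop the lowest set bit
        return total
```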
> As an aside, when we call get_edit_datatable, does it return a vector? Where do I trace this to see its implementation?
We don't have a get_edit_datatable() method; perhaps you mean get_data_editable()? It returns a pointer to the column's underlying buffers. The implementation depends on the column type. See, for instance:
https://github.com/h2oai/datatable/blob/main/src/core/column/sentinel_fw.cc#L234-L238
https://github.com/h2oai/datatable/blob/main/src/core/column/sentinel_str.cc#L118-L122
> I'm also thinking of building on your example of templates and converting the cumsum implementation to support cumprod, since they are essentially the same, just with a change of operator.
Sounds like a great idea. Btw, one more difference between cumsum() and cumprod() is the default value: 0 vs 1.
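The template idea, sketched in Python for clarity (the real code would be a C++ template parameterized on the operator): the scan loop is identical, and only the binary operator and its starting value change.

```python
import operator

def cumulate(values, op, start):
    """Generic cumulative scan: `op` is the binary operator,
    `start` its identity element (0 for sum, 1 for product)."""
    out, acc = [], start
    for v in values:
        acc = op(acc, v)
        out.append(acc)
    return out

cumulate([1, 2, 3, 4], operator.add, 0)   # [1, 3, 6, 10] -> cumsum
cumulate([1, 2, 3, 4], operator.mul, 1)   # [1, 2, 6, 24] -> cumprod
```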
@oleksiyskononenko @Peter-Pasta @vopani Should we stick to the name cumcount or use row_number? I bring up row_number because of SQL; checking to see which one is more intuitive.
@samukweku What functionality do you want to achieve with this function?
@oleksiyskononenko It essentially produces row numbers, but it is more helpful when you want the row number per group. #2892 is the reason behind the cumulative functions. In pandas it is referred to as cumcount; in datatable it can be pulled off with 1..N or something similar; in Postgres/SQL it is usually referred to as row_number. I hope I explained it well enough?
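For concreteness, this is the behavior pandas provides (numbering rows within each group, starting from 0):

```python
import pandas as pd

df = pd.DataFrame({"g": ["a", "a", "b", "a", "b"]})
df.groupby("g").cumcount().tolist()   # [0, 1, 0, 2, 1] -- per-group row numbers
```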
cumcount is a very natural (and common) term used by Data Scientists.
Which is also why pandas uses the same.
Thanks for the feedback @vopani, we are on the right track with the naming then. I will go ahead and create the function with cumcount as the name.
Actually, in datatable there is already a function called count() that is used to

> Calculate the number of non-missing values for each column

see https://datatable.readthedocs.io/en/latest/api/dt/count.html for more details.

So a function named cumcount() would be expected to be the cumulative equivalent of count(), but in this case it would not be. My feeling is that if we call a function smthcount(), it is expected to count something, while in reality this function only returns the corresponding row ids, which there is no need to count since they're already known internally.
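To make the clash concrete, count() skips missing values (this is its documented behavior), so a cumulative analogue of it would not be a row counter:

```python
from datatable import dt, f

DT = dt.Frame(A=[1, None, 3])
DT[:, dt.count(f.A)]   # 2 -- the missing value is not counted
```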
The datatable count() is very unnatural and unintuitive, and should ideally be renamed :-)
Even SQL's COUNT means the number of rows.
But OK, for the benefit of this discussion, maybe cumrows() or cumrowcount() might be suitable. We could also have a cumcount() that is the cumulative equivalent of count(). It is a useful aggregation, but many users might misinterpret its result initially; hopefully sufficient documentation will help.
I think count, depending on its usage, can also count non-missing values.
I feel using cumcount will be familiar to users coming from pandas and data science generally, but row_number feels very explicit. Unless there is a more intuitive name; cumrowcount seems odd to me.
> ROW_NUMBER is a SQL ranking function that assigns a sequential rank number to each new record in a partition.

That's the definition from SQL Server.

> cumcount - Number each item in each group from 0 to the length of that group - 1.

That's the definition from pandas.
@vopani Well, I'm not sure why you think it is an unnatural and unintuitive name. The same name/behavior is used in, at least, pandas and pyarrow; datatable just sticks to the same name and convention.
What I find unintuitive, though, is pandas using the name count() to calculate non-missing values and cumcount() to produce a simple row counter that doesn't care about the elements' validity.
As for the function name, I'm not sure what a good name would be at this point. We probably need to review what others like pandas, numpy, pyarrow, data.table, etc. are doing to make a good guess.
> The same name/behavior is used in, at least, pandas and pyarrow; datatable just sticks to the same name and convention.
Yes, this complaint is across libraries actually.
> Well, I'm not sure why you think it is an unnatural and unintuitive name.
(Being a teacher) Most of my students keep getting it wrong (without realizing it), because it implements count with a condition without being explicit about it.
Anyway, numpy doesn't have a direct implementation and is the most verbose, though it would likely be implemented in the count_nonzero format. Currently, you'd need to do `np.count_nonzero(~np.isnan(data))` or use `np.sum()`.
I'd still recommend cumcount or cumrows for the function name, but feel free to explore more options/ideas. I'm just keen to use the aggregation, whatever the name, as it has wide use cases in many dataset operations.
I guess it all depends on the definition. If we define count() as a function to count values, then obviously it should skip missing values, and then this is not an "implicit condition" anymore. Some people may think that this function should count rows instead, and this is where the confusion comes from.
The name probably originated from pandas, and it's good that we match it. Maybe we should call the new function cumcount() for the same reason.
@samukweku I guess

`DT[:, dt.cummax(f[:]), by('D')]`

and

`df.groupby('D')[['A','C']].cummax()`

are doing different things. Just compare the results:
```
      | D          A     B        C
      | str32  int32  void  float64
----- + -----  -----  ----  -------
    0 | a          2    NA      5.4
    1 | a          2    NA      5.4
    2 | a          2    NA      5.4
    3 | a          2    NA      5.4
    4 | a          2    NA      5.4
    5 | a          2    NA      5.4
    6 | a          2    NA      5.4
    7 | a          2    NA      5.4
    8 | a          2    NA      5.4
    9 | a          2    NA      5.4
   10 | a          2    NA      5.4
   11 | a          2    NA      5.4
   12 | a          2    NA      5.4
   13 | a          2    NA      5.4
   14 | a          2    NA      5.4
    … | …          …     …        …
49995 | b          5    NA    4.323
49996 | b          5    NA    4.323
49997 | b          5    NA    4.323
49998 | b          5    NA    4.323
49999 | b          5    NA    4.323

[50000 rows x 4 columns]
```
and
```
          A      C
0       2.0  5.400
1       NaN  5.400
2       5.0  2.200
3       5.0  4.323
4       5.0  4.323
...     ...    ...
49995   2.0  5.400
49996   NaN  5.400
49997   5.0  4.323
49998   5.0  4.323
49999   5.0  4.323

[50000 rows x 2 columns]
```
While datatable really calculates cummax() within the a and b groups, pandas does something subtly different: basically the same computation, but without sorting and re-arranging the rows.
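A toy sketch of the difference (hypothetical data; the key point is the row order of the result):

```python
import pandas as pd
from datatable import dt, f, by

df = pd.DataFrame({"D": ["b", "a", "b"], "A": [3, 1, 2]})

# pandas groups internally but keeps the original row order:
df.groupby("D")["A"].cummax().tolist()   # [3, 1, 3]

# datatable re-arranges the result into groups ("a" rows first, then "b"):
DT = dt.Frame(df)
DT[:, dt.cummax(f.A), by("D")]
```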
Btw, datatable performance is a factor of 20 better for me:
```python
In [54]: %timeit DT[:, dt.cummax(f[:]), by('D')]
6.63 ms ± 281 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
```
Are you running datatable in a debug or release mode?
Thanks @oleksiyskononenko. How can I tell which mode I'm running it in? I just ran it in my terminal. What mode should I run it in, and how do I switch between modes?
If you don't mind, could you add the pandas timeit output as well?
@samukweku It all depends on how you build the code: you can do either make build or make debug, see https://datatable.readthedocs.io/en/latest/start/install.html#install-datatable-in-editable-mode. To test performance, you need to build it in release mode.
Yes, here is pandas; it is comparable to what you posted:

```python
3.21 ms ± 12 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
```
Anyway, if you look at the output I attached above, you will see that groupby does different things in pandas vs datatable. In pandas it doesn't re-arrange the rows, but only creates groups internally, while in datatable the result is a grouped frame with the rows re-arranged into the corresponding groups.
What this means is that we're comparing different things here, not purely the cummax() performance. The cummax() performance could be compared in the case when there is no groupby; for that case I'm getting comparable results on my end, with datatable being slightly faster. Both are running single-threaded, though; datatable obviously could run in parallel, but that's something we have discussed in a separate issue.
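For reproducibility, an apples-to-apples setup without groupby might look like the following (a sketch only; actual timings vary by machine and build mode):

```python
import numpy as np
import pandas as pd
from datatable import dt, f

a = np.arange(10_000_000)
s = pd.Series(a)
DT = dt.Frame(a)

# All three are single linear scans, with no group-by involved:
# %timeit np.maximum.accumulate(a)
# %timeit s.cummax()
# %timeit DT[:, dt.cummax(f.C0)]
```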
Thanks for the feedback @oleksiyskononenko. Regarding the groupby semantics, is there a ticket for not rearranging the groups? I believe it would give better performance. Maybe a sort=False argument like pandas has?
@samukweku I guess there is a ticket https://github.com/h2oai/datatable/issues/1070