Oleksiy Kononenko
@samukweku we need to address https://github.com/h2oai/datatable/issues/3081 to improve performance in the case when there is no group-by context. For grouped frames we're fully parallel now. For cumulative functions we actually...
@samukweku
> maybe you can explain more what you mean by parallelisation in terms of the actual data.

It means that we parallelize the loops that go over the frame rows, currently...
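To illustrate the idea of parallelizing a loop over rows, here is a minimal pure-Python sketch. It is not datatable's implementation (that lives in C++); the helper names `chunk_bounds` and `parallel_sum` are hypothetical, and a plain list stands in for a column. The row range is split into contiguous chunks, each chunk is reduced independently, and the partial results are combined at the end.

```python
from concurrent.futures import ThreadPoolExecutor


def chunk_bounds(nrows, nchunks):
    """Split the row range [0, nrows) into at most `nchunks` contiguous chunks."""
    if nrows == 0:
        return []
    step = -(-nrows // nchunks)  # ceiling division
    return [(start, min(start + step, nrows)) for start in range(0, nrows, step)]


def parallel_sum(column, nthreads=4):
    """Sum a column by reducing each chunk of rows in its own task."""
    bounds = chunk_bounds(len(column), nthreads)
    with ThreadPoolExecutor(max_workers=nthreads) as pool:
        # Each task only touches its own slice of rows.
        partials = pool.map(lambda b: sum(column[b[0]:b[1]]), bounds)
        return sum(partials)


print(parallel_sum(list(range(10)), nthreads=3))
```

In CPython the GIL prevents a real speed-up for this kind of work, so the sketch only shows how the rows are partitioned, not the performance gain a native implementation would get.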
@samukweku What functionality do you want to achieve with this function?
Actually, in datatable there is already a function called `count()` that is used to "Calculate the number of non-missing values for each column"; see https://datatable.readthedocs.io/en/latest/api/dt/count.html for more details. So the...
@vopani Well, I'm not sure why you think it is an unnatural and unintuitive name. The same name/behavior is used in, at least, [pandas](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.count.html) and [pyarrow](https://arrow.apache.org/docs/python/generated/pyarrow.compute.count.html). datatable just sticks to...
I guess it all depends on the definition. If we define `count()` as a function to count values, then obviously it should skip missing values — and then this is...
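To make the two definitions concrete, here is a small pure-Python sketch. It does not call datatable; `count_non_missing` and `count_rows` are hypothetical helper names, and `None` stands in for a missing value.

```python
def count_non_missing(values):
    """Count values, skipping missing ones (the `count()` semantics discussed above)."""
    return sum(1 for v in values if v is not None)


def count_rows(values):
    """Count rows, whether or not the value is missing."""
    return len(values)


column = [1, None, 3, None, 5]
print(count_non_missing(column), count_rows(column))
```

The two numbers differ only when the column contains missing values, which is exactly where the naming question matters.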
@samukweku I guess ```python DT[:, dt.cummax(f[:]), by('D')] ``` and ```python df.groupby('D')[['A','C']].cummax() ``` are doing different things. Just compare the results ```python | D A B C | str32 int32 void...
@samukweku It all depends on how you build the code: you can run either `make build` or `make debug`, see https://datatable.readthedocs.io/en/latest/start/install.html#install-datatable-in-editable-mode To test performance, you need to build it in the...
@samukweku I guess there is a ticket https://github.com/h2oai/datatable/issues/1070
Actually, even `DT[:, :, sort(f[:], reverse=[True, False])]` will fail with ```python ValueError: Mismatch between the number of columns (ncols=1) to be sorted and number of elements (nflags=2) in...