libgdf icon indicating copy to clipboard operation
libgdf copied to clipboard

[REVIEW] NULL support for hash-based group by

Open jrhemstad opened this issue 7 years ago • 8 comments

This PR adds support for NULL values to the hash-based groupby implementation.

The build_aggregation_table kernel was updated such that if a row is invalid, then it is not inserted into the hash table.

jrhemstad avatar Sep 15 '18 00:09 jrhemstad

Are there any tests that should be added here? Or is everything covered by existing tests?

scopatz avatar Sep 20 '18 18:09 scopatz

Code LGTM otherwise. Thanks for putting this in @jrhemstad!

scopatz avatar Sep 20 '18 18:09 scopatz

This requires changes on the PyGDF side in making sure to allocate validity buffers for all columns so please coordinate before merging.

kkraus14 avatar Sep 21 '18 14:09 kkraus14

This requires changes on the PyGDF side in making sure to allocate validity buffers for all columns so please coordinate before merging.

@kkraus14 is this handled in pygdf now? can we merge?

nsakharnykh avatar Sep 25 '18 21:09 nsakharnykh

@nsakharnykh I believe there's still issues with binaryops if we always always allocate a validity mask, please hold off for the time being.

kkraus14 avatar Sep 25 '18 21:09 kkraus14

@kkraus14 ping, are there still issues with binaryops?

nsakharnykh avatar Sep 27 '18 21:09 nsakharnykh

@nsakharnykh just pulled in latest changes and running tests now. I know sort based groupbys don't handle validity masks properly, but I believe binaryops were actually okay. I apologize for being mistaken earlier.

kkraus14 avatar Sep 27 '18 21:09 kkraus14

@nsakharnykh This PR needs to be merged first: https://github.com/gpuopenanalytics/pygdf/pull/246 Which will always allocate validity masks for you so you won't encounter a situation where validity masks aren't allocated and you need to create one.

kkraus14 avatar Sep 27 '18 21:09 kkraus14