datafusion icon indicating copy to clipboard operation
datafusion copied to clipboard

[Epic] Unify `WindowFunction` Interface (remove built in list of `BuiltInWindowFunction` s)

Open alamb opened this issue 2 years ago • 5 comments

Is your feature request related to a problem or challenge?

For many of the same reasons as listed on https://github.com/apache/arrow-datafusion/issues/8045, having two types of aggregate functions ("built in" --BuiltInWindowFunction and WindowUDF is problematic for two reasons:

  1. There are some features that may not be available to User Defined Window Functions (such as reversing FIRST_VALUE and LAST_VALUE)
  2. Users can not easily choose which window functions to include (which will likely be especially problematic as we work to add more functions)

Describe the solution you'd like

I propose moving DataFusion to only use WindowURFs and remove BuiltInWindowFunction for the same reasons as https://github.com/apache/arrow-datafusion/issues/8045

We will keep the existing WindowUDF interface as much as possible, while also potentially providing an easier way to define them.

Describe alternatives you've considered

Additional context

Proposed implementation steps:

  • [x] https://github.com/apache/arrow-datafusion/issues/8711
  • [x] https://github.com/apache/arrow-datafusion/issues/8734
  • [x] https://github.com/apache/arrow-datafusion/issues/9527
  • [x] https://github.com/apache/datafusion/issues/12029
  • [x] https://github.com/apache/datafusion/pull/11287
  • [x] https://github.com/apache/datafusion/issues/12373
  • [x] https://github.com/apache/datafusion/pull/12030
  • [x] https://github.com/apache/datafusion/issues/12648
  • [ ] https://github.com/apache/datafusion/issues/12649
  • [ ] https://github.com/apache/datafusion/issues/12694
  • [ ] https://github.com/apache/datafusion/issues/12695
  • [x] https://github.com/apache/datafusion/issues/12802

alamb avatar Jan 01 '24 18:01 alamb

I'd like to try a small POC and migrate ROW_NUMBER to WindowUDF trait

comphead avatar Mar 19 '24 21:03 comphead

I'd like to try a small POC and migrate ROW_NUMBER to WindowUDF trait

That would be awesome. Thank you

I recommend trying to put it in its own crate if possible (datfusion-window-functions perhaps?) but that doesn't have to be part of the POC

alamb avatar Mar 21 '24 17:03 alamb

I'm going to move back to working on datafusion-python for a bit, but would like to work on this afterwards.

Also I think we have some functions that are implemented in aggregate that should be moved to window. Namely first_value and last_value but it's worth taking a look through all the functions to see where they are defined.

Notes (for myself) on how postgresql divides them:

  • https://www.postgresql.org/docs/current/functions-aggregate.html
  • https://www.postgresql.org/docs/current/functions-window.html

timsaucer avatar Jul 24 '24 12:07 timsaucer

Thank you @jcsherin -- I added it tothe list above

alamb avatar Oct 01 '24 11:10 alamb

this has all tasks done 🚀 ok to close?

findepi avatar Nov 16 '24 18:11 findepi

Sounds good!

comphead avatar Nov 18 '24 17:11 comphead

There is one final small cleanup I think that could help. Filed it as

  • https://github.com/apache/datafusion/issues/13473

alamb avatar Nov 18 '24 20:11 alamb

Thanks for keeping this clean @findepi and @comphead

alamb avatar Nov 18 '24 20:11 alamb