feat: Add `ProgressiveEval` operator
Which issue does this PR close?
Closes https://github.com/apache/datafusion/issues/10488
Rationale for this change
In InfluxDB IOx, when the inputs of SortPreservingMerge are all sorted on the sort key and their data do not overlap, we replace SortPreservingMerge with ProgressiveEval which:
- Avoids starting all input streams at once
- Avoids having to compare any keys (doesn't actually do a merge)
We wrote about using this operator here: https://www.influxdata.com/blog/making-recent-value-queries-hundreds-times-faster/
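For readers unfamiliar with the idea, here is a minimal sketch of it in plain Rust using only the standard library. It is not the `ProgressiveEvalExec` code in this PR and does not use any DataFusion API. `LazyInput`, `ranges_are_disjoint_and_ordered`, `progressive_eval`, and the `limit` parameter are all hypothetical names introduced for illustration, and each input is modeled as a closure that only "starts" when it is called.

```rust
/// Hypothetical stand-in for one sorted input: the key range it covers and a
/// closure that starts the stream (here, just yields its rows) when called.
pub struct LazyInput {
    pub min_key: i64,
    pub max_key: i64,
    pub open: Box<dyn FnOnce() -> Vec<i64>>,
}

/// Assumed precondition check: the inputs, in the given order, cover key
/// ranges that do not overlap. The strict `<` is a conservative choice.
pub fn ranges_are_disjoint_and_ordered(inputs: &[LazyInput]) -> bool {
    inputs.windows(2).all(|pair| pair[0].max_key < pair[1].min_key)
}

/// Reads the inputs one after another instead of merging them.
/// No keys are compared, and an input is only opened once all earlier
/// inputs are exhausted and more rows are still needed.
pub fn progressive_eval(inputs: Vec<LazyInput>, limit: usize) -> Vec<i64> {
    let mut out = Vec::new();
    for input in inputs {
        if out.len() >= limit {
            // Enough rows produced: the remaining inputs are never started.
            break;
        }
        let rows = (input.open)();
        let remaining = limit - out.len();
        out.extend(rows.into_iter().take(remaining));
    }
    out
}
```

With three inputs covering keys 1..=3, 4..=6, and 7..=9 and a `limit` of 4, this sketch opens only the first two inputs. A merge would have to start all three before emitting its first row.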
What changes are included in this PR?
Adding a new ProgressiveEvalExec
Are these changes tested?
Yes, unit tests
Are there any user-facing changes?
Not yet, because it is not hooked into any plans yet
@alamb : The PR is ready for review
What is the plan for actually using this code in DataFusion?
I am hesitant to put code in DataFusion that is not useful to other users of DataFusion as that basically adds maintenance burden to the community for no benefit.
We should at least have a plan / POC / description of how we are going to use this operator in plans.
I think perhaps we can use the code added by @suremarc in https://github.com/apache/datafusion/issues/7490 to determine the conditions under which the optimizer can use ProgressiveEval instead of SortPreservingMerge
@alamb : My first step after this is merged is to use it in IOx. So it will be used, but I understand it won't be visible to DF.
Then I am happy to look into https://github.com/apache/datafusion/issues/7490 to see if there is a quick change to use ProgressiveEval instead of SortPreservingMerge
@alamb I have just looked at the code for how IOx uses ProgressiveEval and we can make it general. So the next steps will be:
1. After this is merged, I will upgrade IOx to pick it up and use ProgressiveEval in IOx
2. Port what I do in (1) to DF
This means the optimization will be fully in DF, not in IOx at all
closing/reopening to retrigger CI checks that have gotten stuck for some reason
@alamb and I have chatted and even though the ProgressiveEval operator in this PR is fully implemented and tested, it is not integrated into any query plan yet. The usage of this operator in InfluxDB IOx is very specific to Influx. To make it more general to DF, it will need more investigation. Thus we decided not to merge this PR until someone is interested in integrating it into DF query plans
Follow on / next steps here: https://github.com/apache/datafusion/issues/10316#issuecomment-2113301756
Marking this PR as draft until someone is ready to hook it into the optimizer
Thank you for your contribution. Unfortunately, this pull request is stale because it has been open 60 days with no activity. Please remove the stale label or comment or this will be closed in 7 days.