datafusion icon indicating copy to clipboard operation
datafusion copied to clipboard

feat: Add `ProgressiveEval` operator

Open NGA-TRAN opened this issue 1 year ago • 6 comments

Which issue does this PR close?

Closes https://github.com/apache/datafusion/issues/10488

Rationale for this change

In InfluxDB IOx, when the inputs of SortPreservingMerge are all sorted on the sort key and their data do not overlap, we replace SortPreservingMerge with ProgressiveEval which:

  1. Avoids starting all input streams at once
  2. Avoids having to compare any keys (doesn't actually do a merge)

We wrote about using this operator here: https://www.influxdata.com/blog/making-recent-value-queries-hundreds-times-faster/

What changes are included in this PR?

Adding a new ProgressiveEvalExec

Are these changes tested?

Yes, unit tests

Are there any user-facing changes?

Not yet because it is not hooked in any plans yet

NGA-TRAN avatar May 13 '24 18:05 NGA-TRAN

@alamb : The PR is ready for review

NGA-TRAN avatar May 13 '24 18:05 NGA-TRAN

What is the plan for actually using this code in DataFusion?

I am hesitant to put code in DataFusion that is not useful to other users of DataFusion as that basically adds maintenance burden to the community for no benefit.

We should at least have a plan / POC / description of how we are going to use this operator in plans.

I think perhaps we can use the code added by @suremarc in thttps://github.com/https://github.com/apache/datafusion/issues/7490 to determine conditions when the optimizer can use ProgressiveEval instead of SortPreservingMerge

@alamb : My first step after this is merged in to use it in IOx. So it will be used but I understand it won't be visible to DF.

Then I am happy to look into https://github.com/apache/datafusion/issues/7490 to see if there is a quick change to use ProgressiveEval instead of SortPreservingMerge

NGA-TRAN avatar May 13 '24 19:05 NGA-TRAN

@alamb I have just looked at the code how IOx use ProgressiveEval and we can make it general. So next steps will be:

  1. After this is merged, I will upgrade it to IOx and use ProgressiveEval in IOx
  2. Port what I do in (1) to DF

This means the optimization will be fully in DF, not in IOx at all

NGA-TRAN avatar May 13 '24 19:05 NGA-TRAN

closing/reopening to retrigger CI checks that have gotten stuck for some reason

alamb avatar May 14 '24 12:05 alamb

@alamb and I have chatted and even though the ProgressesiveEval operator in this PR is fully implemented and tested, it is not integrated into any query plan yet. The usage of this operator in InfluxDB IOx is very specific to Influx. To make it more general to DF, it will need more investigation. Thus we decided not merge this PR until someone is interested in intergrating it in DF query plans

NGA-TRAN avatar May 14 '24 16:05 NGA-TRAN

Follow on / next steps here: https://github.com/apache/datafusion/issues/10316#issuecomment-2113301756

Marking this PR as draft until someone is ready to hook it in to the optimzier

alamb avatar May 15 '24 19:05 alamb

Thank you for your contribution. Unfortunately, this pull request is stale because it has been open 60 days with no activity. Please remove the stale label or comment or this will be closed in 7 days.

github-actions[bot] avatar Jul 15 '24 01:07 github-actions[bot]