
Add option to FilterExec to prevent re-using input batches

Open andygrove opened this issue 1 year ago • 3 comments

Which issue does this PR close?

N/A

Rationale for this change

DataFusion Comet currently maintains a fork of FilterExec with a small modification that changes how filtered batches are created. Because our scan re-uses arrays, we require that FilterExec does not pass input batches through unchanged when the predicate evaluates to true for every row in a batch.

We would like to make the DataFusion implementation of FilterExec customizable to meet our needs.

What changes are included in this PR?

Add a new boolean parameter so that we can choose whether FilterExec is allowed to return unmodified input batches.
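As a rough illustration of the behaviour this option controls, here is a minimal std-only sketch. It does not use the DataFusion or Arrow APIs: `Batch`, `filter_batch`, and `allow_passthrough` are hypothetical names, and an `Arc<Vec<i32>>` stands in for a reference-counted Arrow array.

```rust
use std::sync::Arc;

/// Hypothetical stand-in for an Arrow array: a shared, immutable buffer.
type Batch = Arc<Vec<i32>>;

/// Keep the rows of `input` whose mask entry is true.
/// `allow_passthrough` is an assumed name for the new boolean option.
fn filter_batch(input: &Batch, mask: &[bool], allow_passthrough: bool) -> Batch {
    assert_eq!(input.len(), mask.len(), "mask must have one entry per row");

    let all_true = mask.iter().all(|keep| *keep);
    if all_true && allow_passthrough {
        // Cheap path: return the same buffer; only the reference count changes.
        return Arc::clone(input);
    }

    // Otherwise build a fresh buffer, even when every row is kept.
    let mut kept = Vec::with_capacity(input.len());
    for (value, keep) in input.iter().zip(mask) {
        if *keep {
            kept.push(*value);
        }
    }
    Arc::new(kept)
}

fn main() {
    let input: Batch = Arc::new(vec![1, 2, 3]);
    let mask = [true, true, true];

    let shared = filter_batch(&input, &mask, true);
    let copied = filter_batch(&input, &mask, false);

    println!("shared aliases input: {}", Arc::ptr_eq(&input, &shared));
    println!("copied aliases input: {}", Arc::ptr_eq(&input, &copied));
}
```

With the flag disabled, the output has the same contents as the input but lives in its own allocation, which is the property Comet needs.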

Are these changes tested?

I did not add tests yet. I wanted to get some feedback on approach first.

Are there any user-facing changes?

andygrove, Aug 16 '24 22:08

If the predicate evaluates to true for every row, the filter typically just copies the array pointer rather than the data. However, there are cases where you might want to copy the underlying data even when the predicate is entirely true, at the cost of some operator performance.
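To make the trade-off concrete, the following sketch shows one way a pointer copy can go wrong when the producer re-uses its buffer. It is an assumption about the general hazard, not a description of Comet's scan: `ReusingScan` and `next_batch` are hypothetical, and `Rc<RefCell<...>>` is used only to model a buffer that is overwritten in place.

```rust
use std::cell::RefCell;
use std::rc::Rc;

/// Hypothetical scan that refills one shared buffer for every batch.
struct ReusingScan {
    buffer: Rc<RefCell<Vec<i32>>>,
}

impl ReusingScan {
    fn new() -> Self {
        Self {
            buffer: Rc::new(RefCell::new(Vec::new())),
        }
    }

    /// Overwrite the shared buffer in place and return a handle to it.
    fn next_batch(&mut self, rows: &[i32]) -> Rc<RefCell<Vec<i32>>> {
        {
            let mut buf = self.buffer.borrow_mut();
            buf.clear();
            buf.extend_from_slice(rows);
        }
        Rc::clone(&self.buffer)
    }
}

fn main() {
    let mut scan = ReusingScan::new();
    let first = scan.next_batch(&[1, 2, 3]);

    // Pointer copy: what a pass-through filter would hand downstream.
    let passed_through = Rc::clone(&first);
    // Data copy: what a filter that always materializes would hand downstream.
    let deep_copy: Vec<i32> = first.borrow().clone();

    // The scan produces its next batch by overwriting the same buffer.
    let _second = scan.next_batch(&[7, 8, 9]);

    println!("passed through: {:?}", passed_through.borrow());
    println!("deep copy:      {:?}", deep_copy);
}
```

After the second call to `next_batch`, the passed-through handle observes the new rows, while the deep copy still holds the rows of the first batch.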

Is there a use case other than Comet itself?

metegenez, Aug 17 '24 11:08

Marking as draft as I think this PR is no longer waiting on feedback.

alamb, Aug 20 '24 18:08

Thank you for your contribution. Unfortunately, this pull request is stale because it has been open 60 days with no activity. Please remove the stale label or comment or this will be closed in 7 days.

github-actions[bot], Oct 20 '24 02:10