Support complex filter before merge procedure
Describe This Problem
A filter derived from the query predicates is applied to the record batch stream from each sst before the batches are fed to the merge iterator. However, the filter only supports a very simple form, a conjunction (AND) of binary expressions, so it doesn't work if the query predicate is more complex, e.g. where (hostname = '127.0.0.1' or hostname = '192.168.0.2') and timestamp between 'xxxx' and 'xxxx'.
Proposal
The crucial point here is how to make the filter procedure support complex predicate expressions. Basically there are two approaches:
- Utilize datafusion;
- Implement the filter logic manually.
And I vote for the first approach, but we have to figure out how to utilize datafusion to implement the filter logic.
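For comparison, here is a minimal sketch of what the second (manual) approach would roughly involve. It uses only the standard library; `Row`, `Pred`, `issue_predicate` and `filter_rows` are hypothetical names made up for illustration, not CeresDB or datafusion types. The point is only that a predicate tree can nest OR under AND, which a flat list of AND-ed binary expressions cannot express.

```rust
// Hypothetical row type for illustration; the real filter works on record batches.
#[derive(Debug, Clone, PartialEq)]
pub struct Row {
    pub hostname: String,
    pub timestamp: i64,
}

// A tiny predicate tree (an assumption of this sketch, not an existing type).
// Unlike a flat list of AND-ed binary expressions, `And` and `Or` can nest.
#[derive(Debug, Clone)]
pub enum Pred {
    HostnameEq(String),
    // Inclusive on both ends, like SQL BETWEEN.
    TimestampBetween(i64, i64),
    And(Box<Pred>, Box<Pred>),
    Or(Box<Pred>, Box<Pred>),
}

impl Pred {
    // Recursively evaluate the predicate against a single row.
    pub fn eval(&self, row: &Row) -> bool {
        match self {
            Pred::HostnameEq(h) => row.hostname == *h,
            Pred::TimestampBetween(lo, hi) => *lo <= row.timestamp && row.timestamp <= *hi,
            Pred::And(l, r) => l.eval(row) && r.eval(row),
            Pred::Or(l, r) => l.eval(row) || r.eval(row),
        }
    }
}

// Builds the predicate from the issue description:
// (hostname = '127.0.0.1' OR hostname = '192.168.0.2') AND timestamp BETWEEN lo AND hi
pub fn issue_predicate(lo: i64, hi: i64) -> Pred {
    Pred::And(
        Box::new(Pred::Or(
            Box::new(Pred::HostnameEq("127.0.0.1".to_string())),
            Box::new(Pred::HostnameEq("192.168.0.2".to_string())),
        )),
        Box::new(Pred::TimestampBetween(lo, hi)),
    )
}

// Keep only the rows that satisfy the predicate.
pub fn filter_rows(rows: &[Row], pred: &Pred) -> Vec<Row> {
    rows.iter().filter(|r| pred.eval(r)).cloned().collect()
}
```

Even this toy version has to grow its own expression types, type handling and evaluation rules for every operator it supports, which is a fair argument for reusing datafusion's expression machinery instead.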
Additional Context
No response
TSBS has been added to CI, so we can use it to compare performance before and after fixing this issue:
- https://github.com/CeresDB/ceresdb/actions/runs/3102504402#summary-8485564354
To utilize datafusion, we can do:
- Create PhysicalExpr from LogicalExpr via create_physical_expr.
- Implement the filter logic like FilterExecStream does in datafusion.
create_physical_expr: https://github.com/apache/arrow-datafusion/blob/45fc415daa7028559ef3477e53a184a114149f9e/datafusion/physical-expr/src/planner.rs#L42
FilterExecStream: https://github.com/apache/arrow-datafusion/blob/45fc415daa7028559ef3477e53a184a114149f9e/datafusion/core/src/physical_plan/filter.rs#L180
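The overall shape of those two steps (build the predicate once, then for each batch evaluate it into a boolean mask and keep the selected rows) can be sketched without any dependency. This is a std-only illustration, not datafusion code: `Batch`, `apply_mask` and `filter_batches` are hypothetical names, the closure stands in for an evaluated physical expression, and `apply_mask` stands in for a columnar filter kernel. Dropping fully filtered batches is a choice made in this sketch, not a claim about what FilterExecStream does.

```rust
// Hypothetical columnar batch with two equal-length columns.
#[derive(Debug, Clone, PartialEq)]
pub struct Batch {
    pub hostname: Vec<String>,
    pub timestamp: Vec<i64>,
}

impl Batch {
    pub fn num_rows(&self) -> usize {
        self.timestamp.len()
    }
}

// Keep only the rows whose mask entry is true.
// Stand-in for a columnar filter kernel (an assumption of this sketch).
pub fn apply_mask(batch: &Batch, mask: &[bool]) -> Batch {
    assert_eq!(mask.len(), batch.num_rows(), "mask must cover every row");
    let hostname = batch
        .hostname
        .iter()
        .zip(mask)
        .filter(|(_, keep)| **keep)
        .map(|(v, _)| v.clone())
        .collect();
    let timestamp = batch
        .timestamp
        .iter()
        .zip(mask)
        .filter(|(_, keep)| **keep)
        .map(|(v, _)| *v)
        .collect();
    Batch { hostname, timestamp }
}

// Wrap a batch iterator so that every batch is filtered before it reaches
// the consumer (in the issue's setting, the merge iterator).
// `predicate` returns one bool per row and can encode any boolean expression.
pub fn filter_batches<I, P>(input: I, predicate: P) -> impl Iterator<Item = Batch>
where
    I: Iterator<Item = Batch>,
    P: Fn(&Batch) -> Vec<bool>,
{
    input
        .map(move |batch| {
            let mask = predicate(&batch);
            apply_mask(&batch, &mask)
        })
        // Choice of this sketch: skip batches with no surviving rows.
        .filter(|batch| batch.num_rows() > 0)
}
```

Because the predicate is opaque to the stream wrapper, supporting a more complex expression only changes how the mask is computed, not how batches flow to the merge iterator.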
Maybe I can help do this task :D.
It will be appreciated if you volunteer to help.
@ygf11 I have updated the code location about the filtering procedure, and I hope it will help: https://github.com/CeresDB/ceresdb/blob/43a84ba3c2ddcee69906e70322060b6dc4e91ddc/analytic_engine/src/row_iter/record_batch_stream.rs#L137
I have updated the code location about the filtering procedure, and I hope it will help.
Thanks for the reminder, it helps a lot.