datafusion icon indicating copy to clipboard operation
datafusion copied to clipboard

Is pre-compile pattern string in regexp_match operation

Open zhuliquan opened this issue 1 year ago • 3 comments

Is your feature request related to a problem or challenge?

I noticed below code: https://github.com/apache/datafusion/blob/6a4a280e3cf70fe5f1a1cfe7c2de13e4c39f89bb/datafusion/physical-expr/src/expressions/binary.rs#L537-L564 This looks like every time record_batch is evaluated, it will execute the compiled pattern string and use the compiled results to match arrow-array

Describe the solution you'd like

when building binary physical expr , we can pre-compile pattern string if op is regex_match

Describe alternatives you've considered

No response

Additional context

No response

zhuliquan avatar Jun 27 '24 16:06 zhuliquan

Hi @zhuliquan -- I believe you are correct that the regexp is re-compiled for each batch. This is mentioned as one of the use cases for https://github.com/apache/datafusion/issues/8051

I think it would be a nice improvement to look into pre-compiling the regexp somehow, though last time I checked the only benchmark we have (one of the clickbench queries) compiling the regexp was not a significant consumer of time

alamb avatar Jun 27 '24 21:06 alamb

Hi @zhuliquan -- I believe you are correct that the regexp is re-compiled for each batch. This is mentioned as one of the use cases for #8051

I think it would be a nice improvement to look into pre-compiling the regexp somehow, though last time I checked the only benchmark we have (one of the clickbench queries) compiling the regexp was not a significant consumer of time

I think should pre-compiled when create binary-physical-expr, instead of evaluating record_batch. https://github.com/apache/datafusion/blob/57280e42dc2391ab65c24c0fb52032942d3d85a8/datafusion/physical-expr/src/expressions/binary.rs#L60-L66

zhuliquan avatar Jun 28 '24 01:06 zhuliquan

I think should pre-compiled when create binary-physical-expr, instead of evaluating record_batch.

I think this makes sense to me. Thank you

One thing to look into might be to replace the regexp match operator with a function (and have the function's implementation store the precompiled regexp)

One tricky part might be serialization (in datafusion-proto). We can probably handle it via an extension codec or something.

alamb avatar Jun 28 '24 10:06 alamb