datafusion icon indicating copy to clipboard operation
datafusion copied to clipboard

Do we need to escape search string as it's used in regexp? Wondering what's the result of `contains("abcdefg", ".*")`

Open alamb opened this issue 1 year ago • 6 comments

          Do we need to escape search string as it's used in regexp? Wondering what's the result of `contains("abcdefg", ".*")`

Originally posted by @waynexia in https://github.com/apache/datafusion/pull/10879#discussion_r1635767599

alamb avatar Jun 15 '24 16:06 alamb

cc @Lordworms

alamb avatar Jun 15 '24 16:06 alamb

Sorry for the late review since I was busy this week. In the beginning, I was just trying to keep the same format as other ScalarUDF which utilize arrow-rs methods to implement functionality so I just chose arrow::reglike. I can fix it to use str.contains

Lordworms avatar Jun 15 '24 22:06 Lordworms

I was just trying to keep the same format as other ScalarUDF which utilize arrow-rs methods to implement functionality so I just chose arrow::reglike.

Makes sense. I think in this case, however, the function shouldn't actually have regexp support, so it would be better to use str.contains

I can fix it to use str.contains

Thank you!

alamb avatar Jun 17 '24 09:06 alamb

Thanks @Lordworms. I'm thinking about @alamb's idea https://github.com/apache/datafusion/pull/10879#discussion_r1636789004, which only implements contains as a placeholder for translating/planning. And it would finally become some other thing like LIKE after the optimization phase. A similar thing is Expr::Wildcard, it's to reflect * from SQL but doesn't have a corresponding physical expr.

waynexia avatar Jun 18 '24 03:06 waynexia

One benefit of using LIKE is that it already has a highly optimized arrow implementation (as in will actually use substring if the patttern looks like substr% etc).

alamb avatar Jun 18 '24 10:06 alamb

I'll refactor it to use LIKE

Lordworms avatar Jun 27 '24 05:06 Lordworms