Regex support?
First, thank you for the great library!
Do you plan (or when do you plan) to support regex manipulations in polars?
An example would be building a new column that extracts multiple matches from another column using a regex and combines the output. Is it currently possible to achieve this in another way?
I would be open to contributing this with some guidance.
Isn't it possible to do this using Elixir's existing regex capabilities?
I don't know if that's exactly what you want, but I think you could write a function that evaluates a cell with Regex.match?/2 and returns the desired value, map this function over an existing column, and then use Explorer.DataFrame.mutate to create the new column.
You can use Series.transform with Elixir regexes, but that requires materializing all results to Elixir and back, which is not the most efficient approach.
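As a minimal sketch of the per-cell function suggested above, the example below uses only Elixir's standard Regex module and a plain list standing in for a column. The module name RegexColumnSketch, the function extract_numbers/1, and the digit pattern are hypothetical; the same function could in principle be passed to Series.transform, at the materialization cost already mentioned.

```elixir
defmodule RegexColumnSketch do
  # Hypothetical helper: extracts every run of digits from a cell and
  # joins the matches with "-". Returns nil for nil cells or no matches.
  def extract_numbers(nil), do: nil

  def extract_numbers(cell) when is_binary(cell) do
    case Regex.scan(~r/[0-9]+/, cell) do
      [] -> nil
      matches -> matches |> List.flatten() |> Enum.join("-")
    end
  end
end

# A plain list stands in for the column values here (assumption).
column = ["a=1 b=22", "no digits", "c=333"]
new_column = Enum.map(column, &RegexColumnSketch.extract_numbers/1)
# new_column is ["1-22", nil, "333"]
```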
https://pola-rs.github.io/polars-book/user-guide/howcani/data/strings.html
So this one is certainly possible but opens up a bit of a can of worms. In the Python Polars, there is a string namespace. I think we could totally do that. Something like Series.String.lengths or Series.String.filter. There's a regex builder re-exported by polars so that's also pretty straightforward. So I guess it's mostly a design decision. As someone who works with a lot of text, I think it'd be very useful. @philss @josevalim what do you think about something like Series.String? Is there another option I'm not seeing? The only pause I have is whether we'd want to then namespace other things. E.g. Series.List in support of #296?
Honestly, I am not the biggest fan of the namespacing, it feels everything could fit in a single namespace and in the worst case we use string_ prefixes? It looks like less to manage and closer to SQL too?
Sure, I'm just a bit wary of confusing verbiage but agreed it's best to give it a try in the single namespace first and if it feels painful look at string_ prefixes.
I'm planning to close this issue this week. We now have two new functions to work with regexes: re_contains/2 and re_replace/3 - see #894.
I would like to add 4 more functions:
- Series.count_matches/2, mirroring Python Polars' Series.count_matches/2, but only for literals.
- Series.re_count_matches/2, which is the same function as above, but with regex support.
- Series.extract_all/2, mirroring the function of the same name in Polars, but this time it only accepts a string representing a regex. The resultant series is of type {:list, :string}.
- Series.extract_groups/2, which also only accepts a string representing a regex. It mirrors the function of the same name in Polars. The resultant dtype is {:struct, [{string(), :string}]}, with the key names as the group indexes or names.
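To illustrate why the literal and regex variants of the count function are worth keeping apart, here is a rough sketch in plain Elixir operating on a single string. CountSketch and its two functions are hypothetical names, not part of Explorer, and the semantics (non-overlapping matches) are an assumption.

```elixir
defmodule CountSketch do
  # Literal count: number of non-overlapping occurrences of `literal`.
  # Splitting on the literal yields one more piece than there are matches.
  def count_literal(string, literal) when literal != "" do
    length(String.split(string, literal)) - 1
  end

  # Regex count: number of non-overlapping matches of `pattern`,
  # where `pattern` is a string that gets compiled to a regex.
  def count_regex(string, pattern) do
    pattern |> Regex.compile!() |> Regex.scan(string) |> length()
  end
end

# The same input "." means different things in each variant:
CountSketch.count_literal("a.b.c", ".")
# 2 (two literal dots)
CountSketch.count_regex("a.b.c", ".")
# 5 (as a regex, "." matches every character)
```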
Please let me know if you see any problems with this, or if you think we could name them differently.
Awesome work, @philss!
What about re_extract_all and re_extract_groups to hint that the input is a regex? Do we have other functions which accept regex but don't have that prefix?
@billylanchantin Thanks! :D And yeah, I think it would be nice to stick to the pattern and use re_ to hint that the input is a regex. I will use that.
> Do we have other functions which accept regex but don't have that prefix?

No, we don't. All of the others accept literals.
Agreed on the proposed new functions!
Btw, Elixir has scan (extract_all) and named_captures (extract_groups). Should we name them re_scan and re_named_captures to mirror Elixir?
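For reference, this is roughly how the two Elixir functions mentioned here behave on a single string. It is only a sketch of the standard library behaviour; how closely re_scan and re_named_captures would follow it in Explorer is an assumption. The named groups key and value are chosen for illustration.

```elixir
regex = ~r/(?<key>a|b)=(?<value>[0-9]+)/
input = "a=1 b=22"

# Regex.scan/3 returns every match. With capture: :first, only the full
# match is kept, which is closest to a {:list, :string} result.
all_matches = regex |> Regex.scan(input, capture: :first) |> List.flatten()
# all_matches is ["a=1", "b=22"]

# Regex.named_captures/2 returns a map of group name to captured text
# for the first match only, or nil when nothing matches.
groups = Regex.named_captures(regex, input)
# groups is %{"key" => "a", "value" => "1"}

no_match = Regex.named_captures(regex, "nothing here")
# no_match is nil
```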
@josevalim yeah, I agree we can mirror Elixir. re_scan and re_named_captures it will be.
I'm facing a little problem with the re_named_captures implementation: we need to calculate the dtype, but we cannot do it without running the regex against the backend. This is not a problem for the eager backend, but it is a problem when we are using a "lazy series". I'm thinking of having an optional "names" argument that is required for the lazy series, and we always name the fields following these names. The downside is that people may need to repeat the names inside the regex and outside it. WDYT?
It would look like this:
DF.mutate(df, [parts: re_named_captures(a, "(a|b)=([0-9]+)", ["key", "value"])])
@philss The Rust regex crate should have a function that extracts the names of a regex's capture groups. Could we perhaps use it? Or is the issue that we don't call the lazy backend at all when building a query?
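The idea of reading the group names from the pattern alone, without running it against any data, can be illustrated with Elixir's own Regex.names/1. This is only an analogy: Elixir regexes and the Rust regex crate are different engines, and the dtype tuple built below is a hypothetical illustration of the {:struct, [{string(), :string}]} shape, not a call into Explorer. Note that Regex.names/1 only reports named groups, so unnamed groups would need separate handling.

```elixir
regex = ~r/(?<key>a|b)=(?<value>[0-9]+)/

# The names come from the compiled pattern; no input string is needed.
names = Regex.names(regex)
# names is ["key", "value"]

# Hypothetical: derive the struct dtype from the names alone.
dtype = {:struct, Enum.map(names, fn name -> {name, :string} end)}
# dtype is {:struct, [{"key", :string}, {"value", :string}]}
```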
> Or is the issue that we don't call the lazy backend at all when building a query?
@josevalim yeah, we don't touch the lazy backend at the moment we build the query. So we cannot (or we should not) call the backend at this point.
Let’s skip this function for now then. Eventually we should have a way of calling the backend to get this info.
I will try one last thing, and if it does not work, I will give up for now.