Regex support?

Open liveforeverx opened this issue 3 years ago • 5 comments

First, thank you for a great library!

Do you plan (or when do you plan) to support regex manipulations in polars?

An example would be building a new column that extracts multiple matches from another column using a regex and combines the output. Is it currently possible to achieve this in another way?

I would be open to contributing this with some guidance.

liveforeverx avatar Sep 14 '22 14:09 liveforeverx

Isn't it possible to do so using Elixir's existing regex capabilities?

I don't know if that's exactly what you want, but I think you could write a function that returns the desired value after evaluating a cell with Regex.match?/2, map this function over an existing column, and then use Explorer.DataFrame.mutate to create a new one.

kimjoaoun avatar Nov 19 '22 08:11 kimjoaoun

You can use Series.transform with Elixir regexes but that requires materializing all results to Elixir and back, which is not the most efficient format.
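
As a rough illustration of the per-cell approach discussed above, here is a minimal sketch in plain Elixir. It assumes the column values are available as a list of strings; the module and function names are hypothetical, and no Explorer calls are used so the snippet stays self-contained. The same one-argument function is presumably what would be handed to Series.transform, at the cost of the round trip through Elixir mentioned above.

```elixir
defmodule RegexColumnSketch do
  # Hypothetical helper: extracts every run of digits from one cell
  # and joins the matches with a comma.
  def extract_numbers(cell) when is_binary(cell) do
    ~r/[0-9]+/
    |> Regex.scan(cell)
    |> List.flatten()
    |> Enum.join(",")
  end

  # Applies the helper to every value of a column held as a plain list.
  def new_column(values) when is_list(values) do
    Enum.map(values, &extract_numbers/1)
  end
end

# Cells without a match produce an empty string.
IO.inspect(RegexColumnSketch.new_column(["a=1 b=22", "no digits", "c=333"]))
# prints ["1,22", "", "333"]
```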

josevalim avatar Nov 19 '22 08:11 josevalim

https://pola-rs.github.io/polars-book/user-guide/howcani/data/strings.html

So this one is certainly possible but opens up a bit of a can of worms. In the Python Polars, there is a string namespace. I think we could totally do that. Something like Series.String.lengths or Series.String.filter. There's a regex builder re-exported by polars so that's also pretty straightforward. So I guess it's mostly a design decision. As someone who works with a lot of text, I think it'd be very useful. @philss @josevalim what do you think about something like Series.String? Is there another option I'm not seeing? The only pause I have is whether we'd want to then namespace other things. E.g. Series.List in support of #296?

cigrainger avatar Dec 11 '22 22:12 cigrainger

Honestly, I am not the biggest fan of the namespacing, it feels everything could fit in a single namespace and in the worst case we use string_ prefixes? It looks like less to manage and closer to SQL too?

josevalim avatar Dec 11 '22 22:12 josevalim

Sure, I'm just a bit wary of confusing verbiage but agreed it's best to give it a try in the single namespace first and if it feels painful look at string_ prefixes.

cigrainger avatar Dec 11 '22 22:12 cigrainger

I'm planning to close this issue this week. We now have two new functions to work with regexes: re_contains/2 and re_replace/3 - see #894.

I would like to add 4 more functions:

  • Series.count_matches/2 mirroring Python Polars' Series.count_matches/2, but only for literals.
  • Series.re_count_matches/2 which is the same function from above, but with regex support.
  • Series.extract_all/2 mirroring the function of the same name in Polars, but this time it only accepts a string representing a regex. The resultant series is of type {:list, :string}.
  • Series.extract_groups/2 that also only accepts a string representing a regex. It mirrors the function of the same name in Polars. The resultant dtype is {:struct, [{string(), :string}]}, with the key names as the group indexes or names.

Please let me know if you see any problems with this, or if you think we could name them differently.

philss avatar Apr 14 '24 20:04 philss

Awesome work, @philss!

What about re_extract_all and re_extract_groups to hint that the input is a regex? Do we have other functions which accept regex but don't have that prefix?

billylanchantin avatar Apr 14 '24 21:04 billylanchantin

@billylanchantin Thanks! :D And yeah, I think it would be nice to stick to the pattern and use re_ to hint that the input is a regex. I will use that.

Do we have other functions which accept regex but don't have that prefix?

No, we don't. All of the others accept literals.

philss avatar Apr 15 '24 01:04 philss

Agreed on the proposed new functions!

josevalim avatar Apr 15 '24 12:04 josevalim

Btw, Elixir has scan (extract_all) and named_captures (extract_groups). Should we name them re_scan and re_named_captures to mirror Elixir?
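
For reference, a short sketch of the two standard-library functions named above, Regex.scan/2 and Regex.named_captures/2. The wrapper module is hypothetical and only exists to make the snippet runnable; it shows Elixir's own behaviour, not the Explorer functions being proposed.

```elixir
defmodule ElixirRegexNaming do
  # Regex.scan/2 returns one list per match (the extract_all analogue).
  def scan_example do
    Regex.scan(~r/[0-9]+/, "a=1, b=22")
  end

  # Regex.named_captures/2 returns a map keyed by group name
  # (the extract_groups analogue), or nil when nothing matches.
  def named_captures_example do
    Regex.named_captures(~r/(?<key>a|b)=(?<value>[0-9]+)/, "b=22")
  end
end

IO.inspect(ElixirRegexNaming.scan_example())
# prints [["1"], ["22"]]
IO.inspect(ElixirRegexNaming.named_captures_example())
# prints %{"key" => "b", "value" => "22"}
```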

josevalim avatar Apr 15 '24 19:04 josevalim

@josevalim yeah, I agree we can mirror Elixir. re_scan and re_named_captures it will be.

I'm facing a small problem with the re_named_captures implementation: we need to calculate the dtype, but we cannot do that without running the regex against the backend. This is not a problem for the eager backend, but it is a problem when we are using a "lazy series". I'm thinking of having an optional "names" argument that is required for lazy series, and we would always name the fields following these names. The downside is that people may need to repeat the names both inside the regex and outside it. WDYT?

It would look like this:

DF.mutate(df, [parts: re_named_captures(a, "(a|b)=([0-9]+)", ["key", "value"])])

philss avatar Apr 15 '24 19:04 philss

@philss The Rust regex crate should have a function that extracts the names of a regex. Could we perhaps use it? Or is the issue that we don't call the lazy backend at all when building a query?

josevalim avatar Apr 15 '24 19:04 josevalim

Or is the issue that we don't call the lazy backend at all when building a query?

@josevalim yeah, we don't touch the lazy backend at the moment we build the query. So we cannot (or we should not) call the backend at this point.
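
To make the constraint concrete, here is one conceivable way to derive field names without touching any backend: Elixir's own Regex.names/1. This is only a sketch and not something the thread settled on. It assumes the pattern also compiles under Elixir's regex engine, which may not hold for every pattern the Rust regex crate accepts, and the module and function names are hypothetical. The dtype shape follows the {:struct, [{string(), :string}]} form described earlier in the thread.

```elixir
defmodule DtypeSketch do
  # Hypothetical helper: builds a struct dtype from the named groups of
  # a regex given as a string, using only Elixir's Regex module.
  # Assumption: the pattern is valid for Elixir's engine as well as for
  # the backend's engine.
  def dtype_from_pattern(pattern) when is_binary(pattern) do
    fields =
      pattern
      |> Regex.compile!()
      |> Regex.names()
      |> Enum.map(fn name -> {name, :string} end)

    {:struct, fields}
  end
end

IO.inspect(DtypeSketch.dtype_from_pattern("(?<key>a|b)=(?<value>[0-9]+)"))
# prints {:struct, [{"key", :string}, {"value", :string}]}
```

Note that Regex.names/1 only reports named groups, so a pattern with plain positional groups such as "(a|b)=([0-9]+)" yields an empty field list under this sketch.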

philss avatar Apr 15 '24 20:04 philss

Let’s skip this function for now then. Eventually we should have a way of calling the backend to get this info.

josevalim avatar Apr 15 '24 20:04 josevalim

I will try one last thing, and if it does not work, I will give up for now.

philss avatar Apr 15 '24 20:04 philss