
include ability to match on regex

Open werkstattcodes opened this issue 2 years ago • 3 comments

Hello,

First, congrats on this amazing package. I am simply stunned by its performance.

I was wondering whether there is any way to expand the set of functions to include regex matches. I am currently using fuzzyjoin::regex_left_join, but unfortunately, in my use case it's simply too slow.

Unfortunately, I don't know Rust, so I can't contribute anything in this regard. In any case, thanks again for this powerful package which should be known much more widely.

Roland

werkstattcodes avatar May 31 '23 06:05 werkstattcodes

Hi Roland,

Thanks for reaching out, and thanks for your kind words about my package.

It surprises me a little to hear that the fuzzyjoin implementation is not fast enough for you. If I understand regex joins correctly, I believe they should finish in linear time using linear memory. This is because (again, as I understand it) they simply extract the text matching a regular expression from the columns they match on and then perform exact matching on the extracted strings. Both steps take linear time, so the algorithm should also run in linear time. If the runtime is scaling super-linearly, that might point to an issue with the implementation in the fuzzyjoin package.
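As a rough illustration of the extract-then-match idea described above, here is a minimal sketch in Python using only the standard library. It is not code from fuzzyjoin or zoomerjoin; the function name `extract_then_join` and the sample data are made up for illustration, and it assumes the join key is the first regex match in each string.

```python
import re
from collections import defaultdict

def extract_then_join(left, right, pattern):
    """Inner-join two lists of strings on the first regex match in each string."""
    rx = re.compile(pattern)

    # Pass 1: index the right-hand rows by their extracted key.
    index = defaultdict(list)
    for row in right:
        m = rx.search(row)
        if m:
            index[m.group(0)].append(row)

    # Pass 2: extract the key from each left-hand row and look it up exactly.
    pairs = []
    for row in left:
        m = rx.search(row)
        if m:
            for match in index.get(m.group(0), []):
                pairs.append((row, match))
    return pairs

# Hypothetical sample data: join on a code such as "A12".
left = ["order A12 shipped", "order B07 pending", "no code here"]
right = ["A12 widget", "B07 gadget", "C99 gizmo"]
result = extract_then_join(left, right, r"[A-Z][0-9]{2}")
print(result)
```

Each table is scanned once and the lookup is a hash-table probe, so the work grows with the number of rows plus the number of output pairs, provided the per-row regex search is cheap.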

That said, I would be interested in adding this functionality to the package, and will have a think about it over the next few weeks. A Rust implementation would be faster than the R implementation and also easy to multi-thread, so I think it would be worth exploring.

In the meantime, I would suggest combining the str_extract function from stringr with the logical joins from dplyr to process the joins manually. This would allow you to process the joins in linear time, even if it is slightly cumbersome.

Best, and hope this helps, Ben

beniaminogreen avatar May 31 '23 17:05 beniaminogreen

I've decided that joining on regular expressions is within scope, and am starting to work on a prototype on the regex_join branch. I am working on the documentation, and still exploring the behavior I would like the regex_join family of functions to have, but please feel free to let me know if you have any feedback.

Edit: I just read the fuzzyjoin documentation, and I realize I did not understand what most people mean by a regex inner join. I thought it meant extracting regular-expression matches from a column and then joining on those matches. I may revise the behavior so it's more like fuzzyjoin.
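For contrast, here is a minimal Python sketch of the fuzzyjoin-style meaning as I read its documentation: one table holds strings, the other holds regular expressions, and a pair of rows is joined when the string matches the pattern. This is an illustration under that assumption, not fuzzyjoin's or zoomerjoin's actual code; `regex_inner_join` and the sample data are hypothetical.

```python
import re

def regex_inner_join(texts, patterns):
    """Return every (text, pattern) pair where the pattern matches the text."""
    compiled = [(p, re.compile(p)) for p in patterns]
    # Naive approach: test each text against each pattern.
    return [(t, p) for t in texts for p, rx in compiled if rx.search(t)]

# Hypothetical sample data.
texts = ["apple pie", "banana split", "cherry tart"]
patterns = [r"^a", r"an", r"tart$"]
pairs = regex_inner_join(texts, patterns)
print(pairs)
```

Written this way, the join tests every text against every pattern, so the work grows with the product of the two table sizes rather than their sum. That may be one reason this kind of join is slow on large inputs.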

beniaminogreen avatar Jun 18 '23 23:06 beniaminogreen

Brilliant. Very much looking forward to trying it out. Many thanks.

werkstattcodes avatar Jun 19 '23 00:06 werkstattcodes