re-engine-RE2 set UTF8 match flag to compute utf8 captures correctly and fix core dump

RXf_MATCH_UTF8 flag is required so that the matches ($1, $2...) seen by perl are computed correctly. Without this, the current code will have strings with wrong starting offset

Oct 27 '21 21:10 todd-richmond

dupe() must use the old pattern() as the StringPiece object is not the pattern. This caused core dumps in perl 5.34 with gcc 11

However, I just looked at the code to re-engine-PCRE2 and it doesn't make a new object. I don't know enough about regex+XS to know if this causes a leak in re-engine-RE2 or if PCRE2 is wrong, but one seems to be incorrect regexp *re = RegSV(rx); return re->pprivate;

Oct 27 '21 21:10 todd-richmond

there is 1 other call to new RE2(...) that I'm not sure about. It is trying to "stringify()", but it seems like that logic should just be simply passing the const char *exp variable as in new RE2(exp, otions) Nevermind - I think that would lose any changes from stringify() call

Oct 27 '21 22:10 todd-richmond

This will allow you to remove the unicode handling from the BUGS section in README too

Dec 18 '21 01:12 todd-richmond

Hi Todd, I came across your pull request when researching the bugs section of this module. Since we handle many Unicode emails, we are hesitant to put re2 in production. I'm wondering if you are using your patched version in a live system?

Oct 21 '22 03:10 amit777

Hi, this is still something I'd like to address, sorry I hadn't responded sooner...

This will allow you to remove the unicode handling from the BUGS section in README too

The fix here doesn't handle the fact a string in perl can be either UTF8 or latin1. It's not just compiling the regexp that matters, but a UTF-8 compiled regexp can be matched with Latin1 text and the other way around.

In particular these TODO tests would be a starting point: https://github.com/dgl/re-engine-RE2/blob/master/t/07.utf8.t but I think there would need to be more tests too.

I'm concerned this change doesn't change any tests, in particular it's using the UTF8 flag of the regexp to decide if the match should be UTF8 or not, which seems like a misunderstanding of how this works. Although it may appear to work because UTF-8 text is far more common than latin1 thesedays, I'd like to see tests exercising all combinations of match string and regexp with UTF8.

Since we handle many Unicode emails, we are hesitant to put re2 in production.

If you know both the regexp you're matching and the string are in the same character set you can use this safely (e.g. via scoping the compilation of a regexp using this). That's kind of what the BUGS section is trying to get at. I know this code is used in various email systems already.

Oct 21 '22 04:10 dgl

Thank you! I'm a little confused by your comment: "If you know both the regexp you're matching and the string are in the same character set you can use this safely (e.g. via scoping the compilation of a regexp using this)."

Could you give me an example of where/how they could be different?

Oct 21 '22 05:10 amit777

Could you give me an example of where/how they could be different?

The test I mentioned is the easiest, just remove the TODO lines and try this:

use strict;
use Test::More tests => 3;
#use re::engine::RE2;

{
  use utf8;

  # Match UTF-8 LHS against UTF-8 RHS
  ok "£" =~ /^£$/;
}

{
  use utf8;

  ok "\x{a3}" =~ /£/;
}

{
  use utf8;

  ok "£" =~ /^\x{a3}$/;
}

I commented out use re::engine::RE2 and they all passed. Uncomment and you'll see the last two fail. This test is depending on the fact that "\x{a3}" is a pound sign but perl will use latin1 encoding for it. Usually perl hides this from you, but you can use Dump from Devel::Peek to see what the value of the UTF8 flag is.

Oct 21 '22 05:10 dgl

Ahh, I see.. based on your examples, I believe our regex will be negatively impacted by this bug. I'm no where near an expert in how to fix this, but thank you for looking at it. I'm happy to test any fixes if you get a chance to take a stab at it!

Oct 21 '22 06:10 amit777

@amit777 we use this fix for 100s million to billions of emails per day in production so it is stress tested extensively. However, the calling code is for URL extraction and so we have 2 different regexes compiled - one latin1 regex to use against latin1 text and a much more complex/slow utf8 regex to use against utf8 text. As a result, I haven't verified all the cases mentioned above

@dgl i based these changes on code form the PCRE and PCRE2 XS implementations. It definitely fixes completely incorrect match offset values and crashes in our 2 regex case. There might be further work for other cases, but if it doesn't cause regressions, it will at least help those who ever need to act on utf8 text with utf8 regexes

Oct 24 '22 19:10 todd-richmond