urlpattern icon indicating copy to clipboard operation
urlpattern copied to clipboard

Matching `(` and `)` in a regexp matcher requires escaping

Open lucacasonato opened this issue 2 years ago • 4 comments

Fails, even though it's valid:

const pattern = new URLPattern({ pathname: '/([()])' });
Uncaught TypeError: tokenizer error: invalid regex: nested groups must start with ? (at char 1)

OK:

const pattern = new URLPattern({ pathname: '/([\\(\\)])' });

This is because while tokenizing the pattern, we think the second ( is a nested group rather than just a char in a regexp character class.

Fixing this will make the tokenizer more complicated. Is it worth it?

lucacasonato avatar Jul 10 '23 20:07 lucacasonato

Actually I think we need to fix this for sure. Consider these two patterns:

// valid regexp, but pattern tokenizer thinks there is a nested group that isn't closed
const pattern = new URLPattern({ pathname: '/([(?])' });

// valid regexp group, valid urlpattern, except that this throws because "Invalid regular expression: /^(?:/([))\]\)$/u: Unterminated character class" 
const pattern = new URLPattern({ pathname: '/([)])' });

lucacasonato avatar Jul 10 '23 20:07 lucacasonato

Yea, seems like a real problem to me. I'm not sure when I will have the bandwidth to look at this, though. Do you have a proposed fix?

wanderview avatar Jul 10 '23 20:07 wanderview

I think we have to keep track of [ and ] in the tokenizer while parsing regexp tokens, and ignore all ( and ) while between a [ and ] in that regexp. This can be a simple boolean as character classes can't be nested (no need to keep track of depth).

lucacasonato avatar Jul 10 '23 21:07 lucacasonato

character classes can't be nested (no need to keep track of depth).

Is this true? Given #178 we'll support syntax like [\d--[07]] (every digit except 0 and 7) as part of the UnicodeSets regexp mode.

jeremyroman avatar Sep 13 '23 14:09 jeremyroman