varnish-cache icon indicating copy to clipboard operation
varnish-cache copied to clipboard

Fix VRE_capture() interface wrt more matching groups than group count

Open nigoroll opened this issue 4 years ago • 12 comments

The new VRE_capture() interface does not provide a way to detect the case where the number of matching subexpressions is higher than the group count passed.

This patch basically restores the original pcre (pcre1 if you will) behaviour to return a number higher than the number of groups to signal this case.

Alternatively, if we wanted to have the pcre2-like semantics ...

The return from pcre2_match() is [...] zero if the vector of offsets is too small

... we would basically need to make the VRE_capture() interface identical to vre_capture() and allow to pass a count pointer, otherwise we had no way to return the number of matches.

ping @slimhazard

ref: https://gitlab.com/uplex/varnish/libvmod-re/-/merge_requests/1

nigoroll avatar Oct 28 '21 17:10 nigoroll

On a related question: Shouldn't callers of VRE_capture() be given a way to query how many capture groups are required?

nigoroll avatar Nov 04 '21 13:11 nigoroll

Another related question: Or should VRE_compile() offer a way to specify "no more than X capture groups"?

nigoroll avatar Nov 04 '21 13:11 nigoroll

On a related question: Shouldn't callers of VRE_capture() be given a way to query how many capture groups are required?

pcre2 has the means to get that info from a compiled regex, so VRE could expose it.

slimhazard avatar Nov 04 '21 17:11 slimhazard

Another related question: Or should VRE_compile() offer a way to specify "no more than X capture groups"?

Not sure what you mean by this. If the caller doesn't want to capture more than N groups, well then don't.

One way to ensure that is to write the regex so that it has no more than N captures.

Do you have something in mind like "tell pcre2 to never capture than N groups, because I'll never need more"? pcre2 has a lot of bells and whistles, but I don't recall that one. There are so many ways to configure pcre2 that I could easily be wrong. But I think it's likely that Philip Hazel reasoned that if you don't want more than N captures, then you wouldn't have more in your regex in the first place.

slimhazard avatar Nov 04 '21 17:11 slimhazard

If the caller doesn't want to capture more than N groups, well then don't.

The point is to fail the compilation if the pattern contains more than N groups, rather than failing later for a match.

nigoroll avatar Nov 05 '21 09:11 nigoroll

The point is to fail the compilation if the pattern contains more than N groups, rather than failing later for a match.

Got it. That's certainly possible and should be easy, same reason as for the other point -- pcre2 can tell you how many groups there are for a compiled regex.

slimhazard avatar Nov 05 '21 10:11 slimhazard

About that, VCL regsub[all]() and the underlying VRT_regsub() and VRE_sub() are limited to positional capture groups, so that limits us to 9 capture groups plus the special "all of them" group 0. In order not to break existing VCL, we didn't adopt pcre2_substitute() that has its own very similar syntax but different ($1 vs \1). It also has named capture groups which we don't take advantage of for the same reason.

Because of this hard limit, I'm tempted to say querying the number of groups should be a pcre2 operation not available in VRE, accessible via VRE_unpack().

dridi avatar Nov 08 '21 14:11 dridi

After the last comment from @Dridi I have now added a compile time check for a maximum of 9 capture groups: As explained by @Dridi , we have no facility in varnish-cache to use more capture groups, so we should not accept them at compile time. As a side effect, this could also avoid potential memory consumption issues when matching.

For the case where capture groups are not used for capturing, a hint is output with the error message:

too many capturing groups (maximum 9) - tip: use (?:...) for non capturing groups

ping @slimhazard

nigoroll avatar Nov 16 '21 13:11 nigoroll

Has it been possible in the past to have more than 9 capturing groups in a regex, and yet not a problem as long as you don't go higher than 9 in any use of regsub()? That's a marginal case, of course, I'm just wondering if this could be someone's breaking change, if the number of groups has never been checked as an error condition before.

slimhazard avatar Nov 16 '21 13:11 slimhazard

@slimhazard yes, this is a breaking change, thus the tip in the error message

nigoroll avatar Nov 16 '21 13:11 nigoroll

I think it is too heavy-handed to limit the number of capturing groups at regex compile time, and it goes against the spirit of 6503249ca94bf6b0729e290c422f5169fe1a9ccc.

Not everyone knows that a group doesn't have to capture:

  • (capturing )?group
  • (?:simply optional)?group

My comment about the 10 capturing groups was regarding our regsub() implementation that uses its own syntax with substitution from \0 to \9. As far as we are concerned, we only support the 9 capturing groups plus the special 0th group. It shouldn't be forbidden to have more capture groups, it's simply that I don't see the lack of support for querying the number of capturing groups as a regression since to my knowledge the former VRE API didn't offer that, and contrary to the former VRE API we can access the pcre2_code directly.

In its current state, the new VRE API should be an improvement in that regard, not a regression.

Now regarding the original problem, the way you initially tried to solve it in 9a2fa48e23a0b21d6109cc48897e997f6af29cbb is probably better if you:

  • aren't going to query the number of groups in the first place (outside of VRE)
  • want to pass an arbitrary number of txt (venturing in undefined behavior land abve 10 groups)
  • plan to react to a capture overflow after the facts

Now that I have a better understanding, I think the original patch is better. That makes VRE_match() a VRE_capture() with less arguments, which is still an improvement in my mind.

dridi avatar Nov 17 '21 12:11 dridi

Related:

Working on a recent addition to the re vmod, the question arose if the VRE API could be changed to allow this use case.

Unless I overlook something, in addition to this ticket, only a JIT option argument would be needed in VRE_compile().

@Dridi I think I might also have understood a reason for PCRE2 to use offsets for captures: They are a natural choice when relocating the subject (you can just keep them unaltered if you do).

nigoroll avatar Nov 08 '22 13:11 nigoroll