uap-java

Make it possible to use RE2J instead of java.util.regex

Open alebastrov opened this issue 2 years ago • 6 comments

As far as I can see, it works if we only change the imports, so we need to create a factory for Pattern/Matcher plus adapters for each engine.
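A minimal sketch of what such a factory and adapter could look like. All type names here (`RegexFactory`, `CompiledRegex`, `RegexMatch`, `JdkRegexFactory`) are hypothetical and not part of uap-java. Only the java.util.regex adapter is shown; assuming, as the comment above suggests, that RE2J exposes the same `Pattern`/`Matcher` methods, an RE2J adapter would have the same shape with different imports.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical engine-neutral view of a compiled pattern (not part of uap-java). */
interface CompiledRegex {
    /** Creates a matcher over the given input. */
    RegexMatch matcher(CharSequence input);
}

/** Hypothetical engine-neutral view of a matcher. */
interface RegexMatch {
    boolean find();
    int groupCount();
    String group(int index);
}

/** Hypothetical factory: one implementation per regex engine. */
interface RegexFactory {
    CompiledRegex compile(String regex);
}

/** Adapter over java.util.regex; an RE2J adapter is assumed to look the same. */
final class JdkRegexFactory implements RegexFactory {
    @Override
    public CompiledRegex compile(String regex) {
        Pattern pattern = Pattern.compile(regex);
        return input -> {
            Matcher m = pattern.matcher(input);
            // Delegate each call to the underlying java.util.regex.Matcher.
            return new RegexMatch() {
                @Override public boolean find() { return m.find(); }
                @Override public int groupCount() { return m.groupCount(); }
                @Override public String group(int index) { return m.group(index); }
            };
        };
    }
}

final class RegexFacadeDemo {
    public static void main(String[] args) {
        RegexFactory factory = new JdkRegexFactory();
        CompiledRegex regex = factory.compile("(Firefox)/(\\d+)\\.(\\d+)");
        RegexMatch match = regex.matcher("Gecko/20100101 Firefox/110.0");
        if (match.find()) {
            // prints "Firefox 110.0"
            System.out.println(match.group(1) + " " + match.group(2) + "." + match.group(3));
        }
    }
}
```

With this shape, the parsing code would depend only on the three interfaces, and the choice of engine would be confined to whichever `RegexFactory` is constructed.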

alebastrov avatar Feb 09 '23 14:02 alebastrov

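<dependency>
    <groupId>com.google.re2j</groupId>
    <artifactId>re2j</artifactId>
    <version>1.7</version>
    <scope>runtime</scope>
</dependency>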

alebastrov avatar Feb 09 '23 14:02 alebastrov

Can you provide more information about the purpose of this feature request?

Are you concerned about the runtime performance of matching user-agent strings? Or is it about reducing startup time? Or does RE2J support regex patterns that java.util.regex doesn't?

I'm a little hesitant to give uap-java a hard dependency on re2j, since it would require all users to pull in another lib (which may have its own transitive dependencies, although honestly I haven't looked deep enough to see whether re2j depends on anything).

How would you envision this working? Would it be like a Java service provider/implementation, whereby the user adds re2j to the classpath and the regex engine is specified by name at runtime? That might get a little complicated, because it would likely require a wrapper around re2j that follows the Java service provider spec so it could be plugged in.
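A rough sketch of the service-provider idea, using `java.util.ServiceLoader`. The interface `RegexEngineProvider`, the lookup class `RegexEngines`, and the engine names are assumptions made for illustration, not existing uap-java API. An RE2J-backed provider would have to ship in a separate optional artifact and register itself under `META-INF/services`; it is not shown here, so only the built-in JDK engine resolves.

```java
import java.util.ServiceLoader;
import java.util.function.Predicate;
import java.util.regex.Pattern;

/** Hypothetical service interface that each regex engine wrapper would implement. */
interface RegexEngineProvider {
    /** Name used to select the engine at runtime, e.g. "jdk". */
    String name();

    /** Compiles the regex into a predicate that is true when the regex is found in the input. */
    Predicate<String> compile(String regex);
}

/** Built-in provider backed by java.util.regex, so no extra dependency is required. */
final class JdkRegexEngineProvider implements RegexEngineProvider {
    @Override public String name() { return "jdk"; }

    @Override public Predicate<String> compile(String regex) {
        return Pattern.compile(regex).asPredicate();
    }
}

final class RegexEngines {
    private RegexEngines() {}

    /** Looks up an engine by name among registered providers, then the built-in one. */
    static RegexEngineProvider byName(String name) {
        // Providers registered via META-INF/services (none in this sketch) are checked first.
        for (RegexEngineProvider provider : ServiceLoader.load(RegexEngineProvider.class)) {
            if (provider.name().equals(name)) {
                return provider;
            }
        }
        RegexEngineProvider builtin = new JdkRegexEngineProvider();
        if (builtin.name().equals(name)) {
            return builtin;
        }
        throw new IllegalArgumentException("No regex engine named '" + name + "' on the classpath");
    }

    public static void main(String[] args) {
        Predicate<String> isFirefox = byName("jdk").compile("Firefox/\\d+");
        System.out.println(isFirefox.test("Gecko/20100101 Firefox/110.0"));
    }
}
```

The trade-off raised above still applies: the core library keeps no hard dependency on re2j, at the cost of maintaining a small wrapper module per engine.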

bpossolo avatar Feb 26 '23 19:02 bpossolo

Hi, I'm concerned about the runtime performance of matching user-agent strings. The regular expression syntax accepted by RE2 is a subset of that accepted by PCRE, and I believe your regexes don't use any features that RE2 lacks. Unlike PCRE, RE2 matches in O(n) time in the length of the input (i.e. each input character is checked only once, with no backtracking). I think creating a common interface facade for the PCRE-like engine (java.util.regex) and RE2 would be enough.

The page https://swtch.com/~rsc/regexp/regexp3.html#caveats describes the features that are not supported: lookahead and lookbehind assertions, backreferences, the atomic grouping operator (?>...), and possessive quantifiers such as ++.
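To check whether an existing pattern set relies on any of these features, a heuristic scan along the following lines could be used. This is only a sketch: the class name `Re2CompatibilityCheck` is hypothetical, and the detection works on the raw pattern text, so escaped sequences (for example a literal backslash followed by a digit) can produce false positives. Compiling each pattern with the target engine would be the more reliable check.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Pattern;

/** Hypothetical helper: flags regex constructs listed above as unsupported by RE2. */
final class Re2CompatibilityCheck {
    // Feature name -> detector over the raw pattern text (heuristic, not a real parser).
    private static final Map<String, Pattern> UNSUPPORTED = new LinkedHashMap<>();
    static {
        UNSUPPORTED.put("lookbehind", Pattern.compile("\\(\\?<[!=]"));
        UNSUPPORTED.put("lookahead", Pattern.compile("\\(\\?[!=]"));
        UNSUPPORTED.put("atomic group", Pattern.compile("\\(\\?>"));
        UNSUPPORTED.put("backreference", Pattern.compile("\\\\[1-9]"));
        UNSUPPORTED.put("possessive quantifier", Pattern.compile("[*+?}]\\+"));
    }

    private Re2CompatibilityCheck() {}

    /** Returns the names of the unsupported features that appear in the given regex. */
    static List<String> unsupportedFeatures(String regex) {
        List<String> found = new ArrayList<>();
        for (Map.Entry<String, Pattern> entry : UNSUPPORTED.entrySet()) {
            if (entry.getValue().matcher(regex).find()) {
                found.add(entry.getKey());
            }
        }
        return found;
    }

    public static void main(String[] args) {
        // prints "[]"
        System.out.println(unsupportedFeatures("(Firefox)/(\\d+)\\.(\\d+)"));
        // prints "[lookbehind]"
        System.out.println(unsupportedFeatures("(?<!Opera )Chrome/(\\d+)"));
        // prints "[backreference]"
        System.out.println(unsupportedFeatures("(a+)\\1"));
    }
}
```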

The main argument for it is that RE2 provides stronger guarantees on execution time than, and enables high-level analyses that would be difficult or impossible with, ad hoc implementations.

alebastrov avatar Mar 08 '23 09:03 alebastrov

Hm I see

Object.keys(regexes).forEach(function (parser) {
  suite(`no reverse lookup in ${parser}`, function () {
    regexes[parser].forEach(function (item) {
      test(item.regex, function () {
        if (/\(\?<[!=]/.test(item.regex)) {
          assert.ok(false, 'go parser does not support regex lookbehind. See https://github.com/google/re2/wiki/Syntax')
        }
        if (/\(\?[!=]/.test(item.regex)) {
          assert.ok(false, 'go parser does not support regex lookahead. See https://github.com/google/re2/wiki/Syntax')
        }
      })
    })
  })
})

Does it mean that RE2 is already implemented?

alebastrov avatar Mar 08 '23 09:03 alebastrov

> Does it mean that RE2 is already implemented?

The code you referenced is a JavaScript unit test in the other repo, uap-core. I don't know why they're checking for entries that are unsupported by the Go runtime.

bpossolo avatar Mar 08 '23 12:03 bpossolo

A few years ago I experimented with different regex libraries, using JMH to check their performance. It might be interesting to run the same comparison against the regexes found in the patterns database. See https://github.com/fbacchella/RegexPerf for the code.

fbacchella avatar Oct 01 '25 09:10 fbacchella