App-Rak icon indicating copy to clipboard operation
App-Rak copied to clipboard

Performance issue with --pattern-from

Open Zer0-Tolerance opened this issue 3 years ago • 16 comments

Hi Liz,

Another quick fix , when you're using the --pattern-from with regex you're doing a loop with each one of the pattern rather than compiling a single regex with all pattern with | this creates a huge performance penalty because you checking the same file over and over based on the number of line in the input file.
Based on my test on a large CSV 93 MB it takes 294 secs against a 99 regex pattern file vs 84 sec if you convert this as a regex with |. Can you fix this ?

Zer0-Tolerance avatar Nov 24 '22 18:11 Zer0-Tolerance

Just not quite a quick fix, I'm afraid.

OOC, do any of those patterns consist of just looking for a literal string?

lizmat avatar Nov 24 '22 19:11 lizmat

A first stab at merging consecutive regexes into a single regex for multiple patterns

lizmat avatar Nov 24 '22 21:11 lizmat

0.2.14 now in the ecosystem. Please let me know how this affected the performance!

lizmat avatar Nov 24 '22 21:11 lizmat

Great I'll test again and let you know ? All patterns are regex.

Zer0-Tolerance avatar Nov 25 '22 08:11 Zer0-Tolerance

ok with the latest version it's better than before 129 sec now (as opposed to 284 sec) for the list of pattern vs 81 sec with the single regex with |.
This is the cmd I'm using: time rak --ignore-case --patterns-from=/tmp/file.txt /path/file.csv

Zer0-Tolerance avatar Nov 25 '22 08:11 Zer0-Tolerance

ok, that's better. By any chance, are one or more of the regexes literal? Consisting of just / foo / ? Because those are currently converted internally to *.contains("foo"), and thus create multiple regexes, instead of just a single one. Which would explain the remaining difference

lizmat avatar Nov 25 '22 08:11 lizmat

Sorry I forgot to add the type to the cmd so no pattern is interpreted as literal in my list right ? time rak --ignore-case --type=regex --patterns-from=/tmp/file.txt /path/file.csv

Zer0-Tolerance avatar Nov 25 '22 08:11 Zer0-Tolerance

With a literal regex I mean / foo / where "foo" consists of only alphanumeric characters. I doubt that if you don't change the file, and then do a --type=regex that it would produce the right results, or even run at all??

lizmat avatar Nov 25 '22 08:11 lizmat

my patterns consist of IP addresses , filenames.

Zer0-Tolerance avatar Nov 25 '22 10:11 Zer0-Tolerance

Could you give me an example of the IP address and a filename?

lizmat avatar Nov 25 '22 15:11 lizmat

sure "8.8.8.8" and "test.exe" for example.

Zer0-Tolerance avatar Nov 25 '22 15:11 Zer0-Tolerance

When you have 8.8.8.8 in your regex, you actually mean the string "8.8.8.8", right? But "8a8b8c8" you would not want to match, right?

If that is so, why are you using regexes? Why not just the string "8.8.8.8" (which would do a much cheaper '.contains("8.8.8.8")' on the targets.

lizmat avatar Nov 25 '22 16:11 lizmat

yes but I have also tests?.exe in the same list of pattern.

Zer0-Tolerance avatar Nov 26 '22 00:11 Zer0-Tolerance

ok, then I suggest grouping the ones with wildcards together, and see how that affects the performance. As the logic now will group together any consecutive regexes into a single regex

lizmat avatar Nov 27 '22 11:11 lizmat

FWIW, I'm keeping this one open until it's time to rewrite the expression parsing using RakuAST, and optimize from there

lizmat avatar Jul 29 '24 12:07 lizmat

This has now happened with 0.3.4 https://raku.land/zef:lizmat/App::Rak/changes?v=0.3.4 . This now builds a single Callable for all of the patterns found using RakuAST, not doing anything special for regexes yet.

Please confirm whether this makes things better or worse in your use case.

lizmat avatar Aug 17 '24 10:08 lizmat