python-pathspec Performance improvement?

Hello.

We are users of pathspec in some other project. I have a performance question.

When a long list of rules (dozens) is matched against a large number of files (hundreds of thousands), match_file takes a long time. Is there any way to improve its performance? For example, by using one big regex instead of multiple small ones.

May 31 '20 10:05 karajan1001

Can you provide an example of how you're specifically performing the matches? About how long is a long time? Is it on the order of minutes, hours, or days? This will help me look into the performance issue.

Jun 08 '20 00:06 cpburnz

For example, it takes about 20 μs per file, so about 2 seconds for 100k files. If we used one big regex and an if check to skip the normalization on UNIX systems, it could be about 100 ms (maybe several hundred ms for Windows users). This would greatly improve the user experience of interactive tools that rely on path specifications.
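
A rough sketch of that idea, not pathspec's actual implementation: the regexes below are hand-written, hypothetical stand-ins for compiled ignore patterns, joined into one alternation, and path separators are only rewritten when os.sep is not "/". It assumes there are no negation (!) rules, since a plain alternation cannot express them.

```python
import os
import re

# Hypothetical regexes standing in for compiled ignore patterns (assumption).
PATTERN_REGEXES = [
    r"(?:.+/)?[^/]*\.pyc$",
    r"(?:.+/)?__pycache__/.*$",
    r"build/.*$",
]

# One compiled alternation instead of one compiled regex per rule.
COMBINED = re.compile("|".join(f"(?:{regex})" for regex in PATTERN_REGEXES))
SEPARATE = [re.compile(regex) for regex in PATTERN_REGEXES]

# Decide once whether separators need rewriting on this platform.
NEEDS_NORMALIZATION = os.sep != "/"


def normalize(path):
    """Convert native separators to "/" only where they differ."""
    return path.replace(os.sep, "/") if NEEDS_NORMALIZATION else path


def match_combined(path):
    """Single regex call per path."""
    return COMBINED.match(normalize(path)) is not None


def match_separate(path):
    """Baseline: try each rule's regex in turn."""
    path = normalize(path)
    return any(regex.match(path) for regex in SEPARATE)
```

Both functions should return the same answers for include-only rules; whether the combined form is actually faster depends on the patterns and needs to be measured.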

Jun 16 '20 02:06 karajan1001

I checked pathspec against gitignorefile on this branch: https://github.com/excitoon/3/tree/pathspec . On a big project (16188 directories, 204718 files) it is still faster:

real	0m43.853s

vs

real	0m25.885s

I'll check if I can fix it.

Aug 28 '22 06:08 excitoon

I made it to:

real	0m28.939s

so far. The thing is, gitignorefile's results are more precise, and if I could afford wrong results, it would be much faster.

Aug 28 '22 08:08 excitoon

I got a slightly better RE for the start of a pattern: (?:^|.+/) instead of ^(?:.+/)? . @cpburnz, check that out.
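
A quick way to check that the two prefixes agree, using only the standard library re and timeit modules. The node_modules tail and the sample paths are made up for illustration, and the timings are printed rather than assumed, since they depend on the patterns and the Python version.

```python
import re
import timeit

# Hypothetical pattern tail; only the prefix differs between the two regexes.
TAIL = r"node_modules(?:/.*)?$"
ORIGINAL = re.compile(r"^(?:.+/)?" + TAIL)
PROPOSED = re.compile(r"(?:^|.+/)" + TAIL)

PATHS = [
    "node_modules",
    "node_modules/pkg/index.js",
    "a/b/node_modules",
    "a/b/node_modules/pkg/index.js",
    "a/b/not_node_modules",
    "src/main.py",
]


def results(regex):
    """Match every sample path from its start and record hit or miss."""
    return [regex.match(path) is not None for path in PATHS]


def compare_timings(number=10000):
    """Print rough timings for both prefixes on the sample paths."""
    for name, regex in (("original", ORIGINAL), ("proposed", PROPOSED)):
        seconds = timeit.timeit(lambda: results(regex), number=number)
        print(f"{name}: {seconds:.3f}s for {number} rounds")


if __name__ == "__main__":
    assert results(ORIGINAL) == results(PROPOSED)
    compare_timings()
```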

Aug 28 '22 18:08 excitoon

Is it worth adding an actual benchmark with e.g. pytest-benchmark (https://pypi.org/project/pytest-benchmark/) or asv (https://asv.readthedocs.io/en/stable/)?

Sep 02 '22 16:09 bollwyvl

Of course it is


Sep 02 '22 16:09 excitoon

It would be great if there was a way to combine multiple patterns from different lines into larger regexes automatically.

Feb 17 '23 03:02 Dobatymo

It would be great if there was a way to combine multiple patterns from different lines into larger regexes automatically.

👀 It is possible, from my experimentation:

for multiple normal lines, I can just OR them together: pattern1|pattern2
for negation lines, I can wrap everything so far in a negative lookahead: (?!negation_regex)(?:previous_regex).

then you end up with one long pattern like (?!negation5)(?:(?!negation3)(?:pattern1|pattern2)|pattern4)
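
The folding described above can be sketched roughly like this. It is an illustration under assumptions, not the code of either project: each rule is a hypothetical (regex, is_negation) pair, every regex is assumed to describe the whole path (anchored with $ and matched from the start), and fold_rules and last_match_wins are made-up names. The reference function applies the usual "last matching rule wins" logic so the two can be compared.

```python
import re


def fold_rules(rules):
    """Fold ordered (regex, is_negation) rules into one regex string."""
    combined = ""
    for regex, is_negation in rules:
        if is_negation:
            # A negation vetoes everything accumulated before it.
            if combined:
                combined = f"(?!{regex})(?:{combined})"
        elif combined:
            # A normal rule is one more alternative.
            combined = f"{combined}|(?:{regex})"
        else:
            combined = f"(?:{regex})"
    return combined


def last_match_wins(rules, path):
    """Reference semantics: the last rule that matches decides."""
    ignored = False
    for regex, is_negation in rules:
        if re.match(regex, path):
            ignored = not is_negation
    return ignored


# Hypothetical rules in the same order as the example above:
# pattern1, pattern2, negation3, pattern4, negation5.
RULES = [
    (r".*\.log$", False),
    (r".*\.tmp$", False),
    (r"important/.*$", True),
    (r"important/secret\.log$", False),
    (r".*/readme\.tmp$", True),
]
COMBINED = re.compile(fold_rules(RULES))


def is_ignored(path):
    return COMBINED.match(path) is not None
```

For these sample rules, is_ignored("important/a.log") is False because negation3 re-includes it, while is_ignored("important/secret.log") is True because pattern4 comes later.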

But I have a completely different implementation so idk how hard that would be for this project.

I actually have 2 patterns: one that's used if the path is a directory, and one that's used if the path doesn't exist or is a file. That lets me flatten all the patterns into one. But since checking if it's a dir is comparatively slow, I also have a setting to not check and assume everything passed in is a file such that foo/ matches foo/bar but not foo even when foo is a folder.
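
One possible reading of the two-pattern setup, sketched for the single rule foo/. The regexes, the is_ignored name, and the root and check_dirs parameters are all hypothetical; the real implementation may differ.

```python
import os
import re

# Hypothetical regexes for the single rule "foo/".
# Used when the path is known to be a directory: "foo" itself matches.
DIR_REGEX = re.compile(r"(?:.+/)?foo(?:/.*)?$")
# Used for files or nonexistent paths: only things under "foo/" match.
FILE_REGEX = re.compile(r"(?:.+/)?foo/.*$")


def is_ignored(path, root=".", check_dirs=True):
    """Pick the regex by path type; skip the isdir call if check_dirs is False."""
    if check_dirs and os.path.isdir(os.path.join(root, path)):
        regex = DIR_REGEX
    else:
        regex = FILE_REGEX
    return regex.match(path) is not None
```

With check_dirs=False no filesystem call is made, at the cost that foo is not matched even when it is a directory.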

I'm still working on fixing #74

Mar 12 '23 03:03 bkarstens

for multiple normal lines, I can just or them together: pattern1|pattern2 for negation lines, I can do this (?!negation_regex)(?:previous_regex).

I only used method 1 in another project and got a significant performance improvement.

Method 2 is something I didn't think of. In my case, I split the patterns into several groups, and only patterns of the same type can be joined together.

Mar 19 '23 04:03 karajan1001