Performance improvement?
Hello.
We are users of pathspec in some other project. I have a performance question.
With a long list of rules (dozens) matched against a large number of files (hundreds of thousands), match_file takes a long time. Is there any way to improve its performance?
For example, by using one big regex instead of multiple small ones.
Can you provide an example of how you're specifically performing the matches? About how long is a long time? Is it on the order of minutes, hours, or days? This will help me look into the performance issue.

For example, it takes about 20μs per file, so about 2 seconds for 100k files. If we used one big regex and an if expression to skip the normalization on UNIX systems, it could be around 100ms (maybe several hundred ms for Windows users). This would greatly help the user experience in interactive tools that rely on path specifications.
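A minimal sketch of the idea above, using only the standard library. The regexes in RULES are hand-written stand-ins for compiled ignore rules, and normalize and is_ignored are hypothetical helpers, not part of pathspec. Negation rules are deliberately left out.

```python
import os
import re

# Hypothetical regexes standing in for what a rule compiler might emit.
RULES = [
    r"(?:^|.+/)[^/]*\.pyc$",
    r"(?:^|.+/)build/.*$",
    r"^docs/_build/.*$",
]

# One alternation instead of one compiled regex per rule.
COMBINED = re.compile("|".join(f"(?:{rule})" for rule in RULES))


def normalize(path, sep=os.sep):
    # Only pay for the replace when the platform separator is not "/".
    return path if sep == "/" else path.replace(sep, "/")


def is_ignored(path, sep=os.sep):
    # A single regex call per file, regardless of how many rules there are.
    return COMBINED.match(normalize(path, sep)) is not None
```

The sep parameter exists only so that both branches can be exercised on any platform; whether this reaches the timings quoted above would have to be measured.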
I checked pathspec against gitignorefile on this branch https://github.com/excitoon/3/tree/pathspec . On a big project (16188 directories, 204718 files) it is still faster:
real 0m43.853s
vs
real 0m25.885s
I'll check if I can fix it.
I made it to:
real 0m28.939s
so far. The thing is, gitignorefile's results are more precise, and if I could afford wrong results, it would be much faster.
I got a slightly better RE for the start of a pattern: (?:^|.+/) instead of ^(?:.+/)?. @cpburnz, check that out.
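A quick way to check that the two prefixes accept the same paths, at least on a handful of samples. BODY is a hypothetical pattern body chosen for illustration; only the prefix differs between the two regexes.

```python
import re

# Hypothetical pattern body; the two regexes differ only in their prefix.
BODY = r"[^/]+\.log$"
old = re.compile(r"^(?:.+/)?" + BODY)
new = re.compile(r"(?:^|.+/)" + BODY)

paths = ["a.log", "x/a.log", "x/y/a.log", "a.txt", "x/a.log/b"]

# Pair up the verdicts of both regexes for each sample path.
results = [(bool(old.match(p)), bool(new.match(p))) for p in paths]
agree = all(a == b for a, b in results)
```

This only shows agreement on these samples. Any speed difference would need to be timed separately, for example with timeit.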
Is it worth adding an actual benchmark with e.g. pytest-benchmark or asv?
Of course it is
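As a rough starting point, here is a sketch that uses the standard library's timeit rather than pytest-benchmark or asv. The rules and paths are synthetic and hypothetical, so the numbers say nothing about real repositories.

```python
import re
import timeit

# Synthetic, hypothetical rules and paths for illustration only.
RULES = [rf"(?:^|.+/)[^/]*\.ext{i}$" for i in range(30)]
SEPARATE = [re.compile(rule) for rule in RULES]
COMBINED = re.compile("|".join(f"(?:{rule})" for rule in RULES))
PATHS = [f"dir{i % 50}/sub/file{i}.ext{i % 60}" for i in range(2000)]


def match_separate(path):
    # One regex call per rule until one matches.
    return any(regex.match(path) for regex in SEPARATE)


def match_combined(path):
    # One regex call in total.
    return COMBINED.match(path) is not None


def bench(fn, repeat=3):
    # Best-of-N wall time in seconds for one pass over PATHS.
    return min(timeit.repeat(lambda: [fn(p) for p in PATHS], number=1, repeat=repeat))
```

Calling bench(match_separate) and bench(match_combined) gives two numbers to compare; a proper benchmark suite would also cover real rule files and real directory trees.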
It would be great if there was a way to combine multiple patterns from different lines into larger regexes automatically.
👀 It is possible, from my experimentation:
- for multiple normal lines, I can just or them together: pattern1|pattern2
- for negation lines, I can do this: (?!negation_regex)(?:previous_regex)
then you end up with one long pattern like (?!negation5)(?:(?!negation3)(?:pattern1|pattern2)|pattern4)
But I have a completely different implementation so idk how hard that would be for this project.
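The folding scheme described above can be sketched as follows. fold_rules is a hypothetical helper, not pathspec's API. It assumes each rule is already a regex meant to be matched from the start of the path, and it does not model gitignore's directory semantics.

```python
import re


def fold_rules(rules):
    """Fold ordered (regex, include) pairs into one regex; later rules win."""
    combined = ""
    for regex, include in rules:
        if include:
            # Normal line: or it onto whatever has been built so far.
            combined = f"{combined}|(?:{regex})" if combined else f"(?:{regex})"
        elif combined:
            # Negation line: veto everything built so far with a lookahead.
            combined = f"(?!{regex})(?:{combined})"
    # "(?!)" never matches; used when there is nothing to include.
    return re.compile(combined or r"(?!)")


# Hypothetical rule list in file order: two includes, a negation,
# another include, and a final negation.
rules = [
    (r".*\.log$", True),
    (r".*\.tmp$", True),
    (r"important\.log$", False),
    (r"build/.*$", True),
    (r"build/keep\.txt$", False),
]
folded = fold_rules(rules)
```

With these five rules the result has the same shape as the long pattern quoted above: the last negation wraps everything, while the earlier negation only guards the first two alternatives.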
I actually have 2 patterns: one that's used if the path is a directory, and one that's used if the path doesn't exist or is a file. That lets me flatten all the patterns into one. But since checking if it's a dir is comparatively slow, I also have a setting to not check and assume everything passed in is a file such that foo/ matches foo/bar but not foo even when foo is a folder.
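A minimal sketch of the two-pattern approach for the single rule foo/. The regexes, the check_dirs setting, and the is_ignored helper are all hypothetical names chosen for illustration; the commenter's actual implementation is not shown in this thread.

```python
import os
import re

# Hypothetical compiled forms of the single rule "foo/".
# Used when the path is known to be a directory: "foo" itself matches.
DIR_PATTERN = re.compile(r"(?:^|.+/)foo(?:/.*)?$")
# Used for files or nonexistent paths: only things below "foo/" match.
FILE_PATTERN = re.compile(r"(?:^|.+/)foo/.+$")


def is_ignored(path, check_dirs=True, is_dir=os.path.isdir):
    # The filesystem check is the slow part, so it can be switched off;
    # every path is then treated as a file.
    if check_dirs and is_dir(path):
        pattern = DIR_PATTERN
    else:
        pattern = FILE_PATTERN
    return pattern.match(path) is not None
```

With check_dirs=False, foo/bar is ignored but foo is not, even if foo is a folder on disk, which is the trade-off described above. The is_dir parameter is injectable so the behavior can be tested without touching the filesystem.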
I'm still working on fixing #74
for multiple normal lines, I can just or them together: pattern1|pattern2; for negation lines, I can do this: (?!negation_regex)(?:previous_regex).
I only used method 1 in another project and got a significant performance improvement.
Method 2 is something I didn't think of. In my case, I split the patterns into several groups; only patterns of the same type can be joined together.
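One way the grouping approach might look, as a sketch only. It assumes "type" means include versus negation and that only consecutive rules are joined so that order is preserved; the comment above does not say how the groups were actually defined. group_rules and is_ignored are hypothetical helpers.

```python
import re
from itertools import groupby


def group_rules(rules):
    """Join consecutive (regex, include) pairs of the same type into one regex."""
    groups = []
    for include, run in groupby(rules, key=lambda rule: rule[1]):
        joined = "|".join(f"(?:{regex})" for regex, _ in run)
        groups.append((re.compile(joined), include))
    return groups


def is_ignored(path, groups):
    # Walk the groups from last to first: the last matching group decides.
    for regex, include in reversed(groups):
        if regex.match(path):
            return include
    return False


# Hypothetical rules in file order.
rules = [
    (r".*\.log$", True),
    (r".*\.tmp$", True),
    (r"important\.log$", False),
    (r"build/.*$", True),
    (r"build/keep\.txt$", False),
]
groups = group_rules(rules)
```

Here five rules collapse into four groups, so the gain depends on how many same-type rules sit next to each other in the file.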