Add C accelerator for `fnmatch`
Enhancement
By adding a C accelerator, it is possible to improve the running time of fnmatch.filter and fnmatch.translate (the previous proposal is no longer correct since fnmatch(3) does not produce the same results as Python) by 10% and 50% respectively.
fnmatch.translate
With --with-pydebug
+------------------------------------------+-----------------------+-----------------------+
| Benchmark | fnmatch-translate-ref | fnmatch-translate-c |
+==========================================+=======================+=======================+
| abc/[!]b-ac-z9-1]/def/\*?/*/**c/?*[][!] | 21.0 us | 12.0 us: 1.74x faster |
+------------------------------------------+-----------------------+-----------------------+
| !abc/[!]b-ac-z9-1]/def/\*?/*/**c/?*[][!] | 22.0 us | 12.3 us: 1.78x faster |
+------------------------------------------+-----------------------+-----------------------+
| a**?**cd**?**??k*** | 8.32 us | 2.97 us: 2.81x faster |
+------------------------------------------+-----------------------+-----------------------+
| a/**/b/**/c | 7.41 us | 4.03 us: 1.84x faster |
+------------------------------------------+-----------------------+-----------------------+
| man/man1/bash.1 | 11.1 us | 7.24 us: 1.53x faster |
+------------------------------------------+-----------------------+-----------------------+
| Geometric mean | (ref) | 1.90x faster |
+------------------------------------------+-----------------------+-----------------------+
Without --with-pydebug
+------------------------------------------+-----------------------+-----------------------+
| Benchmark | fnmatch-translate-ref | fnmatch-translate-c |
+==========================================+=======================+=======================+
| abc/[!]b-ac-z9-1]/def/\*?/*/**c/?*[][!] | 6.88 us | 4.30 us: 1.60x faster |
+------------------------------------------+-----------------------+-----------------------+
| !abc/[!]b-ac-z9-1]/def/\*?/*/**c/?*[][!] | 7.12 us | 4.43 us: 1.61x faster |
+------------------------------------------+-----------------------+-----------------------+
| a**?**cd**?**??k*** | 2.47 us | 912 ns: 2.71x faster |
+------------------------------------------+-----------------------+-----------------------+
| a/**/b/**/c | 2.26 us | 1.28 us: 1.76x faster |
+------------------------------------------+-----------------------+-----------------------+
| man/man1/bash.1 | 3.46 us | 2.35 us: 1.48x faster |
+------------------------------------------+-----------------------+-----------------------+
| Geometric mean | (ref) | 1.78x faster |
+------------------------------------------+-----------------------+-----------------------+
With --enable-optimizations
+------------------------------------------+-----------------------+-----------------------+
| Benchmark | fnmatch-translate-ref | fnmatch-translate-c |
+==========================================+=======================+=======================+
| abc/[!]b-ac-z9-1]/def/\*?/*/**c/?*[][!] | 6.30 us | 3.80 us: 1.66x faster |
+------------------------------------------+-----------------------+-----------------------+
| !abc/[!]b-ac-z9-1]/def/\*?/*/**c/?*[][!] | 6.45 us | 3.99 us: 1.62x faster |
+------------------------------------------+-----------------------+-----------------------+
| a**?**cd**?**??k*** | 2.17 us | 812 ns: 2.67x faster |
+------------------------------------------+-----------------------+-----------------------+
| a/**/b/**/c | 1.98 us | 1.15 us: 1.72x faster |
+------------------------------------------+-----------------------+-----------------------+
| man/man1/bash.1 | 3.09 us | 2.12 us: 1.45x faster |
+------------------------------------------+-----------------------+-----------------------+
| Geometric mean | (ref) | 1.78x faster |
+------------------------------------------+-----------------------+-----------------------+
While the two first patterns are somewhat excentric, the last two patterns are much more interesting and much more common. In particular, fnmatch.translate can be 1.5x faster.
Benchmark script
import pyperf
import _fnmatch
import fnmatch
import os
from fnmatch import _translate, _join_translated_parts
def translate_py(pattern):
STAR = object()
parts = _translate(pattern, STAR, '.')
return _join_translated_parts(parts, STAR)
def translate_ref(loops, pattern):
__ = range(loops)
t0 = pyperf.perf_counter()
for _ in __:
translated = translate_py(pattern)
return pyperf.perf_counter() - t0
def translate_c(loops, pattern):
__ = range(loops)
t0 = pyperf.perf_counter()
for _ in __:
res = _fnmatch.translate(pattern)
return pyperf.perf_counter() - t0
PATTERNS = [
'abc/[!]b-ac-z9-1]/def/\\*?/*/**c/?*[][!]',
'!abc/[!]b-ac-z9-1]/def/\\*?/*/**c/?*[][!]',
'a**?**cd**?**??k***',
'a/**/b/**/c',
'man/man1/bash.1'
]
def bench(runner, func):
for pattern in PATTERNS:
runner.bench_time_func(pattern, func, pattern)
def add_cmdline_args(cmd, args):
cmd.append(args.impl)
if __name__ == '__main__':
runner = pyperf.Runner(add_cmdline_args=add_cmdline_args)
runner.argparser.add_argument('impl', choices=['ref', 'c'])
args = runner.parse_args()
if args.impl == 'ref':
bench(runner, translate_ref)
elif args.impl == 'c':
bench(runner, translate_c)
fnmatch.filter
For simplicity, I removed the calls to os.path.normcase so that we may know which parts can be accelerated and at what cost. Since fnmatch.translate has not call to normcase, it's not an issue but fnmatch.fnmatch and fnmatch.filter have calls to os.path.normcase beforehand. Thus, the timings that I really did were timings for POSIX platforms and for fnmatch.fnmatchcase.
With --with-pydebug
+-------------------+--------------------+-----------------------+-----------------------+
| Benchmark | fnmatch-filter-ref | fnmatch-filter-py | fnmatch-filter-c |
+===================+====================+=======================+=======================+
| small_match_none | 1.12 us | 1.41 us: 1.26x slower | 1.15 us: 1.03x slower |
+-------------------+--------------------+-----------------------+-----------------------+
| small_match_some | 1.48 us | 1.60 us: 1.08x slower | 1.45 us: 1.02x faster |
+-------------------+--------------------+-----------------------+-----------------------+
| small_match_all | 1.69 us | 1.67 us: 1.01x faster | 1.61 us: 1.05x faster |
+-------------------+--------------------+-----------------------+-----------------------+
| medium_match_none | 9.09 us | 8.39 us: 1.08x faster | 8.98 us: 1.01x faster |
+-------------------+--------------------+-----------------------+-----------------------+
| medium_match_some | 12.6 us | 10.9 us: 1.16x faster | 11.4 us: 1.10x faster |
+-------------------+--------------------+-----------------------+-----------------------+
| medium_match_all | 14.7 us | 12.6 us: 1.16x faster | 13.0 us: 1.12x faster |
+-------------------+--------------------+-----------------------+-----------------------+
| large_match_none | 89.0 us | 78.3 us: 1.14x faster | 86.3 us: 1.03x faster |
+-------------------+--------------------+-----------------------+-----------------------+
| large_match_some | 123 us | 106 us: 1.16x faster | 111 us: 1.11x faster |
+-------------------+--------------------+-----------------------+-----------------------+
| large_match_all | 145 us | 118 us: 1.23x faster | 126 us: 1.15x faster |
+-------------------+--------------------+-----------------------+-----------------------+
| Geometric mean | (ref) | 1.06x faster | 1.06x faster |
+-------------------+--------------------+-----------------------+-----------------------+
Without --with-pydebug
+-------------------+--------------------+-----------------------+-----------------------+
| Benchmark | fnmatch-filter-ref | fnmatch-filter-py | fnmatch-filter-c |
+===================+====================+=======================+=======================+
| small_match_none | 320 ns | 393 ns: 1.23x slower | 315 ns: 1.01x faster |
+-------------------+--------------------+-----------------------+-----------------------+
| small_match_some | 401 ns | 447 ns: 1.11x slower | 376 ns: 1.07x faster |
+-------------------+--------------------+-----------------------+-----------------------+
| small_match_all | 449 ns | 487 ns: 1.08x slower | 426 ns: 1.05x faster |
+-------------------+--------------------+-----------------------+-----------------------+
| medium_match_none | 2.49 us | 2.40 us: 1.03x faster | 2.34 us: 1.06x faster |
+-------------------+--------------------+-----------------------+-----------------------+
| medium_match_some | 3.23 us | 2.85 us: 1.13x faster | 2.83 us: 1.14x faster |
+-------------------+--------------------+-----------------------+-----------------------+
| medium_match_all | 3.77 us | 3.48 us: 1.08x faster | 3.39 us: 1.11x faster |
+-------------------+--------------------+-----------------------+-----------------------+
| large_match_none | 25.6 us | 23.8 us: 1.08x faster | 24.2 us: 1.06x faster |
+-------------------+--------------------+-----------------------+-----------------------+
| large_match_some | 32.2 us | 27.5 us: 1.17x faster | 27.5 us: 1.17x faster |
+-------------------+--------------------+-----------------------+-----------------------+
| large_match_all | 38.0 us | 34.3 us: 1.11x faster | 33.6 us: 1.13x faster |
+-------------------+--------------------+-----------------------+-----------------------+
| Geometric mean | (ref) | 1.02x faster | 1.09x faster |
+-------------------+--------------------+-----------------------+-----------------------+
With --enable-optimizations
+-------------------+--------------------+-----------------------+-----------------------+
| Benchmark | fnmatch-filter-ref | fnmatch-filter-py | fnmatch-filter-c |
+===================+====================+=======================+=======================+
| small_match_none | 305 ns | 391 ns: 1.28x slower | 316 ns: 1.04x slower |
+-------------------+--------------------+-----------------------+-----------------------+
| small_match_some | 377 ns | 429 ns: 1.14x slower | 359 ns: 1.05x faster |
+-------------------+--------------------+-----------------------+-----------------------+
| small_match_all | 410 ns | 475 ns: 1.16x slower | 392 ns: 1.04x faster |
+-------------------+--------------------+-----------------------+-----------------------+
| medium_match_none | 2.34 us | not significant | 2.26 us: 1.04x faster |
+-------------------+--------------------+-----------------------+-----------------------+
| medium_match_some | 2.99 us | 2.82 us: 1.06x faster | 2.63 us: 1.14x faster |
+-------------------+--------------------+-----------------------+-----------------------+
| medium_match_all | 3.41 us | 3.35 us: 1.02x faster | 3.05 us: 1.12x faster |
+-------------------+--------------------+-----------------------+-----------------------+
| large_match_none | 24.1 us | 23.1 us: 1.04x faster | 22.4 us: 1.07x faster |
+-------------------+--------------------+-----------------------+-----------------------+
| large_match_some | 29.6 us | 26.4 us: 1.12x faster | 24.5 us: 1.21x faster |
+-------------------+--------------------+-----------------------+-----------------------+
| large_match_all | 33.1 us | 31.2 us: 1.06x faster | 28.6 us: 1.16x faster |
+-------------------+--------------------+-----------------------+-----------------------+
| Geometric mean | (ref) | 1.03x slower | 1.09x faster |
+-------------------+--------------------+-----------------------+-----------------------+
Benchmark script
import pyperf
import _fnmatch
import fnmatch
import os
from fnmatch import _compile_pattern
def fnmatch_ref(loops, names, pat):
__ = range(loops)
t0 = pyperf.perf_counter()
for _ in __:
result = []
match = _compile_pattern(pat)
for name in names:
if match(name):
result.append(name)
return pyperf.perf_counter() - t0
def fnmatch_py(loops, names, pat):
__ = range(loops)
t0 = pyperf.perf_counter()
for _ in __:
match = _compile_pattern(pat)
res = list(filter(match, names))
return pyperf.perf_counter() - t0
def fnmatch_c(loops, names, pat):
__ = range(loops)
t0 = pyperf.perf_counter()
for _ in __:
res = _fnmatch.filter(names, pat)
return pyperf.perf_counter() - t0
def bench(runner, func):
base_data = ['Python', 'Ruby', 'Perl', 'Tcl']
multipliers = [('small', 1), ('medium', 10), ('large', 100)]
patterns = [('none', 'A*'), ('some', 'P*'), ('all', '*')]
for data_label, data_size in multipliers:
for pat_label, pattern in patterns:
name = f'{data_label}_match_{pat_label}'
data = base_data * data_size
_compile_pattern.cache_clear()
runner.bench_time_func(name, func, data, pattern)
def add_cmdline_args(cmd, args):
cmd.append(args.impl)
if __name__ == '__main__':
runner = pyperf.Runner(add_cmdline_args=add_cmdline_args)
runner.argparser.add_argument('impl', choices=['ref', 'py', 'c'])
args = runner.parse_args()
if args.impl == 'ref':
bench(runner, fnmatch_ref)
elif args.impl == 'py':
bench(runner, fnmatch_py)
elif args.impl == 'c':
bench(runner, fnmatch_c)
The reason why the C implementation is slow in debug builds is because I tend to add assert() a bit everywhere...
Linked PRs
- gh-121446
Could you run benchmarks for more optimized python version?
normcase = os.path.normcase
os_path = os.path
def filter(names, pat):
pat = normcase(pat)
match = _compile_pattern(pat)
if os_path is posixpath:
return list(filter(match, names))
else:
ncased = map(match, map(normcase, names))
return list(itl.compress(names, ncased))
Also, don't run list on return value of filter as it already returns a list:
res = list(filter(match, names))
res = list(filter(match, names))
The 'filter' here is the built-in filter, not the fnmatch's filter. So I must use list. Your suggestion is the same as the fnmatch_py benchmark by the way. Note that the timings I shown are without any calls to normcase because I don't know how to use them in C (for now).
Also, for non-posixpath paths, I will not use them since the implementation is only available on Windows platforms. I'll first try to make the C implementation compile on the CI/CD (I can run it fine on my laptop but I didn't know how to correctly update the entire ecosystem, but I think I managed to today).
What a blunder. I did not thought that:
#include <fnmatch.h>
#include <stdio.h>
void main() {
int res = fnmatch("[a-z]", "a", 0);
printf("match: %s\n", res == FNM_NOMATCH ? "no" : "yes"); // 'yes'
res = fnmatch("[^-`]", "~", 0);
printf("match: %s\n", res == FNM_NOMATCH ? "no" : "yes"); // 'yes'
}
but in Python... ~ would not match [^-`]. By looking at the fnmatch(3) implementation, I think the character classes are only allowed to be alpha numerics.
EDIT: let's forget about fnmatch(3) because the latter actually supports [[:alpha:]] classes which the Python implementation does not.
Also, for non-posixpath paths, I will not use them since the implementation is only available on Windows platforms.
It can be run on Unix too, but is only executed on windows. You can test it on Unix and it will be a good proxy for it.
In the end everything is the same as in Python. I'll regenerate the benchmarks as well.
Implementation of ideas from #122289 at the C level gave even better performances in a release (non-PGO) build:
+------------------------------------------+-----------------------+-----------------------+
| Benchmark | fnmatch-translate-ref | fnmatch-translate-c |
+==========================================+=======================+=======================+
| abc/[!]b-ac-z9-1]/def/\*?/*/**c/?*[][!] | 6.09 us | 1.26 us: 4.84x faster |
+------------------------------------------+-----------------------+-----------------------+
| !abc/[!]b-ac-z9-1]/def/\*?/*/**c/?*[][!] | 6.40 us | 1.29 us: 4.97x faster |
+------------------------------------------+-----------------------+-----------------------+
| a**?**cd**?**??k*** | 2.17 us | 303 ns: 7.16x faster |
+------------------------------------------+-----------------------+-----------------------+
| a/**/b/**/c | 2.00 us | 286 ns: 7.00x faster |
+------------------------------------------+-----------------------+-----------------------+
| man/man1/bash.1 | 3.04 us | 459 ns: 6.63x faster |
+------------------------------------------+-----------------------+-----------------------+
| a*b*c*d*e*f*g*h*i*j*k*l*m*n*o** | 5.14 us | 685 ns: 7.50x faster |
+------------------------------------------+-----------------------+-----------------------+
| Geometric mean | (ref) | 6.26x faster |
+------------------------------------------+-----------------------+-----------------------+
And by implementing more ideas, I get in PGO builds
+------------------------------------------+-----------------------+-----------------------+
| Benchmark | fnmatch-translate-ref | fnmatch-translate-c |
+==========================================+=======================+=======================+
| abc/[!]b-ac-z9-1]/def/\*?/*/**c/?*[][!] | 6.15 us | 1.21 us: 5.07x faster |
+------------------------------------------+-----------------------+-----------------------+
| !abc/[!]b-ac-z9-1]/def/\*?/*/**c/?*[][!] | 6.45 us | 1.24 us: 5.19x faster |
+------------------------------------------+-----------------------+-----------------------+
| a**?**cd**?**??k*** | 2.15 us | 287 ns: 7.49x faster |
+------------------------------------------+-----------------------+-----------------------+
| a/**/b/**/c | 1.94 us | 269 ns: 7.21x faster |
+------------------------------------------+-----------------------+-----------------------+
| man/man1/bash.1 | 3.00 us | 427 ns: 7.03x faster |
+------------------------------------------+-----------------------+-----------------------+
| a*b*c*d*e*f*g*h*i*j*k*l*m*n*o** | 5.07 us | 636 ns: 7.98x faster |
+------------------------------------------+-----------------------+-----------------------+
| Geometric mean | (ref) | 6.56x faster |
+------------------------------------------+-----------------------+-----------------------+
We are going very fast now!
One question. Is the implementation actually 2K lines long?
Yes, but there are a lot of comments. So it's probably 1k long I think (in those 2k, there are ~400 that are clinic, 150 that are tests).
Performance results look good. Hopefully code can be simplified though.
The code could be simplified by removing the implementation of filter, fnmatch and fnmatchcase and only keep translate in C. The speedy implementation is only 500 lines long (translate.c). And there are some parts that I needed to C/C from unicodeobject.c since the latter does not export unicode_char.
That sounds sensible. It doesn't seem there is going to be any significant performance improvements from having any other functions written in C.
Yes, I'm working on an alternate branch that removes those bits. We can always re-add them later but yes, the performances are really not that impressive (10% gain is still good but not that good).
Following https://github.com/python/cpython/pull/123181#issuecomment-2299791373 and https://github.com/python/cpython/pull/123181#issuecomment-2301289335, I decided to close this feature. While the improvements are significant, they will not necessarily be that significant in production. I suggest we rather use https://github.com/python/cpython/pull/122289 which already bring an almost 2x improvement and is a lot easier to maintain.