Improve Windows performance of `pathlib.Path.is_file` and friends
Feature or enhancement
We could implement / reuse the same fast paths that we did for os.path.* in #101324 for pathlib.Path.*. It will require a little more work, since the pathlib.Path API can return OSError under certain circumstances, whereas os.path would return False in those cases.
Pitch
We saw a 12-25% speedup doing this for os.path. It would be unfortunate if people chose the lower-level API for performance reasons because pathlib.Path didn't match it in performance.
Linked PRs
- gh-101361
- gh-118243
It would be a shame to lose the generic implementations in pathlib -- they'll be useful in a future AbstractPath class where stat() is abstract.
@barneygale, wouldn't this enhancement entail WindowsPath overriding the base Path implementations?
We support subclassing as of #31691, which means that a user on Windows can do this:
class MyClass(pathlib.Path):
pass
P = MyClass()
... and expect that p works exactly as if it was a WindowsPath. But it's not actually a subclass of WindowsPath - its behaviour is controlled entirely by the value of its _flavour attribute.
Hence I'm hesistant to add things to WindowsPath and PosixPath.
We can do like what you implemented for os.path.realpath(). Add a strict=False keyword-only parameter to exists(), isfile(), isdir(), islink(), and isjunction(). In strict mode it ignores only a select set of errors. Move the definition of _ignore_error() into ntpath and posixpath, with _IGNORED_ERRNOS defined in genericpath.
:+1: that sounds good to me! Just as long as we adjust the pathlib.Path code and not WindowsPath specifically if that's OK. Hopefully my draft PR illustrates what I mean here.
Should nt._exists() and the others be modified to raise an exception that gets handled by a wrapper function such as ntpath.exists()? Or should we have a list of ignored error codes in the C implementation that's kept in sync with _IGNORED_ERRNOS and _IGNORED_WINERRORS? I guess it would help to benchmark the performance cost of raising and handling an exception for common errors that should always be handled by returning False, such as ERROR_FILE_NOT_FOUND and ERROR_PATH_NOT_FOUND, compared to simply returning False from the builtin function.
This should be the complete set of ignored Windows error codes:
// ENOENT
ERROR_FILE_NOT_FOUND
ERROR_PATH_NOT_FOUND
ERROR_INVALID_DRIVE
ERROR_NO_MORE_FILES
ERROR_BAD_NETPATH
ERROR_BAD_NET_NAME
ERROR_BAD_PATHNAME
ERROR_FILENAME_EXCED_RANGE
// ENOTDIR
ERROR_DIRECTORY
// EBADF
ERROR_INVALID_HANDLE
ERROR_INVALID_TARGET_HANDLE
ERROR_DIRECT_ACCESS_HANDLE
// ELOOP
ERROR_CANT_RESOLVE_FILENAME
// others
ERROR_NOT_READY
ERROR_INVALID_NAME
I wouldn't mind if the is_*() methods were changed to suppress all OSErrors. The current selection is undocumented and fairly arbitrary. Users can call stat() if they need an exception to be raised.
Oh, but we'd need to add optional follow_symlinks arguments to os.path.exists(), isfile() and isdir(), in order to support the equivalent arguments in pathlib methods. Or we'd need to add an islink() check to each of those pathlib methods, which would slow things back down.
Or switch function called in the pathlib methods based on that parameter.
In newer versions of Windows, stat is going to be almost as fast as the "fast path" functions (and only because we updated the fast path functions to use the faster stat call, or else they'd be slower), so eventually it won't matter too much which call is used. Not following symlinks should always be faster anyway.
PR: #118243
Resolved in 3.14 / fbe6a09 / #118243. Thanks everyone.
Apologize for my confusion, it has been a while since I followed this ticket. I was wondering if someone could clarify a little bit about the situation.
From what I can see, the closing commit/PR only seems to change behaviors about OSErrors.
Would it still improve the performance of pathlib.Path.is_file (i.e. what this ticket is about)?
@zooba's comment seems to imply that the improvement was already made at other places, but needing "newer version of Windows".
In other words, if I upgraded to 3.14 in future, but have to use "older" version of Windows (for example. Win10), would I still get any notifiable performance improvement on pathlib.Path.is_file after this patch?
Thanks in advance!
From what I can see, the closing commit/PR only seems to change behaviors about
OSErrors. Would it still improve the performance ofpathlib.Path.is_file(i.e. what this ticket is about)?
The change of OSError handling means the pathlib methods are now compatible with the os.path functions, and so we now call those functions:
https://github.com/python/cpython/blob/fbe6a0988ff08aef29c4649527d5d620d77ca4a2/Lib/pathlib/_local.py#L505-L549
Thank you, now I understand.