cpython icon indicating copy to clipboard operation
cpython copied to clipboard

Improve Windows performance of `pathlib.Path.is_file` and friends

Open mdboom opened this issue 3 years ago • 10 comments

Feature or enhancement

We could implement / reuse the same fast paths that we did for os.path.* in #101324 for pathlib.Path.*. It will require a little more work, since the pathlib.Path API can return OSError under certain circumstances, whereas os.path would return False in those cases.

Pitch

We saw a 12-25% speedup doing this for os.path. It would be unfortunate if people chose the lower-level API for performance reasons because pathlib.Path didn't match it in performance.

Linked PRs

  • gh-101361
  • gh-118243

mdboom avatar Jan 26 '23 22:01 mdboom

It would be a shame to lose the generic implementations in pathlib -- they'll be useful in a future AbstractPath class where stat() is abstract.

barneygale avatar Jan 26 '23 23:01 barneygale

@barneygale, wouldn't this enhancement entail WindowsPath overriding the base Path implementations?

eryksun avatar Jan 26 '23 23:01 eryksun

We support subclassing as of #31691, which means that a user on Windows can do this:

class MyClass(pathlib.Path):
    pass

P = MyClass()

... and expect that p works exactly as if it was a WindowsPath. But it's not actually a subclass of WindowsPath - its behaviour is controlled entirely by the value of its _flavour attribute.

Hence I'm hesistant to add things to WindowsPath and PosixPath.

barneygale avatar Jan 27 '23 00:01 barneygale

We can do like what you implemented for os.path.realpath(). Add a strict=False keyword-only parameter to exists(), isfile(), isdir(), islink(), and isjunction(). In strict mode it ignores only a select set of errors. Move the definition of _ignore_error() into ntpath and posixpath, with _IGNORED_ERRNOS defined in genericpath.

eryksun avatar Jan 27 '23 00:01 eryksun

:+1: that sounds good to me! Just as long as we adjust the pathlib.Path code and not WindowsPath specifically if that's OK. Hopefully my draft PR illustrates what I mean here.

barneygale avatar Jan 27 '23 00:01 barneygale

Should nt._exists() and the others be modified to raise an exception that gets handled by a wrapper function such as ntpath.exists()? Or should we have a list of ignored error codes in the C implementation that's kept in sync with _IGNORED_ERRNOS and _IGNORED_WINERRORS? I guess it would help to benchmark the performance cost of raising and handling an exception for common errors that should always be handled by returning False, such as ERROR_FILE_NOT_FOUND and ERROR_PATH_NOT_FOUND, compared to simply returning False from the builtin function.

This should be the complete set of ignored Windows error codes:

// ENOENT
ERROR_FILE_NOT_FOUND
ERROR_PATH_NOT_FOUND
ERROR_INVALID_DRIVE
ERROR_NO_MORE_FILES
ERROR_BAD_NETPATH
ERROR_BAD_NET_NAME
ERROR_BAD_PATHNAME
ERROR_FILENAME_EXCED_RANGE

// ENOTDIR
ERROR_DIRECTORY

// EBADF
ERROR_INVALID_HANDLE
ERROR_INVALID_TARGET_HANDLE
ERROR_DIRECT_ACCESS_HANDLE

// ELOOP
ERROR_CANT_RESOLVE_FILENAME

// others
ERROR_NOT_READY
ERROR_INVALID_NAME

eryksun avatar Jan 27 '23 01:01 eryksun

I wouldn't mind if the is_*() methods were changed to suppress all OSErrors. The current selection is undocumented and fairly arbitrary. Users can call stat() if they need an exception to be raised.

barneygale avatar Jan 26 '24 18:01 barneygale

Oh, but we'd need to add optional follow_symlinks arguments to os.path.exists(), isfile() and isdir(), in order to support the equivalent arguments in pathlib methods. Or we'd need to add an islink() check to each of those pathlib methods, which would slow things back down.

barneygale avatar Jan 26 '24 18:01 barneygale

Or switch function called in the pathlib methods based on that parameter.

In newer versions of Windows, stat is going to be almost as fast as the "fast path" functions (and only because we updated the fast path functions to use the faster stat call, or else they'd be slower), so eventually it won't matter too much which call is used. Not following symlinks should always be faster anyway.

zooba avatar Jan 29 '24 13:01 zooba

PR: #118243

barneygale avatar May 08 '24 20:05 barneygale

Resolved in 3.14 / fbe6a09 / #118243. Thanks everyone.

barneygale avatar May 14 '24 17:05 barneygale

Apologize for my confusion, it has been a while since I followed this ticket. I was wondering if someone could clarify a little bit about the situation.

From what I can see, the closing commit/PR only seems to change behaviors about OSErrors. Would it still improve the performance of pathlib.Path.is_file (i.e. what this ticket is about)?

@zooba's comment seems to imply that the improvement was already made at other places, but needing "newer version of Windows".

In other words, if I upgraded to 3.14 in future, but have to use "older" version of Windows (for example. Win10), would I still get any notifiable performance improvement on pathlib.Path.is_file after this patch?

Thanks in advance!

fireattack avatar May 14 '24 18:05 fireattack

From what I can see, the closing commit/PR only seems to change behaviors about OSErrors. Would it still improve the performance of pathlib.Path.is_file (i.e. what this ticket is about)?

The change of OSError handling means the pathlib methods are now compatible with the os.path functions, and so we now call those functions:

https://github.com/python/cpython/blob/fbe6a0988ff08aef29c4649527d5d620d77ca4a2/Lib/pathlib/_local.py#L505-L549

barneygale avatar May 14 '24 18:05 barneygale

Thank you, now I understand.

fireattack avatar May 14 '24 18:05 fireattack