cpython icon indicating copy to clipboard operation
cpython copied to clipboard

Suffix in pathlib is not behaving like a file extension

Open bbilly1 opened this issue 1 year ago • 7 comments

Bug report

Bug description:

According to the docs here a suffix is defined as:

The file extension of the final component, if any

But pathlib doesn't behave as expected, illustrating that on Python 3.12.4:

>>> Path("Mr. Smith resume for review").suffix
'. Smith resume for review'

This is particularly problematic for methods like with_suffix. According to the docs here that should:

Return a new path with the suffix changed. If the original path doesn’t have a suffix, the new suffix is appended instead. If the suffix is an empty string, the original suffix is removed

But as established above, that results in:

>>> Path("Mr. Smith resume for review").with_suffix(".pdf")
PosixPath('Mr.pdf')

I've characterized that as a bug as that is not working as described, but could also be a matter of documentation improvement.

I see a few ways:

  1. Implement add_suffix, or an argument to with_suffix, but I know that has been discussed before and was ultimately decided against. Also this is a matter of how suffix is defined, not how it's processed.
  2. Sanity check what a suffix can be, e.g. suffix can't have white space. But ultimately difficult without making assumption.
  3. Adjust the documentation: Clarify that suffix is not equal to file extension but something like last segment separated by a dot or similar. Maybe even with a warning that this can have unexpected behavior as showcased above.

CPython versions tested on:

3.12

Operating systems tested on:

Linux, Windows

bbilly1 avatar Jul 03 '24 18:07 bbilly1

How are you defining "file extension"?

barneygale avatar Jul 04 '24 07:07 barneygale

Common denominator I see is: "The file extension defines what kind of file it is."

Some definitions I have found:

  • Wikipedia: A filename extension, file name extension or file extension is a suffix to the name of a computer file (for example, .txt, .docx, .md). The extension indicates a characteristic of the file contents or its intended use.
  • microsoft: Windows file names have two parts separated by a period: first, the file name, and second, a three- or four-character extension that defines the file type. In expenses.xlsx, for example, the first part of the file name is expenses and the extension is xlsx.
    Extensions tell your computer which application created or can open the file and which icon to use for the file. For example, the docx extension tells your computer that Microsoft Word can open the file and to display a Word icon when you view it in File Explorer.
  • lenovo: A file extension is a three or four-letter code that appears at the end of a filename and indicates the type of file it is. For example, .txt stands for text files, .jpg stands for image files and .docx stands for Microsoft Word documents. By knowing what kind of file, it is, your computer will be able to correctly open the file using the correct program.
  • wordnik: noun A short series of letters and/or numerals at the end of a personal computer filename, used to indicate the type of file and the software that will be required to operate or open it.
  • techterms: A file extension (or simply "extension") is the suffix at the end of a filename that indicates what type of file it is. For example, in the filename "myreport.txt," the .TXT is the file extension. It indicates the file is a text document. Some other examples include .DOCX, which is used for Microsoft Word documents, and .PSD, which is the standard file extension for Photoshop documents.

bbilly1 avatar Jul 04 '24 08:07 bbilly1

Common denominator I see is: "The file extension defines what kind of file it is."

That's only true on Windows. On other operating systems, file extensions are an indicator and nothing more. I can rename an .mp3 file to .jpg on Linux and play it just fine.

Microsoft's definition ("three- or four-character extension") excludes some valid extensions like .a, .so and .patch, but fails to exclude file extensions with spaces such as . Smith resume for review

barneygale avatar Jul 04 '24 08:07 barneygale

The point I'm getting at is that there is no standard definition of a file extension!

Since #82805 was solved, pathlib's suffix splitting works exactly like os.path.splitext(). A non-empty suffix starts with a dot and contains at most one dot, and a non-empty suffix must be preceded by a stem that contains at least one non-dot character.

barneygale avatar Jul 04 '24 09:07 barneygale

I get what you are saying. But even on Linux, that will depend on the implementation. E.g. xdg-open will happily ignore the file extension. But common GUI file browsers will not (tested on Thunar).

As there is no standard definition, I'd suggest to either define it, or avoid using the term all together and use the implementation as definition, e.g. last segment separated by a dot.

Even though it's not authoritatively defined, I'd argue, the above with_suffix example is unexpected behavior.

bbilly1 avatar Jul 04 '24 09:07 bbilly1

Microsoft's definition ("three- or four-character extension") excludes some valid extensions like .a, .so and .patch, but fails to exclude file extensions with spaces such as . Smith resume for review

The Windows shell API currently supports permanently associating a programmatic identifier (ProgID) with any file extension that does not include white space characters and that has a length from 1 to 198 characters (not including the dot). Thus the API supports ".a", ".so", and ".patch" as normal file extensions. If a file has no extension, or if the extension is longer than 198 characters or contains white space characters, then the API displays an open-with dialog that allows opening the file with an application just once instead of setting a permanent association.

eryksun avatar Jul 04 '24 10:07 eryksun

Hum, that does give some legitimacy to the idea of forbidding whitespace in file extensions in pathlib.

barneygale avatar Jul 04 '24 11:07 barneygale

I think ideal behavior would be, that if you are trying to create a suffix with whitespace, throw an error. If you are trying to access a suffix, if there is whitespace, it's not a suffix but just a regular part of the filename.

bbilly1 avatar Jul 06 '24 15:07 bbilly1

The existing behaviour has some advantages:

  1. It's what existing users have come to expect (incumbent advantage)
  2. It's consistent with os.path.splitext()
  3. It's consistent between Windows and non-Windows paths when operating on a basename

So I'm weakly -1 on this idea, but happy to hear opinions from others.

barneygale avatar Oct 13 '24 18:10 barneygale

It's what existing users have come to expect

I respectfully disagree. I'd argue based on the with_suffix example above, this is unexpected. At least it was for me, so obviously anecdotal, but I'm an existing user. :-)

But yes, as an experienced developer you and I know about that pitfall, so we can account for it. Probably avoid using with_suffix and use string concatenation to append the file extension. Or check for any existing . dots in the filename before using with_suffix.

As with with_suffix it is super easy to shoot yourself into the foot. Bad code to illustrate my point:

from pathlib import Path

files = [
    Path("Mr. Smith resume for review v1"),
    Path("Mr. Smith resume for review v2"),
    Path("Mr. Smith resume for review v3"),
]

for path in files:
    path.rename(path.with_suffix(".pdf"))

Now you'll have deleted/overwritten v1 and v2 and only have v3 that is now called Mr.pdf.

But I just checked, it looks like the docs were updated in the mean time: https://github.com/python/cpython/pull/106650.

So that wording is much better and describes the behavior more accurately.

bbilly1 avatar Oct 14 '24 03:10 bbilly1

I have encountered the same issue with pathlib.PurePath incorrectly handling stem and suffix in cases where the filename includes a dot followed by spaces, which is not a valid file extension (or at least should not be). Here's an example:

# Python 3.13.1

from pathlib import PurePath

path = PurePath("Nirvana/Nevermind/01. Smells Like Teen Spirit")

print(path.name)
# Output: 01. Smells Like Teen Spirit

print(path.stem)
# Output: 01
# Expected: 01. Smells Like Teen Spirit

print(path.suffix)
# Output: . Smells Like Teen Spirit
# Expected: '' (empty string)

print(path.with_suffix(".flac"))
# Output: Nirvana\Nevermind\01.flac
# Expected: Nirvana\Nevermind\01. Smells Like Teen Spirit.flac

While suffix method works for the majority of cases, it fails in edge cases like this one, where the "extension" includes spaces, which should not be allowed.

Dealing with the problem:

from pathlib import PurePath

path = PurePath("Nirvana/Nevermind")

# add file name
path /= "01. Smell Like Teen Spirit"

# can not set file extension like this
# path += ".flac"

# this creates new dir /.flac
# path /= ".flac"

# this outputs Nirvana\Nevermind\01.flac
# path.with_suffix(".flac")

# need to use full name with extension
path.with_name("01. Smell Like Teen Spirit.flac")

It would be ideal if pathlib could be updated to handle these edge cases more intuitively, especially when the "extension" contains spaces.

There is cool take about file extensions on stackoverflow.

We should redefine current extension definition to match this case or provide better methods for path.

// edit Workaround example in my project. It works but is kinda weird.

oskvr37 avatar Jan 27 '25 14:01 oskvr37

@zooba Eryk noted above that the Windows shell API doesn't allow spaces in file extensions, nor single-dot extensions. Do you think it's worth changing the behaviour of PureWindowsPath.stem/suffix to match this? If so, should we also change ntpath.splitext()? Perhaps even the POSIX implementation? Thanks!

barneygale avatar Apr 10 '25 23:04 barneygale

I haven't thought through yet, but I think this is okay provided scenarios that people may already be doing to work around it aren't affected (e.g. if someone is checking p.suffix.startswith(". ") to choose between p.name + ".ext" or p.with_suffix(".ext") then the updated behaviour would still be correct, even though strictly speaking the behaviour changes).

Enumerating as many of these scenarios as we can think of will help decide whether it's a safe enough change.

(And I hope it is. Certainly a leading space on an extension is a clear indicator that it's not intended as an extension, and perhaps spaces anywhere is also a good enough case.)

zooba avatar Apr 11 '25 11:04 zooba