
gh-51067: Add `remove()` and `repack()` to `ZipFile`

Open danny0838 opened this issue 8 months ago • 24 comments

This is a revised version of PR #103033, implementing two new methods in zipfile.ZipFile: remove() and repack(), as suggested in this comment.

Features

ZipFile.remove(zinfo_or_arcname)

  • Removes a file entry (by providing a str path or ZipInfo) from the central directory.
  • If there are multiple file entries with the same path, only one is removed when a str path is provided.
  • Returns the removed ZipInfo instance.
  • Supported in modes: 'a', 'w', 'x'.

ZipFile.repack(removed=None)

  • Physically removes stale local file entry data that is no longer referenced by the central directory.
  • Shrinks the archive file size.
  • If removed is passed (as a sequence of removed ZipInfos), only their corresponding local file entry data are removed.
  • Only supported in mode 'a'.

Rationales

Heuristics Used in repack()

Since removed entries are not cleaned up immediately when remove() is called, their header information may already be missing by the time repack() runs, so it can be technically difficult to determine whether certain stale bytes really belong to previously removed files and are safe to remove.

While local file entries begin with the magic signature PK\x03\x04, this alone is not a reliable indicator. For instance, a self-extracting ZIP file may contain executable code before the actual archive, which could coincidentally include such a signature, especially if it embeds ZIP-based content.

To safely reclaim space, repack() assumes that in a normal ZIP file, local file entries are stored consecutively:

  • File entries must not overlap.
    • If any entry’s data overlaps with the next, a BadZipFile error is raised and no changes are made.
  • There should be no extra bytes between entries (or between the last entry and the central directory):
    1. Data before the first referenced entry is removed only when it appears to be a sequence of consecutive entries with no extra following bytes; extra preceding bytes are preserved.
    2. Data between referenced entries is removed only when it appears to be a sequence of consecutive entries with no extra preceding bytes; extra following bytes are preserved.

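The steps above can be sketched over (offset, size) pairs taken from the central directory. This is an illustrative toy only; the function name and shape are mine, not the PR's, and it shows just the overlap check and the gap-bounding idea:

```python
# Toy sketch of the consecutiveness assumption (not the actual
# _ZipRepacker code): given (header_offset, entry_size) pairs from the
# central directory, overlapping entries are an error, and only the
# byte ranges between consecutive entries (and before the central
# directory) are candidates for reclaiming.
def find_stale_gaps(entries, cd_offset):
    """entries: iterable of (offset, size); cd_offset: start of the
    central directory.  Returns a list of (start, end) gap ranges."""
    gaps = []
    pos = 0
    for offset, size in sorted(entries):
        if offset < pos:
            # the PR raises BadZipFile and makes no changes in this case
            raise ValueError('overlapping entries')
        if offset > pos:
            gaps.append((pos, offset))  # candidate stale bytes
        pos = offset + size
    if pos < cd_offset:
        gaps.append((pos, cd_offset))
    return gaps
```

In the PR, each gap is additionally validated to look like a run of consecutive local file entries before being stripped; unrecognized leading or trailing bytes are preserved as described above.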
Check the doc in the source code of _ZipRepacker.repack() (which is internally called by ZipFile.repack()) for more details.

Supported Modes

There have been opinions that repacking should support modes 'w' and 'x' (e.g. https://github.com/python/cpython/issues/51067#issuecomment-1093479250).

This is NOT introduced since such modes do not truncate the file at the end of writing, and thus won't really shrink the file size after a removal. Although we could change that behavior for the existing API, further care would be needed because modes 'w' and 'x' may be used on an unseekable file, which such a change would break. OTOH, mode 'a' is not expected to work with an unseekable file, since an initial seek is made immediately when the file is opened.


  • Issue: gh-51067

📚 Documentation preview 📚: https://cpython-previews--134627.org.readthedocs.build/

danny0838 avatar May 24 '25 11:05 danny0838

Most changes to Python require a NEWS entry. Add one using the blurb_it web app or the blurb command-line tool.

If this change has little impact on Python users, wait for a maintainer to apply the skip news label instead.

bedevere-app[bot] avatar May 24 '25 11:05 bedevere-app[bot]

It probably would be better to raise an AttributeError instead of a ValueError here, since you are trying to access an attribute that a closed zipfile doesn't have.

This behavior simply resembles open() and write(), which raise a ValueError in various cases. Furthermore, there was a change away from raising RuntimeError in Python 3.6:

Changed in version 3.6: Calling open() on a closed ZipFile will raise a ValueError. Previously, a RuntimeError was raised.

Changed in version 3.6: Calling write() on a ZipFile created with mode 'r' or a closed ZipFile will raise a ValueError. Previously, a RuntimeError was raised.

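For reference, the current behavior is easy to demonstrate with the existing API:

```python
import io
import zipfile

# Since Python 3.6, calling open() or write() on a closed ZipFile
# raises ValueError rather than RuntimeError (or AttributeError).
buf = io.BytesIO()
zh = zipfile.ZipFile(buf, 'w')
zh.writestr('a.txt', b'data')
zh.close()

try:
    zh.open('a.txt')
except ValueError as exc:
    print(type(exc).__name__)  # prints "ValueError"
```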
danny0838 avatar May 24 '25 17:05 danny0838

Kindly informing @ubershmekel, @barneygale, @merwok, and @wimglenn about this PR. This should be more desirable and flexible than the previous PR, although care must be taken, as there might be a potential risk in the space-reclaiming algorithm.

The previous PR is kept open in case some folks are interested in it. Will close when either one is accepted.

danny0838 avatar May 24 '25 18:05 danny0838

Why do you say that? The description gives context to reviewers. (The problem or request should be described in the issue, but the PR description is appropriate to give details on the implementation of the solution.)

When merging, the core dev will put an appropriate commit message.

merwok avatar May 26 '25 00:05 merwok

@sharktide I understand you are trying to be helpful, but please do not instruct contributors on process, especially if that process is not written down somewhere. There's no requirement for a short PR description, in fact, the long description here is a benefit in my opinion (I would like to review but there is a lot of context to consider!).

emmatyping avatar May 26 '25 02:05 emmatyping

Sorry! I didn’t type that. I stepped away from the laptop and my friend who has a bad sense of humor but knows his way around GitHub typed that. Deleting those comments. I just found out from these emails!

sharktide avatar May 26 '25 02:05 sharktide

@danny0838 thank you for sticking with this issue! Would love to see removal support for zipfiles.

I'm a bit concerned about repack being overly restrictive. Perhaps ZipFile could track the offsets of removed entries and use that information to allow repack to handle arbitrary zips? In this scenario, repack would scan for local file header signatures and check each offset against the central directory and the list of deleted ZipInfos. If it is in neither, the data is left as is and scanning continues. If it is in the deleted list, the file is dropped; if it is in the central directory, it is kept.

emmatyping avatar May 26 '25 06:05 emmatyping

@emmatyping

@danny0838 thank you for sticking with this issue! Would love to see removal support for zipfiles.

I'm a bit concerned about repack being overly restrictive. Perhaps ZipFile could track the offsets of removed entries and use that information to allow repack to handle arbitrary zips? In this scenario, repack would scan for local file header signatures and check each offset against the central directory and the list of deleted ZipInfos. If it is in neither, the data is left as is and scanning continues. If it is in the deleted list, the file is dropped; if it is in the central directory, it is kept.

Added support for repack(removed), which accepts a sequence of removed ZipInfos and reclaims space only for the corresponding local file entries. So one can do something like:

with zipfile.ZipFile('archive.zip', 'a') as zh:
    zinfos = [zh.remove(n) for n in zh.namelist() if n.startswith('folder/')]
    zh.repack(zinfos)

The result should not change in most cases, but it is more performant since it eliminates the need to scan the bytes for file entries.

danny0838 avatar May 26 '25 16:05 danny0838

Is there still any problem with this PR?

Here are some things that may need further consideration/discussion:

1. Should the strict_descriptor option for repack() default to True or False?

Summary of trade-offs

| Option | Pros ✅ | Cons ❌ |
| --- | --- | --- |
| `strict_descriptor=False` | Correctly strips any entry with an unsigned data descriptor; more strictly follows the ZIP spec | ~1000× slower in worst cases; might open a hole for DoS (if an attacker crafts offensive entries); slightly higher false-positive risk on random bytes |
| `strict_descriptor=True` | Much faster; safer against DoS scenarios; lower false-positive risk | Cannot strip any entry with an unsigned descriptor; less strict to the ZIP spec (but doesn't violate it) |

Background

When a local file entry has the flag bit indicating usage of data descriptor:

  1. This method first attempts to scan for a signed data descriptor.
  2. If no valid one is found:
    1. For supported compression methods (ZIP_DEFLATED, ZIP_BZIP2, or ZIP_ZSTANDARD), it decompresses the data to find its end offset.
    2. Otherwise it performs a byte-by-byte scan for an unsigned data descriptor.

This option only affects case 2.2, which is used when neither signed descriptor nor decompression-based validation is applicable.

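To make case 2.2 concrete, here is a toy version of a byte-by-byte scan for an unsigned data descriptor (an illustration only; the function name and acceptance test are simplified relative to the PR). The 12-byte descriptor layout (CRC-32, compressed size, uncompressed size) follows the ZIP spec, and this toy validates only the compressed size, which is why false positives are possible:

```python
import struct

# Illustrative byte-by-byte scan for an *unsigned* data descriptor
# (case 2.2 above; not the PR's actual implementation). The descriptor
# is 12 bytes: CRC-32, compressed size, uncompressed size. A position
# is accepted when the claimed compressed size equals the number of
# data bytes scanned so far.
def scan_unsigned_descriptor(data, data_start):
    pos = data_start
    while pos + 12 <= len(data):
        crc, csize, usize = struct.unpack('<LLL', data[pos:pos + 12])
        if csize == pos - data_start:
            return pos  # descriptor candidate found
        pos += 1  # this O(n) walk is what makes the worst case so slow
    return None
```

The per-byte unpacking above is the source of the ~1000× slowdown discussed below.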
Performance comparison

Based on the benchmark (see tests in test_zipfile64.py):

  • 8 GiB ZIP_STORED file with signed data descriptor: ~56.7s
  • 400 MiB ZIP_STORED file with unsigned data descriptor: ~3166s

The latter is around 1000× slower due to the byte-by-byte scanning for a valid data descriptor.

This may also raise a security concern since strict_descriptor=False may open a path for a DoS Attack (if an attacker crafts a ZIP file with offensive entries).

False-positive risk

It's not possible to guarantee the "real" file size of a local entry with a data descriptor without the information from the central directory.

If a local file entry spans 100 MiB, it's theoretically possible that multiple byte ranges (e.g., the first 20 MiB, 30 MiB, etc.) could each appear as valid data + data descriptor segments with differing CRCs and compressed sizes. (Currently, the algorithm validates only the compressed size. Checking CRCs could reduce false positives but would significantly deteriorate performance.) The byte-by-byte validation can increase the risk of a false positive compared to the signature- or decompression-based validation, which only checks certain points and has more prerequisites.

The scan for a data descriptor should be very unlikely to trigger on random bytes, since the candidate entry must have the data descriptor flag bit set and CRC == compressed_size == file_size == 0.

A false positive should be unlikely to happen in practice. If it were to happen, a stale local file entry would be stripped incompletely (e.g. a 30 MiB entry treated as 20 MiB, leaving 10 MiB of random bytes behind), causing the following entries not to be stripped (since the algorithm requires consecutive entries). However, the ZIP file remains uncorrupted.

Spec compliance

According to the ZIP file format specification (since version 6.3.0, released on 2006-09-29): Applications SHOULD write signed descriptors and SHOULD support both forms when reading.

Unsigned descriptors are thus considered legacy, but it is unclear whether they are still used widely.

strict_descriptor=True adheres less strictly to the spec, but does not violate it — because stripping is neither reading nor writing, and a suboptimal stripping does not corrupt the ZIP archive.

2. Should we also implement copy() for ZipFile?

Currently, copying an entry within a ZIP file is cumbersome due to the lack of support for simultaneous reading and writing. The implementer must either:

  • Read the entire entry and write afterwards (which is memory-intensive and inefficient for large files), or
  • Use a temporary file for buffered copying.

Both approaches are more complex and less performant, due to the need to decompress and recompress data.

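For illustration, the first workaround using only today's API might look like this (the helper name is hypothetical):

```python
import io
import zipfile

# The memory-intensive workaround described above, using only the
# current zipfile API: the entry is fully read (decompressed) into
# memory and then rewritten (recompressed).
def copy_entry(zh, src_name, dst_name):
    data = zh.read(src_name)      # entire entry held in memory
    zh.writestr(dst_name, data)   # recompressed on write

buf = io.BytesIO()
with zipfile.ZipFile(buf, 'a') as zh:
    zh.writestr('a.txt', b'hello')
    copy_entry(zh, 'a.txt', 'b.txt')
    print(zh.read('b.txt'))  # prints b'hello'
```

A buffered internal copy, as proposed, would avoid both the full in-memory read and the decompress/recompress round trip.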
It would be much more performant and user-friendly to implement a copy() method, using a similar internal buffered-copy technique to the one _ZipRepacker uses.

Additionally, this opens the door to supporting an efficient move() operation, composed of copy(), remove(), and optionally repack().

And an additional question: whether this should be included in the current PR, or proposed separately as a follow-up?

The current draft for copy() support can be found here.

danny0838 avatar May 31 '25 00:05 danny0838

A higher level question: Why would we want to maintain advanced features that are not used by most people within the standard library at all? This code existing in the stdlib creates a maintenance burden and is the slowest possible way to get features and their resulting trail of bugfixes to people who might want them. All features have an ongoing cost.

Is there a real need for these zipfile features to not simply be advanced ones available in a PyPI package? @jaraco FYI

I appreciate the enthusiasm for implementing something interesting. I'm not sure we actually want to maintain that within the CPython project though. Are there compelling use cases for these features to be part of Python rather than external?

(edit: posted this on the Issue, leaving here for posterity)

gpshead avatar May 31 '25 03:05 gpshead

@gpshead Don't you think your question should be raised in the issue thread rather than here? 🫠

danny0838 avatar May 31 '25 04:05 danny0838

good point, moved, thanks! (where to have what discussions is an area where I consider github's UX to be... not great)

gpshead avatar May 31 '25 15:05 gpshead

@emmatyping Still pending review... I'd like to know if there is still any problem with this PR?

danny0838 avatar Jun 13 '25 07:06 danny0838

Unfortunately it's premature to move forward with this PR until the discussion in the issue is resolved. There are open questions about whether we should add this API or not, and if we do what it would look like. That needs to be worked out in the issue before this PR can move forward, sorry!

emmatyping avatar Jun 13 '25 17:06 emmatyping

Quickly released this as the zipremove PyPI package, which supports Python >= 3.9.

I'm tired of doing all the migration jobs and dealing with all the compatibility cruft over and over again. This work will be migrated to that package and updating of this branch will be halted, until folks sincerely intend to merge it into the standard library.

danny0838 avatar Jun 14 '25 17:06 danny0838

And an additional question: whether this should be included in the current PR, or proposed separately as a follow-up?

This sounds like another feature, but it does sound useful. Perhaps it should have its own issue (if it doesn't already). I'd say let's limit the scope here to addressing the linked issue.

jaraco avatar Jun 15 '25 17:06 jaraco

Should strict_descriptor option for repack() default to True or False?

I don't love that there's two implementations here. My reading from the code is that unsigned descriptors are deprecated, so maybe it would be enough to simply omit support for them, opt for the faster implementation, and (maybe) warn if such a descriptor is encountered that it's unsupported. How prevalent are such descriptors?

jaraco avatar Jun 15 '25 17:06 jaraco

Should strict_descriptor option for repack() default to True or False?

I don't love that there's two implementations here. My reading from the code is that unsigned descriptors are deprecated, so maybe it would be enough to simply omit support for them, opt for the faster implementation, and (maybe) warn if such a descriptor is encountered that it's unsupported. How prevalent are such descriptors?

Actually there are three implementations. 😂

Therefore, the brief conditions for the slow scan to be used are:

  1. removed is not provided.
  2. A byte sequence resembling a local file header is found (with the PK\x03\x04 signature, the data descriptor flag bit (0x08) set, and CRC == compress_size == file_size == 0), which should be very unlikely to occur in random bytes.
  3. No valid signed data descriptor is seen.
  4. The compression method cannot reliably detect the stream ending, i.e. the entry uses STORED or LZMA (not sure if the LZMA decompressor can be reworked to support this).
  5. strict_descriptor == False

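Condition 2 can be made concrete with a toy check based on the standard local file header layout (illustrative only, not the PR's code; the function name is mine):

```python
import struct

# Rough sketch of condition 2: a candidate local file header for a
# streamed entry has the PK\x03\x04 signature, the data descriptor
# flag bit (0x08) set, and zeroed CRC-32, compressed-size, and
# uncompressed-size fields. Field offsets follow the ZIP spec:
# flags at +6, CRC/sizes at +14.
def looks_like_streamed_header(data, pos):
    if data[pos:pos + 4] != b'PK\x03\x04':
        return False
    flags, = struct.unpack('<H', data[pos + 6:pos + 8])
    crc, csize, usize = struct.unpack('<LLL', data[pos + 14:pos + 26])
    return bool(flags & 0x08) and crc == csize == usize == 0
```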
It should be quite unlikely to happen even if strict_descriptor==False, but the problem remains that the scan may be catastrophically slow once it does happen, and that it could be triggered intentionally and offensively.

The prevalence of the deprecated unsigned data descriptor is hard to tell. Most apps like WinZip, WinRAR, and 7-Zip would probably write no data descriptor, since they are generally not used in a streaming condition. For streaming cases, I think most developers would simply use the ZIP implementation at hand, so to answer this question we'd have to check the ZIP implementations of popular programming languages like C, Java, JS, PHP, C#, etc.

Since this feature has been already implemented, I'd prefer to keep it unless there's a strong reason that it should not exist. Defaulting strict_descriptor to False should be enough if it's not welcome.

A minor reason is that we still cannot clearly say that unsigned data descriptors are unsupported even if the slow scan is removed, due to the decompression approach. But another option would be to also skip the decompression approach when strict_descriptor==True.

danny0838 avatar Jun 15 '25 18:06 danny0838

Reworked strict_descriptor=True to ignore any scan of unsigned data descriptors. (https://github.com/danny0838/zipremove/commit/1a07dffcd50adc772dcfaf1e795d391a63f4e16d)

danny0838 avatar Jun 16 '25 12:06 danny0838

@danny0838 in reply to https://github.com/python/cpython/issues/51067#issuecomment-2985304746 (sorry about fragmenting the discussion) -

I'm generally supportive of the iterative approach of start simple and expand from that.

Having said that, since you already implemented _remove_members, I don't see why not to expose it through a public API, given that ZipFile has other examples of functions that receive a list of members (the members arg of extractall); we can simply follow the same pattern for consistency. I do agree it's much more useful: my own local implementation only removes a single member, and I do see the performance benefits of being able to remove multiple members, as I myself often need to remove multiple members from many different files and it would be very beneficial.

I'm not fond of the idea of a 2-step removal (remove/repack), since it can leave the file in a state that some (me included) may see as corruption: zip files aren't supposed to have a CD entry without a corresponding LF entry. I'd rather take an approach that limits certain things (cannot remove a file from a self-extracting archive, for example), and I don't see a large benefit in the flexibility of delayed repacking; it seems like rather an edge case (but I might be wrong here).

Also some of the questions (how to handle a folder) are relevant regardless of whether we take the 1 step or 2 step approach. If the 2 step approach has more support here, I'll take it - we can always improve it in the future, but I'd prefer to keep things simpler when possible.

And last but not least - perfect is the enemy of good - it's not the last python version, and we can always improve as long as the initial version is reasonable, which I believe it is.

distantvale avatar Jun 18 '25 19:06 distantvale

@distantvale

Having said that, since you already implemented _remove_members, I don't see why not expose it through a public API, given that ZipFile has other examples for functions that receive a list of members (members arg of extractall) - can simply follow the same pattern for consistency.

I don't think extractall is a good analogy since it's read-only. Instead you should look at open('w'), write, writestr, and mkdir, each of which is a simple single-entry handler.

There are still design considerations even for your proposed members, for example:

  1. Should there be remove plus removeall, or a remove that accepts either a single entry or a list?
  2. What to do if identical names are passed? Should they be treated as a set and deduplicated, or be removed multiple times?
  3. What to do if any passed value causes an error (such as a non-existent name)?

The current implementation of _remove_members actually defers such decisions. This is fine for a private helper method but is probably premature for a public API.

I'm not fond of the idea of a 2 step removal (remove/repack), since it can leave the file in state that some may see (me included) as corruption - zip files aren't support to have a CD entry without a corresponding LF entry.

I don't get this. remove leaves a LF entry without a corresponding CD entry, which is clearly allowed by the ZIP spec and which no ZIP app will complain about.

I'd rather take an approach that limits certain things (cannot remove a file from a self extracting archive for example), and I don't see the large benefit of delayed repacking flexibility, seems like rather an edge case (but I might be wrong here).

A delayed repacking isn't necessarily done after the archive is closed, it can also happen simply after the method is called and returned.

For example, repacking after a mix of writing, removing, copying, and renaming operations, just before the archive is closed, as an interactive ZIP file editing tool would do.

It's probably challenging enough to design a one-time remove or removeall to support all such cases.

Also some of the questions (how to handle a folder) are relevant regardless of whether we take the 1 step or 2 step approach.

I don't think there's too much need to dig into such details for low level APIs—just let them work simply like other existing methods. It's not the case for a one-time high level remove—and maybe copy and rename—though.

danny0838 avatar Jun 19 '25 03:06 danny0838

@danny0838 I can share my own opinions about the desired behavior but of course not everyone would agree:

I don't think extractall is a good analogy since it's read-only. Instead you should look at open('w'), write, writestr, and mkdir, each of which is a simple single-entry handler.

I don't see the need to separate into read/write, as I believe they should all be consistent. If certain functions, such as extract, have 2 versions - for single/multiple files, then it won't be an exception in the API, and if we find it useful, have both.

There are still design considerations even for your proposed members, for example:

  1. Should there be remove plus removeall, or a remove that accepts either a single entry or a list?

I'd follow extract in this case.

  1. What to do if identical names are passed? Should they be treated as a set and deduplicated, or be removed multiple times?

I might be wrong here, but I think add allows adding multiple files with the same name, so in that case I'd remove multiple times. If I'm wrong about add, I'd treat it as a set.

  1. What to do if any passed value causes an error (such as a non-existent name)?

Just like in extractall, the names must be a subset of namelist().

The current implementation of _remove_members actually defers such decisions. This is fine for a private helper method but is probably premature for a public API.

I don't get this. remove leaves a LF entry without a corresponding CD entry, which is clearly allowed by the ZIP spec and which no ZIP app will complain about.

Ok, fair enough - it just seems less desirable/clean to me, but I might be a minority here.

A delayed repacking isn't necessarily done after the archive is closed, it can also happen simply after the method is called and returned.

For example, a repacking after a mixed operations of writing, removing, copying, renaming, and just before archive closing, such as an interactive ZIP file editing tool would do.

Yes of course, but it might also be called after the file is closed. I can see some performance benefits, but it would seem like something that a low level API would provide, as opposed to a high level one. But again, I won't die on that hill, if most people find it useful, so be it - either way it's a good way to start from my end.

It's probably challenging enough to design a one-time remove or removeall to support all such cases.

Also some of the questions (how to handle a folder) are relevant regardless of whether we take the 1 step or 2 step approach.

I don't think there's too much need to dig into such details for low level APIs—just let them work simply like other existing methods. It's not the case for a one-time high level remove—and maybe copy and rename—though.

I'm fine with a "naive" API as well, documentation is enough for these cases at this point IMO.

distantvale avatar Jun 19 '25 07:06 distantvale

@distantvale

@danny0838 I can share my own opinions about the desired behavior but of course not everyone would agree:

I don't think extractall is a good analogy since it's read-only. Instead you should look at open('w'), write, writestr, and mkdir, each of which is a simple single-entry handler.

I don't see the need to separate into read/write, as I believe they should all be consistent. If certain functions, such as extract, have 2 versions - for single/multiple files, then it won't be an exception in the API, and if we find it useful, have both.

The truth is that they are not consistent. When you say they should be consistent, do you mean that there should be a multiple-file version of open('w'), write, writestr, and mkdir, or that there should be no multiple-file version of extractall, or that each of them should be a single method supporting both single and list input?

There are still design considerations even for your proposed members, for example:

  1. Should there be remove plus removeall, or a remove that accepts either a single entry or a list?

I'd follow extract in this case.

  1. What to do if identical names are passed? Should they be treated as a set and deduplicated, or be removed multiple times?

I might be wrong here, but I think add allows adding multiple files with the same name, so in that case I'd remove multiple times. If I'm wrong about add, I'd treat it as a set.

Actually there is no ZipFile.add. Do you mean write and others?

  1. What to do if any passed value causes an error (such as a non-existent name)?

Just like in extractall, the names must be a subset of namelist().

In extractall, each input is handled one by one, and any error causes the subsequent inputs not to be handled. However, the current _remove_members handles removing and repacking simultaneously. If any error happens in the middle, the whole repacking is left partially done and the archive will be in an inconsistent state, which is unlikely to be an acceptable consequence.

Likewise, for extractall, providing duplicated names just extracts the same entry to the same filesystem path again; for removing, this would be a totally different story.

Anyway, the current _remove_members implementation DOESN'T do what you state above. If you favor that approach and really want such behavior, you'd have to work on that PR yourself, with more tests and feedback (I'm probably not going to keep working on it unless it's the final decision of the issue). It would also be nice if you could provide the code of the implementation you've been using.

danny0838 avatar Jun 19 '25 13:06 danny0838

@distantvale

@danny0838 I can share my own opinions about the desired behavior but of course not everyone would agree:

I don't think extractall is a good analogy since it's read-only. Instead you should look at open('w'), write, writestr, and mkdir, each of which is a simple single-entry handler.

I don't see the need to separate into read/write, as I believe they should all be consistent. If certain functions, such as extract, have 2 versions - for single/multiple files, then it won't be an exception in the API, and if we find it useful, have both.

The truth is that they are not consistent. When you say they should be consistent, do you mean that there should be a multiple-file version of open('w'), write, writestr, and mkdir, or that there should be no multiple-file version of extractall, or that each of them should be a single method supporting both single and list input?

What I'm saying is that there's already precedent in the package for a function that has 2 versions, so it's not out of the ordinary to add another one.

There are stil design considerations even for your proposed members, for example:

  1. Should there be remove plus removeall, or a remove that accepts either a single entry or a list?

I'd follow extract in this case.

  1. What to do if identical names are passed? Should they be treated as a set and deduplicated, or be removed multiple times?

I might be wrong here, but I think add allows adding multiple files with the same name, so in that case I'd remove multiple times. If I'm wrong about add, I'd treat it as a set.

Actually there is no ZipFile.add. Do you mean write and others?

Yup, I mean write.

  1. What to do if any passed value causes an error (such as a non-existent name)?

Just like in extractall, the names must be a subset of namelist().

In extractall, each input is handled one by one, and any error causes the subsequent inputs not to be handled. However, the current _remove_members handles removing and repacking simultaneously. If any error happens in the middle, the whole repacking is left partially done and the archive will be in an inconsistent state, which is unlikely to be an acceptable consequence.

The behavior of extractall might not be aligned with the documentation, which states that members must be a subset of namelist() - hard to say what the intention was, but I think it's a legitimate check. I agree re. the inconsistent state, and maybe that's the most significant reason to keep the external API simple for now. Once the initial version is released, expanding it to more functionality is easier.

Likewise, for extractall, providing duplicated names just extracts the same entry to the same filesystem path again; for removing, this would be a totally different story.

Yeah I agree, and since multiple files with the same names are not very common, I'd opt for treating members as a set.

Anyway, the current _remove_members implementation DOESN'T do what you state above. If you favor that approach and really want such behavior, you'd have to work on that PR yourself, with more tests and feedback (I'm probably not going to keep working on it unless it's the final decision of the issue). It would also be nice if you could provide the code of the implementation you've been using.

My implementation is based on the original PR, so not much there in regards to these options.

In any case I'd proceed with your approach for the time being.

distantvale avatar Jun 19 '25 14:06 distantvale