support optional crc32 for uncompressed streaming zip32 and zip64
This PR implements the idea originally discussed in #17 and #58: it produces ZIP files with the actual length and a crc32 of 0 in the local header, followed by a data descriptor containing the length and the actual crc32.
This allows ZIP file members to be specified with NO_COMPRESSION_64(file_size, 0) and NO_COMPRESSION_32(file_size, 0). Instead of raising the invalid crc32 exception, the crc32 is computed while streaming and stored in the data descriptor.
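To make the layout concrete, here is a minimal sketch of computing the crc32 while passing uncompressed chunks through, then emitting a data descriptor. It uses only the standard library and is not the code in this PR: stored_member_body and the two data_descriptor_* helpers are hypothetical names for illustration. The descriptor field order (signature, crc-32, compressed size, uncompressed size) follows APPNOTE.TXT.

```python
import struct
import zlib

# Data descriptor signature from APPNOTE.TXT (optional in the spec, commonly written)
DATA_DESCRIPTOR_SIGNATURE = b'PK\x07\x08'


def data_descriptor_32(crc_32, compressed_size, uncompressed_size):
    # Zip32 layout: crc-32 and both sizes are 4 bytes, little-endian
    return DATA_DESCRIPTOR_SIGNATURE + struct.pack(
        '<III', crc_32, compressed_size, uncompressed_size)


def data_descriptor_64(crc_32, compressed_size, uncompressed_size):
    # Zip64 layout: crc-32 stays 4 bytes, both sizes widen to 8 bytes
    return DATA_DESCRIPTOR_SIGNATURE + struct.pack(
        '<IQQ', crc_32, compressed_size, uncompressed_size)


def stored_member_body(chunks, expected_size):
    # Hypothetical helper: yields the chunks unchanged (no compression),
    # accumulating the CRC-32 and size, then yields a zip32 data descriptor
    crc_32 = zlib.crc32(b'')
    size = 0
    for chunk in chunks:
        crc_32 = zlib.crc32(chunk, crc_32)
        size += len(chunk)
        yield chunk
    if size != expected_size:
        raise ValueError('Actual size does not match the declared size')
    # Stored data: compressed size equals uncompressed size
    yield data_descriptor_32(crc_32, size, size)
```

The size is still checked against the declared value, since it has already been written to the local header by the time the data has been streamed.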
The ZIP files produced with this implementation should:
- Pass the unzip -t test
- Be stream unzippable with stream_unzip (according to my testing) without any additional changes. I believe this was the main objection to the ideas in #58, since that discussion was about making both length and crc32 optional; now only crc32 is.
Being able to support this without maintaining a custom fork would really help our use case.
The main use case is stream-zipping files from S3-like buckets, where the size is available but the crc-32 usually is not (also mentioned in #17).
Interesting... will have more of a look around and ponder. Some initial thoughts/questions/requests on this:
- Does it produce ZIP files that are not valid according to the spec? From https://pkware.cachefly.net/webdocs/casestudies/APPNOTE.TXT:
  If this bit is set, the fields crc-32, compressed size and uncompressed size are set to zero in the local header. The correct values are put in the data descriptor immediately following the compressed data.
  I'm not sure right now if we should go against the spec...
- I think zero is a valid CRC_32, so if some sort of sentinel value were to be used to change behaviour/structure, it should be something else. None?
- The pattern of using the CRCActual object to get data out from _no_compression_streamed_data: this isn't consistent with the very similar case of getting data out from _zip_data. I think _no_compression_streamed_data should return the actual crc_32, just like _zip_data does. So then the value could be retrieved with something along the lines of
  actual_crc_32 = yield from encryption_func(_zip_data(....
- And can we test far more: not just stream-unzip, but Python's ZipFile, unzip, bsdcpio/libarchive, 7zip, and AES encrypted versions as well where possible. If this is against the spec, I would say extra important
My biggest concern is the spec thing...
To communicate: I am more and more against this, because it results in ZIP files that do not adhere to the spec. Even if we test a load of existing unzippers, it's not very friendly to future unzippers that expect files to adhere to the spec, or even to new versions of existing unzippers that make changes expecting them to be fine because of the spec.
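For reference, a rough sketch of the kind of check a strict unzipper could make based on the APPNOTE.TXT sentence quoted earlier. It only looks at the fixed 30-byte part of a zip32 local file header and ignores zip64 extra fields, so it is an illustration of the concern rather than a complete validator. The function name is hypothetical.

```python
import struct

# Fixed part of a local file header: signature, version needed, flags,
# compression method, mod time, mod date, crc-32, compressed size,
# uncompressed size, file name length, extra field length
LOCAL_HEADER_STRUCT = struct.Struct('<4sHHHHHIIIHH')
LOCAL_HEADER_SIGNATURE = b'PK\x03\x04'
DATA_DESCRIPTOR_FLAG = 0b1000  # bit 3 of the general purpose bit flag


def local_header_follows_bit_3_rule(header_bytes):
    # Hypothetical check: if bit 3 is set, crc-32 and both sizes in the
    # local header must all be zero; if it is not set, the rule does not apply
    (signature, _version, flags, _method, _mod_time, _mod_date,
     crc_32, compressed_size, uncompressed_size,
     _name_length, _extra_length) = LOCAL_HEADER_STRUCT.unpack(
        header_bytes[:LOCAL_HEADER_STRUCT.size])
    if signature != LOCAL_HEADER_SIGNATURE:
        raise ValueError('Not a local file header')
    if not flags & DATA_DESCRIPTOR_FLAG:
        return True
    return crc_32 == 0 and compressed_size == 0 and uncompressed_size == 0
```

A header with bit 3 set and a non-zero size, which is what this PR would produce, fails this check even though current unzippers may accept the file.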