Writing an archive can be inefficient because of seeks
Description A new entries in an archive is written by:
- writing a preliminary directory entry
- writing file contents
- seeking backward before the directory entry
- writing the real directory entry
- seeking forward to skip the file
This can be quite inefficient for an archive consisting of many small files if the archive’s zip_source_t flushes its buffers for seeking and has high latency of flushing the buffer, e.g. when working with a remote file abstraction.
We should be able to do better if ZIP_SOURCE_STAT provides enough information to write the correct directory entry immediately.
Even if libzip does not trust the crc provided by the entry’s zip_source_t, it can write the preliminary entry with the provided crc, and later check if the directory entry needs to be updated or not.
Solution
Avoid write seeks when possible (seeking to the current position is OK), i.e. if ZIP_SOURCE_STAT provides enough information to write the correct directory entry before writing file contents.
Update:
It is sometimes possible for seeks near the current position to be performed within the buffer, and not to cause flushing of the buffer. I made such change in https://github.com/google/riegeli/commit/1d03ff7b5336718f4ed38522e0cbff2ed62c9fab.
But this does not help when the entry crosses a buffer boundary. Also, POSIX fseek() is specified to flush buffered data to the file, so this does not help when ZIP_SOURCE_SEEK_WRITE maps directly to fseek().
Comparing the dirent we wrote before writing the file data and only seeking and re-writing it if it changed sounds like a good optimisation. We'll implement it when we get to it, or you could give it a try and submit a pull request.
Seeking within the file buffer is beyond what we are willing to do.
I do not expect libzip to seek within the buffer, since buffering is applied by the implementation of the source, i.e. in general by the client of libzip. For zip_source_file_create() buffering is delegated to stdio and this is infeasible: fseek() is documented to always flush the buffer as a side effect.
I mentioned this as a partial mitigation which can be performed by the source. This mitigation is useful while libzip does seeks, and also for cases where the seek cannot be eliminated because the initial dirent does not contain all information (e.g. crc). This mitigation is partial because it is infeasible if the file crossed buffer boundaries.
The optimization should tentatively trust the client-supplied CRC for the initial dirent. Even if you prefer to recompute CRC for uncompressed sources (per https://github.com/nih-at/libzip/issues/359), the dirent can be rewritten if the CRC turns out not to match.
The optimization should tentatively trust the client-supplied CRC for the initial dirent. Even if you prefer to recompute CRC for uncompressed sources (per #359), the dirent can be rewritten if the CRC turns out not to match.
Yes, that's the plan.
Alternatively, one can write a 'streamed' ZIP, then seeking wouldn't be necessary. See #378.