Support bijection for uncompressed archive formats to facilitate deduplication
In addition to improving the ratio of a final recompression pass, precomp by its nature also improves deduplication across a dataset, since the same content may coexist in both compressed and uncompressed forms; once the compressed copies are expanded, a deduplicator can match them against the uncompressed ones. So there is a benefit to the deduplication ratio (across a long-range data stream or set) as well as to compression.
That said, I'd like to propose that precomp support a bijective transform of an input stream which is applied even to an uncompressed archive, whether that's a raw format, a tarfile, cpio/ditto, a ZFS send stream, zip -0, or an excerpt from an uncompressed VMDK containing one of a few popular filesystem formats.
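To make the mismatch concrete, here is a tiny Python illustration (the member name and payload size are arbitrary): the same bytes appear in a raw file and in its tar representation, but shifted by a 512-byte header block and followed by zero padding and trailer blocks, so the two copies never line up at fixed offsets.

```python
import io
import tarfile

payload = b"A" * 5000                      # stand-in for a raw file's contents

buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode="w") as tf:
    info = tarfile.TarInfo("example.bin")  # hypothetical member name
    info.size = len(payload)
    tf.addfile(info, io.BytesIO(payload))
archive = buf.getvalue()

print(archive.find(payload))   # 512: the payload starts after one header block
print(len(archive))            # well past 5000: zero padding plus trailer blocks
```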
This is proposed because it would further precomp's ability to normalise data and prepare it for efficient deduplication and compression. Variable-block deduplication can then be done by tools such as ddar (attractive because it needs negligible memory), srep (more efficient than lrzip), or pcompress, for example. In principle, I could set the (average) block size for deduplication small enough that more blocks are recognised as equivalent and thus deduplicated, but that costs more than it should: pushing it that far might multiply the resource requirements by a factor of ten. Analogously, if each of those tools chose a large enough block size, the periodic excess bytes in the stream (the bytes that make up the container format itself) wouldn't cause as big a problem; but a large block size is not always chosen, so this wouldn't be sufficient either.
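For reference, here is a minimal content-defined-chunking sketch (a gear-style rolling hash; the mask and the minimum/maximum lengths are hypothetical, tuned for roughly 4 KiB average chunks). It illustrates the trade-off above: a smaller average chunk recovers more of the shared payload around interleaved container headers, but the number of chunk-index entries grows in proportion.

```python
import hashlib

# 256 pseudo-random 32-bit values for the gear hash, derived deterministically.
GEAR = [int.from_bytes(hashlib.sha256(bytes([i])).digest()[:4], "big")
        for i in range(256)]

def cdc_chunks(data, mask=0x0FFF, min_len=1024, max_len=65536):
    """Yield (offset, length) pairs. A 12-bit mask gives ~4 KiB average chunks;
    widening the mask enlarges the chunks and shrinks the index, and vice versa."""
    h, start = 0, 0
    for i, b in enumerate(data):
        h = ((h << 1) + GEAR[b]) & 0xFFFFFFFF
        length = i + 1 - start
        if (length >= min_len and (h & mask) == 0) or length >= max_len:
            yield (start, length)
            start, h = i + 1, 0
    if start < len(data):
        yield (start, len(data) - start)
```

Because the boundaries depend only on local content, chunks that lie strictly inside a member's payload match between the archive and the raw file; what is lost is each chunk that straddles a 512-byte header, which is why smaller chunks waste less but cost more index entries.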
What if the tar/cpio format, ZFS stream, or VMDK/filesystem stream were reversibly altered in such a way that a deduplication pass on the resulting hybrid data stream would treat these multiple representations as nearly equivalent? Suppose, for example, that I have a tarfile and have also unpacked its raw files. If I then create a cpio file of the whole tree, including all of that, I now have multiple representations of the same data, and I'd rather not store it more than once. The duplication is easy to notice when everything sits in the same subdirectory, but suppose the data is replicated in a few scattered places. Ideally the idea should also apply recursively: a cpio file containing similar tarfiles should simply be treated as a nested structure referencing identical files.
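As a sketch of what such a transform could look like for one format (ustar-style tar only; function names are hypothetical, and GNU base-256 size fields, sparse members, and non-zero padding bytes are ignored here, which a real bijection would have to carry as side information), the stream could be split into a metadata stream of raw header/trailer blocks and a payload stream of unpadded member contents, so the payload bytes line up with the raw files and with payloads extracted from other containers:

```python
import io

BLOCK = 512

def _size(header: bytes) -> int:
    field = header[124:136].rstrip(b" \0")       # octal size field of a ustar header
    return int(field, 8) if field else 0

def split_tar(tar: bytes):
    """Split a tar stream into (metadata, payload): header/trailer blocks in one
    stream, unpadded member contents in the other."""
    meta, payload = io.BytesIO(), io.BytesIO()
    pos = 0
    while pos + BLOCK <= len(tar):
        header = tar[pos:pos + BLOCK]
        meta.write(header)
        pos += BLOCK
        if header == b"\0" * BLOCK:              # end-of-archive / padding block
            continue
        size = _size(header)
        payload.write(tar[pos:pos + size])       # raw contents, no padding
        pos += (size + BLOCK - 1) // BLOCK * BLOCK
    return meta.getvalue(), payload.getvalue()

def join_tar(meta: bytes, payload: bytes) -> bytes:
    """Inverse of split_tar, assuming the usual all-zero inter-member padding."""
    out, ppos = io.BytesIO(), 0
    for mpos in range(0, len(meta), BLOCK):
        header = meta[mpos:mpos + BLOCK]
        out.write(header)
        if header == b"\0" * BLOCK:
            continue
        size = _size(header)
        out.write(payload[ppos:ppos + size])
        out.write(b"\0" * ((-size) % BLOCK))     # reinsert zero padding
        ppos += size
    return out.getvalue()
```

A cpio or ZFS-stream splitter could follow the same pattern, and since the payload stream is just concatenated file contents, applying the same splitting to it recursively would also cover the nested tar-inside-cpio case.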
The more data is expressed in its native form, and the more closely a canonical representation can be derived from it, the better the deduplication and compression. I think most of the formats mentioned could be translated into one another as part of stream processing with bounded memory. An imperfect transformation would also suffice, as long as it is reversible and the side information needed to restore the original is encoded alongside. In theory something like xdelta3 with just the right settings would help; maybe there's a way to generalise the idea without handling the particulars of a multitude of archive formats.
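In that spirit, a format-agnostic fallback could store the container as copy/literal instructions against a canonical payload (the role xdelta3 would play), keeping only the container's own bytes verbatim. The sketch below is purely illustrative: SequenceMatcher is far too slow and memory-hungry for real streams and does not meet the bounded-memory goal, and the instruction encoding is made up, but it shows that reversibility only needs the side information, not an exact per-format parser.

```python
from difflib import SequenceMatcher

def encode_against(canonical: bytes, container: bytes):
    """Express `container` as COPY ranges into `canonical` plus LITERAL bytes."""
    ops = []
    matcher = SequenceMatcher(None, canonical, container, autojunk=False)
    for tag, a1, a2, b1, b2 in matcher.get_opcodes():
        if tag == "equal":
            ops.append(("copy", a1, a2 - a1))          # bytes shared with the payload
        else:
            ops.append(("literal", container[b1:b2]))  # container-only bytes (headers etc.)
    return ops

def decode_against(canonical: bytes, ops) -> bytes:
    out = bytearray()
    for op in ops:
        if op[0] == "copy":
            out += canonical[op[1]:op[1] + op[2]]
        else:
            out += op[1]
    return bytes(out)
```

Round-tripping `decode_against(canonical, encode_against(canonical, container)) == container` holds regardless of format, so the archive, the raw files, and any nested container could all be reduced to one shared payload plus small instruction streams.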