compressed_map icon indicating copy to clipboard operation
compressed_map copied to clipboard

Doesn't compress small maps very well

Open bitwiseshiftleft opened this issue 3 years ago • 0 comments

This library achieves good compression on large maps, but on small ones it's less impressive.

The main culprit here is padding in CompressedRandomMap: since the bits of the output are compressed in parallel, there is separate padding (up to some 40 bits = 5 bytes) for each bit of output. So in addition to headers and magic numbers, a 1/256 ApproxSet wastes up to 40 bytes on padding.

The best solution on this is probably for future research, but the obvious tweak is that for small maps we don't need to compress the bits in parallel, so that there is only one set of padding instead of 8. A simple thing to try is that should still have good performance is, instead of starting at word (valueBits * (hash % nblocks)), to start at word (hash % (valueBits * nBlocks)). This hopefully has a high solution probability and fast access times. Constructing the map requires solving a system that's valueBits times larger.

The straightforward implementation would require a bounds / wrap check on each access. However if this is only used when the map is small (say < 1kB or even smaller) then we can just copy it, and append enough of the beginning to the end that we avoid the wrap check.

Other possibilities include a smaller block size and a smaller padding size. With small blocks, solution probabilities drop off for large maps, but for small ones it should be fine.

bitwiseshiftleft avatar Dec 28 '22 13:12 bitwiseshiftleft