Optimal huff depth speed improvements
TLDR
This PR is a follow-up to a previous PR that, for high compression levels, brute-force tests all valid Huffman table depths and chooses the optimal one by minimizing encoded size + header size. This PR introduces some speed optimizations that could (potentially) allow this feature to be used at lower compression levels.
Note: This does cost us a bit of compression ratio, but offers a significant speed improvement. This could mean that we want more fine-grained depth modes, but that could be overkill.
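For context, the selection performed by the previous PR can be pictured with the minimal sketch below. It only illustrates the brute-force loop: the cost[] values and the MIN_LOG/MAX_LOG bounds are made-up stand-ins for the (header size + encoded size) totals that the real code would measure after building a depth-limited Huffman table for each candidate log; this is not zstd's actual HUF code.

```c
#include <stdio.h>
#include <stddef.h>

#define MIN_LOG 5   /* smallest candidate table log (illustrative) */
#define MAX_LOG 11  /* largest candidate table log (illustrative) */

/* Brute force: evaluate every candidate log and keep the one with the
 * smallest total size. cost[log] stands in for the measured
 * (header size + encoded size) when depths are capped at `log`. */
static unsigned brute_force_optimal_log(const size_t cost[],
                                        unsigned minLog, unsigned maxLog)
{
    unsigned bestLog = minLog;
    size_t bestCost = cost[minLog];
    unsigned log;
    for (log = minLog + 1; log <= maxLog; ++log) {
        if (cost[log] < bestCost) { bestCost = cost[log]; bestLog = log; }
    }
    return bestLog;
}

int main(void)
{
    /* Made-up totals (bytes) for logs 0..MAX_LOG; entries below MIN_LOG are unused. */
    size_t cost[MAX_LOG + 1] = { 0, 0, 0, 0, 0, 5400, 5150, 5080, 5120, 5200, 5310, 5450 };
    printf("brute-force optimal log: %u\n",
           brute_force_optimal_log(cost, MIN_LOG, MAX_LOG));
    return 0;
}
```

The speed optimizations in this PR are about avoiding evaluating every candidate, since each candidate otherwise has to be built and measured.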
Benchmarking
I benchmarked on an Intel(R) Xeon(R) Gold 6138 CPU @ 2.00GHz machine with core isolation and turbo disabled. I measured compression time and compression ratio for silesia.tar (212 MB), compiling with clang 15. I tested various combinations of compression level and chunk size, ran each scenario 5 times, and kept the maximum speed value.
Speed-Optimized Optimal Log vs. No Optimal Log
In the following tables, ctrl uses the original Huffman log method with no speed or compression optimizations, and test uses the speed-optimized optimal log method.
Default block size
-B1KB
-B16KB
Speed-Optimized Optimal Log vs. Brute Force Optimal Log
In the following tables, ctrl uses the brute-force optimal log method, and test uses the speed-optimized optimal log method.
Default block size
-B1KB
-B16KB
In the comparison above, what is dev?
Apologies, dev refers to the original dev branch, without any sort of log optimizations. I will add additional tables comparing this speed optimization to my original PR.
While the compression losses are small, they are nonetheless present, and frequent enough to call into question the initial objective of "speed benefit at no compression loss".
My understanding is that this PR starts by "guessing" what is likely a good optLog value, and then tests left and right to see if there are better ones.
My concern is that managing the left/right regression logic may be a bit more complex than it initially seems. Of course, the resulting format is always correct; the concern is about finding the best choice, as the brute-force method does.
A recommendation here would be to simplify the logic by searching in only a single direction: from smallest to largest. See if this method results in any loss of compression (if it does, then there is more to look into), and see if it improves speed measurably.
The intuition is that it will help speed for small data, which is where it matters most because the cost is perceptible. It will probably not be efficient for large data, but also, the relative cost is much lower in this case.
Finally, presuming it works (to be proven), it would be a good step forward, a reference point that could still be improved upon.
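For concreteness, one possible reading of the single-direction suggestion is sketched below, reusing the toy cost[] model from the sketch in the description. The early exit on the first non-improving candidate is an assumption about where the speed saving would come from, not necessarily the exact logic adopted in this PR.

```c
#include <stddef.h>

/* Single-direction search: start from the smallest candidate log and walk
 * upward, stopping at the first log whose total (header + encoded) size
 * fails to improve on the best seen so far. */
unsigned ascending_optimal_log(const size_t cost[],
                               unsigned minLog, unsigned maxLog)
{
    unsigned bestLog = minLog;
    size_t bestCost = cost[minLog];
    unsigned log;
    for (log = minLog + 1; log <= maxLog; ++log) {
        if (cost[log] >= bestCost) break;  /* stopped improving: keep the best so far */
        bestCost = cost[log];
        bestLog = log;
    }
    return bestLog;
}
```

Because the scan stops at the first non-improving log, it can miss a better log further up whenever the size curve is not unimodal, which would explain small ratio regressions relative to brute force on some inputs.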
After many rounds of experimentation, I unfortunately could not find a solution that meets the initial objective of achieving speed improvements with no loss in ratio. Any potential speed improvement that deviates from brute force seems to regress compression ratio at least somewhat for 1KB block sizes (even if only by a couple hundred bytes).
The approach proposed by @Cyan4973 appears to be the best option for now, though still not quite optimal.
Speed-Optimized Optimal Log vs. Brute Force Optimal Log
In the following tables, ctrl uses the brute-force optimal log method, and test uses the speed-optimized optimal log method.
Default block size
-B1KB
-B16KB
There remains some loss in compression ratio for smaller block sizes, though the loss is quite small.
I implemented the suggestions made by @Cyan4973 and observed favorable results.
Note: The losses we see in decompression speed appear to be random noise. I ran the benchmarks additional times and did not observe the speed loss again.
Speed-Optimized Optimal Log vs. Brute Force Optimal Log
In the following tables, ctrl uses the brute-force optimal log method, and test uses the speed-optimized optimal log method.
Default block size
-B1KB
-B16KB