ribzip2 Performance of compression: The clone wars

ribzip2 uses a very naive representation of Huffman codes and also writes them in a naive way: it uses dynamically allocated arrays of enums. Also at other places the habit of representing bits as arrays of enums has large costs, in total at least 5% are spent during clone operations of these arrays. There are basically these places where this can be eliminated by a better internal representation, e.g. a 32 bit integer (bzip2 Huffman codes are length-limited to 17 bits writing and 20 bits reading anyway).

[ ] replace representation of huffman codes by 32 bits integers to avoid cloning of arrays
[ ] replace bitwriter internal representation by bytes or integers
[ ] use bit array (represented by integers, for examples) instead of arrays of enums
[ ] store block data more efficiently instead of just using arrays of Bit enums

Jan 09 '22 22:01 torfmaster

Hey @torfmaster. I wanted to take a shot at this, if this is still open.

Mar 24 '22 12:03 pdogr

Yes it is. Go ahead!

Mar 24 '22 17:03 torfmaster

torfmaster - greetings. I saw your project just a couple days ago, and am impressed with your progress. My son, Micah Snyder, is the current custodian of the bzip2 repository.

We have been talking about writing a Rust version. I'm somewhat new to Rust, and have been cutting my teeth on a bzip2 implementation (private repository). Just last night I actually achieved very similar compression level's to Julian's code - was so happy. I'll pull your code and look at it. (The solution for me was recreating a "partial sort" when optimizing the Huffman tables, patterned off Julian's code.)

Apr 01 '22 21:04 ohsnyt

@ohsnyt I'm very happy, that my work (more than a year in my spare time) gets some attention :) By the way I achieved a similar compression level to the original implementation by using a more or less classical k-means-clustering approach. You can enable this using the command line parameter (or inside the lib set the appropriate flag). However, it is a bit slower by now and I didn't enable it by default. To be honest I was unable to understand the original implementation of the optimization of the Huffman tables - if you have any hint why it works, give me a hint :)

I would be happy to bring this code to a broader audience, if you have any ideas, how - feel free to tell me!

Apr 02 '22 10:04 torfmaster

My GitHub skills are not great (yet). I tried create a fork and publish it back up. It is heavily commented. Look there and see if that helps. (If the code didn't push up, let me know.) The new code has a new subcommand for "standard" bzip2 encoding.

Since Julian's code computes Huffman codes so differently, I decided to make a looooong match branch at line 53 of block_encoders.rs. I didn't change much of any code before that point (except some CLI stuff to get the new option up). I didn't touch the decoding side.

Apr 05 '22 13:04 ohsnyt

@ohsnyt thank you! If you create a PR and tick "allow edits from maintainers" I can try to integrate this into the existing code. Still I am unsure why this code (I mean the original bzip2 code) works at all, but I'll benchmark it and if it shows to be more efficient than the k means clustering method I'll integrate it. One question regarding your port. Have you tried to benchmarked your code with large files? I have used the same approach (i.e. fat pivot radix sort) initiallly but it has shown much slower than the now-used SA-IS approach.

Apr 08 '22 06:04 torfmaster

Sorry for the delay in replying. I am still working some bugs out of a larger import. I’m getting closer, but finding in initial testing that the speeds are very similar.

I’ll be in touch more later.

David

On Apr 8, 2022, at 1:39 AM, torfmaster @.***> wrote:

@ohsnyt https://github.com/ohsnyt thank you! If you create a PR and tick "allow edits from maintainers" I can try to integrate this into the existing code. Still I am unsure why this code (I mean the original bzip2 code) works at all, but I'll benchmark it and if it shows to be more efficient than the k means clustering method I'll integrate it. One question regarding your port. Have you tried to benchmarked your code with large files? I have used the same approach (i.e. fat pivot radix sort) initiallly but it has shown much slower than the now-used SA-IS approach.

— Reply to this email directly, view it on GitHub https://github.com/torfmaster/ribzip2/issues/8#issuecomment-1092494664, or unsubscribe https://github.com/notifications/unsubscribe-auth/AOT32KDNSYR733NRKZ7ID3TVD7ICRANCNFSM5LSLFSWA. You are receiving this because you were mentioned.

Apr 22 '22 11:04 ohsnyt