
Avoid 0xFF00 markers in Huffman data

Open kornelski opened this issue 11 years ago • 1 comment

0xFF bytes in Huffman data need special coding that adds another byte of overhead. If this overhead could be reliably minimized, then the approximation of scan data size described in #15 would be more effective.
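
For reference, the stuffing rule behind that overhead, as a minimal sketch (not mozjpeg code; the function and parameter names are made up):

```c
#include <stddef.h>
#include <stdint.h>

/* Every 0xFF byte produced inside an entropy-coded segment must be
 * followed by a stuffed 0x00 so decoders don't read it as a marker,
 * so the overhead is one extra byte per 0xFF the bitstream happens
 * to emit. */
static size_t stuffing_overhead(const uint8_t *scan, size_t len)
{
    size_t extra = 0;
    for (size_t i = 0; i < len; i++)
        if (scan[i] == 0xFF)
            extra++;            /* one stuffed 0x00 after each 0xFF */
    return extra;
}
```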

Here's an algorithm off the top of my head:

  1. Find which code happens to overlap the 0xFF bytes most frequently
  2. Swap the code with another code of the same length (if there isn't a code to swap with — maybe modify/regenerate the Huffman tree (e.g. flip all bits in all codes?))
  3. goto 1 as long as the result keeps getting smaller (a rough sketch of this loop follows below)
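
A brute-force toy version of this loop might look like the following. It is a sketch only, not mozjpeg code: a tiny hand-made prefix code and symbol stream, trying every same-length swap rather than targeting the single worst offender as in step 1.

```c
#include <stdint.h>
#include <stdio.h>

/* Toy model: four symbols with a canonical-style prefix code
 * (lengths 1,2,3,3 -> codes 0,10,110,111).  count_ff() re-encodes a
 * symbol stream and counts the 0xFF bytes produced, each of which
 * would cost a stuffed 0x00 in a real JPEG scan. */
#define NSYM 4

static const int code_len[NSYM] = { 1, 2, 3, 3 };

static size_t count_ff(const int *stream, size_t n, const uint16_t *codes)
{
    uint32_t acc = 0;
    int nbits = 0;
    size_t ff = 0;

    for (size_t i = 0; i < n; i++) {
        int s = stream[i];
        acc = (acc << code_len[s]) | codes[s];
        nbits += code_len[s];
        while (nbits >= 8) {
            if (((acc >> (nbits - 8)) & 0xFF) == 0xFF)
                ff++;
            nbits -= 8;
            acc &= (1u << nbits) - 1u;   /* drop the byte just counted */
        }
    }
    return ff;   /* trailing partial byte ignored for simplicity */
}

int main(void)
{
    /* A stream dominated by symbol 3, whose all-ones code 111 tends to
     * pile up into 0xFF bytes. */
    const int stream[] = { 3, 3, 3, 3, 0, 3, 3, 3, 3, 3, 1, 3, 3, 3 };
    const size_t n = sizeof stream / sizeof stream[0];

    uint16_t codes[NSYM] = { 0x0, 0x2, 0x6, 0x7 };   /* 0,10,110,111 */
    size_t best = count_ff(stream, n, codes);
    printf("0xFF bytes before: %zu\n", best);

    /* Brute-force version of steps 1-3: try every swap between codes of
     * equal length, keep any swap that lowers the 0xFF count, and repeat
     * until no swap helps any more. */
    int improved = 1;
    while (improved) {
        improved = 0;
        for (int a = 0; a < NSYM; a++) {
            for (int b = a + 1; b < NSYM; b++) {
                if (code_len[a] != code_len[b])
                    continue;
                uint16_t tmp = codes[a];
                codes[a] = codes[b];
                codes[b] = tmp;
                size_t ff = count_ff(stream, n, codes);
                if (ff < best) {
                    best = ff;
                    improved = 1;
                } else {              /* no gain: revert the swap */
                    codes[b] = codes[a];
                    codes[a] = tmp;
                }
            }
        }
    }
    printf("0xFF bytes after:  %zu\n", best);
    return 0;
}
```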

And/Or:

  1. For all symbols in the stream, make a histogram of pairs (code[symbol[n-1]], code[symbol[n]]).
  2. Find which pairs have 8+ consecutive 1s.
  3. Starting from the most frequent pair, swap codes with other codes of the same length to break the streaks (see the sketch after this list).
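
A possible shape for the check in step 2, as a toy sketch (the helper name is made up, and it ignores byte alignment and anything emitted between the two codes in a real scan):

```c
#include <stdint.h>

/* Does the concatenation of two Huffman codes contain a run of 8 or
 * more consecutive 1-bits?  Used to flag pairs worth breaking up. */
static int pair_has_ff_risk(uint32_t code_a, int len_a,
                            uint32_t code_b, int len_b)
{
    uint64_t bits = ((uint64_t)code_a << len_b) | code_b;
    int run = 0;
    for (int i = len_a + len_b - 1; i >= 0; i--) {
        if ((bits >> i) & 1) {
            if (++run >= 8)
                return 1;
        } else {
            run = 0;
        }
    }
    return 0;
}
```

The histogram from step 1 would then only need to keep the pairs for which this check returns 1.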

Likely it can be improved.

kornelski avatar Mar 19 '14 01:03 kornelski

I have to check how codes having the same Huffman code length are stored in the file. I seem to recall that they are stored by increasing symbol value, with nothing related to their actual frequency (apart from having the same Huffman code length, which comes from the fact that they have similar frequencies). For example, 4 symbols of code length 5: 7 9 10 12. But if 12 appears a bit more frequently than 10 and unfortunately has a code ending with a lot of ones, like 00111, it could be a smart move to change the recording order to 7 9 12 10; this time 12 would get the less "dangerous" code 00110 and 10 would get 00111.
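
Roughly how the recording order would map to codes, assuming canonical assignment and assuming 00100 happens to be the first length-5 code (a sketch, not the actual libjpeg routine):

```c
#include <stdio.h>
#include <stdint.h>

/* Symbols recorded for a given code length get consecutive code values
 * in listing order, so writing 7 9 12 10 instead of 7 9 10 12 swaps
 * which symbol ends up with the 1-heavy code 00111. */
static void print_code(uint16_t code, int len)
{
    for (int i = len - 1; i >= 0; i--)
        putchar(((code >> i) & 1) ? '1' : '0');
    putchar('\n');
}

int main(void)
{
    const int dht_order[] = { 7, 9, 12, 10 };  /* reordered length-5 symbols */
    uint16_t code = 0x04;                      /* assumed first length-5 code: 00100 */
    for (int i = 0; i < 4; i++) {
        printf("symbol %2d -> ", dht_order[i]);
        print_code(code++, 5);
    }
    return 0;
}
```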

I'm pretty sure the histogram of pairs will not work, since there is often raw binary data in between two Huffman codes.

frkay avatar Mar 19 '14 22:03 frkay