Returning both string and byte[] or vice versa

Open winkmichael opened this issue 4 years ago • 1 comments

Hello,

I've been using your lib with success on a little project, and it works great. Thanks!

One question: is there any particular reason Decompress and Compress don't allow both byte[] and string as input and output?

I realize I can easily convert it, and I do: I take the compressed string, convert it to byte[], and then encrypt it.

Thanks, Mike

Mar 13 '21 01:03 winkmichael

Hi Mike,

Glad you're finding this project helpful.

This compression algorithm is for a very specific use case: short ASCII-encoded text. If you try to compress non-ASCII characters, the decompressed result will likely differ from the original. The algorithm uses a codebook trained specifically for this use case. This is how we can achieve good compression on such small strings: the codebook is shipped with the library rather than being embedded in the compressed stream. The downside is that the codebook is fixed. It works relatively well with lowercase English text, URLs and HTML snippets.
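To make the fixed-codebook idea concrete, here is a minimal Python sketch. It is a toy, not this library's codebook or byte format: `CODEBOOK`, `LITERAL`, `compress` and `decompress` are all hypothetical names made up for illustration. Both sides share the same table, so nothing about it has to travel with the compressed data, and non-ASCII input is rejected up front.

```python
# Toy shared codebook (hypothetical; NOT the library's real table).
# Entries are ordered longest first so the first match is the longest one.
CODEBOOK = ["http://", " the ", "the", "and", "ing", "er",
            " ", "e", "t", "a", "o", "n"]
LITERAL = 0xFF  # escape marker: the next byte is a raw ASCII character


def compress(text: str) -> bytes:
    """Replace codebook entries with one-byte indexes; escape everything else."""
    if not text.isascii():
        raise ValueError("this toy scheme only handles ASCII text")
    out = bytearray()
    i = 0
    while i < len(text):
        for code, entry in enumerate(CODEBOOK):
            if text.startswith(entry, i):
                out.append(code)          # one byte stands for the whole entry
                i += len(entry)
                break
        else:
            out += bytes([LITERAL, ord(text[i])])  # two bytes for a literal
            i += 1
    return bytes(out)


def decompress(data: bytes) -> str:
    """Invert compress() using the same shared codebook."""
    parts = []
    i = 0
    while i < len(data):
        if data[i] == LITERAL:
            parts.append(chr(data[i + 1]))
            i += 2
        else:
            parts.append(CODEBOOK[data[i]])
            i += 1
    return "".join(parts)


sample = "the tent and the note"
packed = compress(sample)
```

Text that matches the table shrinks, while text that does not match grows (each unmatched character costs two bytes here), which is the trade-off of a fixed codebook.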

The compressed data (like almost all other compression algorithms) is represented as a byte array. To represent this as a string would require encoding (such as base64) which would increase its size and reduce the compression effectiveness. For this reason there is no string return type for the Compress method.
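As a rough illustration of that size cost, the Python sketch below compares a byte array with its base64 text form. The 30-byte payload is an arbitrary stand-in for compressed output, and `base64_length` is a helper name introduced here, not part of any library.

```python
import base64


def base64_length(n_bytes: int) -> int:
    """Length of standard padded base64 text for n_bytes of input."""
    return 4 * ((n_bytes + 2) // 3)


# Arbitrary stand-in for compressed output (not produced by the library).
compressed = bytes(range(30))
encoded = base64.b64encode(compressed)

# Every 3 input bytes become 4 output characters, roughly a 33% increase.
overhead = len(encoded) / len(compressed)
```

On short strings, where the compressor may only save a handful of bytes, that increase can cancel out most of the gain.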

Because the compressed data is always bytes, there is no string argument for the Decompress method (only byte[] input). Technically this method could return bytes (ASCII bytes). A different return type alone does not make a distinct method signature, so a separately named method would likely be needed (e.g. DecompressToBytes). If you feel strongly, I'm not averse to including such a method so long as backwards compatibility isn't broken. It wouldn't be a terrible amount of work (likely just lifting the bulk of the Decompress method out into DecompressToBytes). I'll happily review a PR.

Ultimately this library is designed to operate on strings.

I'm not sure I follow your use case. You should always compress the strings first, then perform encryption on the compressed data. Encryption libraries typically operate on byte arrays. Good encryption makes its output look random (high entropy), and high-entropy data compresses poorly because there are few patterns left to find, so compressing after encrypting gains little.

Encoding: ASCII Data string > Compress byte[] > Encrypt byte[] > Store/Send byte[]

Decoding: Load/Receive byte[] > Decrypt byte[] > Decompress string > ASCII Data string
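The ordering argument can be checked with a small Python sketch. Everything here is a stand-in: `zlib` replaces this library's compressor, and `toy_encrypt` is a hypothetical SHA-256 keystream XOR used only to produce random-looking bytes. It is not secure and should not be used as real encryption.

```python
import hashlib
import zlib


def keystream(key: bytes, length: int) -> bytes:
    """Pseudo-random bytes from SHA-256 over key + counter (toy, NOT secure)."""
    blocks = bytearray()
    counter = 0
    while len(blocks) < length:
        blocks += hashlib.sha256(key + counter.to_bytes(8, "big")).digest()
        counter += 1
    return bytes(blocks[:length])


def toy_encrypt(key: bytes, data: bytes) -> bytes:
    """XOR data with the keystream; applying it twice restores the input."""
    return bytes(a ^ b for a, b in zip(data, keystream(key, len(data))))


toy_decrypt = toy_encrypt  # XOR with the same keystream is its own inverse

key = b"demo key, not a secret"
plaintext = b"the quick brown fox jumps over the lazy dog " * 20

# Recommended order: compress first, then encrypt the compressed bytes.
compress_then_encrypt = toy_encrypt(key, zlib.compress(plaintext))

# Reverse order: the compressor sees random-looking bytes and finds no patterns.
encrypt_then_compress = zlib.compress(toy_encrypt(key, plaintext))

# Decoding mirrors the encoding pipeline: decrypt, then decompress.
restored = zlib.decompress(toy_decrypt(key, compress_then_encrypt))
```

Comparing `len(compress_then_encrypt)` with `len(encrypt_then_compress)` shows the first pipeline producing far fewer bytes for this repetitive input.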

You might also want to check out ShocoSharp which lets you train a codebook yourself. This codebook will hopefully be effective in compressing the type of data you need. Because you can train a model on any data, this library does let you compress/decompress byte arrays (or streams or array segments).

What are your thoughts?

Cheers, Gary

Jun 26 '21 03:06 garysharp