Snappy compression for serialization_sorter.h?

Open hendrikmuhs opened this issue 10 years ago • 3 comments

Hi,

I am using serialization_sorter.h to sort huge amounts of key-value data (strings, variable length).

Is it possible, and do you think it makes sense, to implement Snappy compression for it? If so, where would be the best place to add it?

I would think here: https://github.com/thomasmoelhave/tpie/blob/master/tpie/serialization_stream.h

I also considered compressing at least the values myself in serialize and unserialize, but since my values are only about 50-400 characters long, compressing these short strings separately would not be very effective.

I think block-wise compression would make more sense.
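To make the block-wise idea concrete, here is a minimal, self-contained sketch. It is not tpie code: block_writer, read_all and codec_fn are hypothetical names, and the run-length codec is only a stand-in so the example compiles without the Snappy library. Records are length-prefixed, collected in a buffer, and the codec is applied once per block rather than once per string. The sketch shows the buffering structure only and says nothing about compression ratios.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical codec type: any function mapping bytes to bytes.
using codec_fn = std::function<std::string(const std::string&)>;

// Stand-in codec (run-length encoding), NOT Snappy. Emits (count, byte) pairs.
std::string rle_compress(const std::string& in) {
    std::string out;
    for (std::size_t i = 0; i < in.size();) {
        std::size_t run = 1;
        while (i + run < in.size() && in[i + run] == in[i] && run < 255) ++run;
        out.push_back(static_cast<char>(run));
        out.push_back(in[i]);
        i += run;
    }
    return out;
}

std::string rle_decompress(const std::string& in) {
    std::string out;
    for (std::size_t i = 0; i + 1 < in.size(); i += 2) {
        std::size_t run = static_cast<unsigned char>(in[i]);
        out.append(run, in[i + 1]);
    }
    return out;
}

// Hypothetical writer: buffers whole records and compresses one block at a time.
class block_writer {
public:
    block_writer(std::size_t block_size, codec_fn compress)
        : m_block_size(block_size), m_compress(std::move(compress)) {
        assert(block_size > 0);
    }

    // Append one serialized record with a 4-byte little-endian length prefix.
    void write(const std::string& record) {
        std::uint32_t n = static_cast<std::uint32_t>(record.size());
        for (int i = 0; i < 4; ++i)
            m_buffer.push_back(static_cast<char>((n >> (8 * i)) & 0xff));
        m_buffer += record;
        if (m_buffer.size() >= m_block_size) flush();
    }

    // Compress the buffered records as a single block.
    void flush() {
        if (m_buffer.empty()) return;
        m_blocks.push_back(m_compress(m_buffer));
        m_buffer.clear();
    }

    const std::vector<std::string>& blocks() const { return m_blocks; }

private:
    std::size_t m_block_size;
    codec_fn m_compress;
    std::string m_buffer;
    std::vector<std::string> m_blocks;  // stands in for blocks written to disk
};

// Decompress each block and split it back into the original records.
std::vector<std::string> read_all(const std::vector<std::string>& blocks,
                                  const codec_fn& decompress) {
    std::vector<std::string> records;
    for (const std::string& block : blocks) {
        const std::string raw = decompress(block);
        std::size_t pos = 0;
        while (pos + 4 <= raw.size()) {
            std::uint32_t n = 0;
            for (int i = 0; i < 4; ++i)
                n |= static_cast<std::uint32_t>(
                         static_cast<unsigned char>(raw[pos + i])) << (8 * i);
            pos += 4;
            records.push_back(raw.substr(pos, n));
            pos += n;
        }
    }
    return records;
}
```

With a real codec, rle_compress and rle_decompress would be replaced by thin wrappers around the compression library, and the block size would be chosen large enough that many 50-400 character values share one compressed block.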

(I would implement it myself and send you a PR)

hendrikmuhs avatar May 21 '15 14:05 hendrikmuhs

It would definitely make sense to compress the blocks instead of compressing the individual text strings. If @mortal has time, perhaps he can tell us what the best approach would be. If you want to implement this, that would be great; we can probably allocate some time for @svendcsvendsen to help you.

antialize avatar May 21 '15 15:05 antialize

Using Snappy for compression in the serialization_sorter definitely makes a lot of sense for situations like this. @mortal implemented the serialization code and knows the most about it. However, I'll definitely be available if you need help with the implementation.

svendcs avatar May 21 '15 15:05 svendcs

Actually, block-wise compression makes more sense for serialization streams than ordinary streams, since serialization streams do not support seek.

The four stream classes serialization{_reverse,}{_reader,_writer} derive from bits::serialization_{reader,writer}_base, and the two base classes implement read_block and write_block, which the stream classes use more or less as a black box.

Compressed serialization streams should ideally be implemented to use the compressor thread, passing in read and write requests which support both forward and backward reading -- exactly what the serialization_reverse_reader needs.
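One way to get blocks that can be read in either direction is to store each compressed block's size both before and after its payload, so a reader positioned at the end of a block can find where it starts. The sketch below illustrates that framing on an in-memory string. It is an assumption for illustration, not a description of tpie's actual on-disk format or of the compressor thread's request API; append_block, read_forward and read_backward are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Write a 32-bit value in little-endian byte order.
void put_u32(std::string& out, std::uint32_t v) {
    for (int i = 0; i < 4; ++i)
        out.push_back(static_cast<char>((v >> (8 * i)) & 0xff));
}

// Read a 32-bit little-endian value starting at byte offset pos.
std::uint32_t get_u32(const std::string& in, std::size_t pos) {
    std::uint32_t v = 0;
    for (int i = 0; i < 4; ++i)
        v |= static_cast<std::uint32_t>(
                 static_cast<unsigned char>(in[pos + i])) << (8 * i);
    return v;
}

// Append one (already compressed) payload framed as [size][payload][size].
void append_block(std::string& file, const std::string& payload) {
    const std::uint32_t n = static_cast<std::uint32_t>(payload.size());
    put_u32(file, n);
    file += payload;
    put_u32(file, n);
}

// Walk the blocks front to back using the leading size field.
std::vector<std::string> read_forward(const std::string& file) {
    std::vector<std::string> blocks;
    std::size_t pos = 0;
    while (pos < file.size()) {
        const std::uint32_t n = get_u32(file, pos);
        blocks.push_back(file.substr(pos + 4, n));
        pos += 4 + n + 4;  // header + payload + trailer
    }
    return blocks;
}

// Walk the blocks back to front using the trailing size field.
std::vector<std::string> read_backward(const std::string& file) {
    std::vector<std::string> blocks;
    std::size_t end = file.size();
    while (end > 0) {
        const std::uint32_t n = get_u32(file, end - 4);
        assert(get_u32(file, end - 8 - n) == n);  // header must match trailer
        blocks.push_back(file.substr(end - 4 - n, n));
        end -= 4 + n + 4;
    }
    return blocks;
}
```

The cost is a few extra bytes per block, which is negligible at typical block sizes. A reverse reader would decompress each block it finds this way and then hand out the records inside it in reverse order.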

Perhaps process_read_request and process_write_request are a good place to start learning how the compressed streams work.

Mortal avatar May 22 '15 08:05 Mortal