Issue with simple decompression of a stream compressed string
I have the following code, where I try to 'simple' decompress json lines compressed with streaming compression. When comparing to decompressions from a 'simple' compressed string I mostly get the same result, however sometimes the decompressed result doesn't match. You can replicate the issue with the following code:
def decompress_compressed_lines(line_length)
lines = (0...10_000).map{"a" * line_length}
stream = Zstd::StreamingCompress.new
compressed_with_stream = lines.map do |line|
stream.compress(line)
stream.finish
end
compressed_with_simple = lines.map do |line|
Zstd.compress(line)
end
aaa = compressed_with_stream.map{|a| Zstd.decompress(a)}
bbb = compressed_with_simple.map{|b| Zstd.decompress(b)}
aaa.zip(bbb).count{|a, b| a != b}
end
>> 20.times.map{PlExportCache.decompress_compressed_lines(1023)}
=> [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
>> 20.times.map{PlExportCache.decompress_compressed_lines(1024)}
=> [0, 0, 0, 1, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0]
>> 20.times.map{PlExportCache.decompress_compressed_lines(2047)}
=> [0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0]
>> 20.times.map{PlExportCache.decompress_compressed_lines(2048)}
=> [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
If I replace 'simple' decompression with stream decompression I consistently get the correct results. Is there any obvious reason why this would happen? Using zstd-ruby (1.5.6.6).
Edit: It seems the issue occurs only when lines are between 1024 and 2048 of length.
It appears to happen more with longer strings and it looks like it's always the first 32 bytes od the output that get corrupted
It could be caused by race condition occurring when multiple threads call ZSTD_decompressStream which modifies both input and output buffer. Removing threading in zstd_stream_decompress seems to fix the problem.
It is curious though that it doesn't seem to happen (as often?) with Zstd::StreamingDecompress. Maybe something about how decompress_buffered creates output vs. how rb_streaming_decompress_decompress does it?
The weird randomness probably stems from garbage collector rather than threading. Adding RB_GC_GUARD to loop in decompress_buffered as per https://docs.ruby-lang.org/en/master/extension_rdoc.html#label-Appendix+E.+RB_GC_GUARD+to+protect+from+premature+GC seems to also fix the problem
diff --git a/ext/zstdruby/zstdruby.c b/ext/zstdruby/zstdruby.c
index 512aa7b..bdab9c6 100644
--- a/ext/zstdruby/zstdruby.c
+++ b/ext/zstdruby/zstdruby.c
@@ -102,6 +102,7 @@ static VALUE decompress_buffered(ZSTD_DCtx* dctx, const char* input_data, size_t
rb_raise(rb_eRuntimeError, "%s: %s", "ZSTD_decompressStream failed", ZSTD_getErrorName(ret));
}
rb_str_cat(result, output.dst, output.pos);
+ RB_GC_GUARD(output_string);
}
ZSTD_freeDCtx(dctx);
return result;
That sounds quite terrifying. Any chance for you to take a look, @SpringMT?
Sorry, I'll check this in this week 🙇
@mperice Sorry for the late reply. I tried to reproduce the issue, but I couldn't in my environment with ruby 3.3.0.
Could you tell me which version of ruby you are using?
We are using 3.2.4.
On Fri, Oct 25, 2024, 15:45 Spring_MT @.***> wrote:
@mperice https://github.com/mperice Sorry for the late reply. I tried to reproduce the issue, but I couldn't in my environment with ruby 3.3.0.
Could you tell me which version of ruby you are using?
— Reply to this email directly, view it on GitHub https://github.com/SpringMT/zstd-ruby/issues/93#issuecomment-2437824389, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAE6HAZFLVYCT5KUAAREVC3Z5JDP5AVCNFSM6AAAAABNVLZQLKVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDIMZXHAZDIMZYHE . You are receiving this because you were mentioned.Message ID: @.***>
I couldn't reproduce in my environment with ruby 3.2.4.
require 'zstd-ruby'
def decompress_compressed_lines(line_length)
lines = (0...10_000).map{"a" * line_length}
stream = Zstd::StreamingCompress.new
compressed_with_stream = lines.map do |line|
stream.compress(line)
stream.finish
end
compressed_with_simple = lines.map do |line|
Zstd.compress(line)
end
aaa = compressed_with_stream.map{|a| Zstd.decompress(a)}
bbb = compressed_with_simple.map{|b| Zstd.decompress(b)}
aaa.zip(bbb).count{|a, b| a != b}
end
p 20.times.map{ decompress_compressed_lines(2047) }
% bundle exec ruby test.rb
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
🤔
The above experiments were done in a rails environment. Have installed a fresh installation of ruby 3.3.5 and am unable to produce ti as well.
Edit: Did a fresh install of Ruby 3.1.4, 3.2.4 I was able to reproduce it with a bit more iterations:
irb(main):023> p 20.times.map{ decompress_compressed_lines(2047) }
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
=> [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
irb(main):024> p 200.times.map{ decompress_compressed_lines(2047) }.sum
4
=> 4
irb(main):025> p 200.times.map{ decompress_compressed_lines(2047) }.sum
0
=> 0
irb(main):026> p 200.times.map{ decompress_compressed_lines(2047) }.sum
8
=> 8
Running Ubuntu 22.04.