Creating a transform stream to decompress Bzip2
From https://github.com/regular/unbzip2-stream/issues/10
I'm looking forward to using this module in https://etcher.io; however, I'm having trouble adapting the stream implementation that comes with this module into a typical Node.js transform stream that I can plug in directly.
Is there an example somewhere I could take a look at?
Thanks a lot!
What would work really well for me is a function that decompresses a single chunk independently of the surrounding context (I'm not sure whether that's even possible with bzip2). I've tried passing a single chunk to decompressFile(), but it fails with "Data error", so I guess it needs to either read some sort of header containing information about the file, or it needs some other context.
Oh, I just noticed Bzip2.js exports an undocumented decompressBlock function.
It accepts a pos argument that seems to be used for seeking (https://github.com/cscott/compressjs/blob/master/lib/Bzip2.js#L488), but what does it mean?
I found the following test case that uses decompressBlock:
it('should correctly decode our example file', function() {
  var compressedData = fs.readFileSync('test/sample0.bz2');
  var data = Bzip2.decompressBlock(compressedData, 32);
  data = new Buffer(data).toString('utf8');
  assert.equal(data, "This is a test\n");
});
This sets pos to 32; however, the test file used in this example is 55 bytes, so the position doesn't seem to be related to the file or block size.
To make things worse, the test suite contains the following file-to-blocks definitions:
[{file:'sample2',blocks:[544888]},
{file:'sample4',blocks:[32,1596228,2342106]}].forEach(function(o) {
sample2 is 73732 bytes and sample4 is 305260 bytes. What does each block value mean?
OK, so as far as I understand, pos refers to the block inside the buffer that you want to decompress: 0 decompresses the first block, 1 decompresses the second, and so on. You're using this in the tests to assert against specific blocks instead of decompressing the whole thing, right?
In that case, I should be able to take each chunk from my transform stream and decompress the "blocks" in it. What is the block size, though?
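A note on those numbers (my reading of the seek-bzip README, not something stated in this thread): the pos/blocks values appear to be bit offsets into the compressed file, not block indexes or byte offsets. bzip2 blocks are not byte-aligned, and the fixed file header, "BZh" plus an ASCII level digit, is 4 bytes long, which would explain why the first block in every sample starts at 32:

```javascript
// Assumption: block positions are bit offsets into the compressed file.
// The bzip2 file header is "BZh" plus a level digit: 4 bytes = 32 bits,
// so the first block always begins at bit 32.
var HEADER_BYTES = 'BZh9'.length; // 4
var firstBlockBitOffset = HEADER_BYTES * 8;
console.log(firstBlockBitOffset); // 32 -- matches pos in the test above
```

That reading would also explain why 544888, 1596228, and 2342106 are larger than the files' byte sizes: they are bit positions (byte size * 8 bounds them).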
So it seems that I misunderstood what "blocks" mean in this context:
See http://www.bzip.org/1.0.3/html/memory-management.html
bzip2 compresses large files in blocks. The block size affects both the compression ratio achieved, and the amount of memory needed for compression and decompression. The flags -1 through -9 specify the block size to be 100,000 bytes through 900,000 bytes (the default) respectively. At decompression time, the block size used for compression is read from the header of the compressed file, and bunzip2 then allocates itself just enough memory to decompress the file. Since block sizes are stored in compressed files, it follows that the flags -1 to -9 are irrelevant to and so ignored during decompression.
So the block size is 100000 * level (for levels 1 to 9). My "chunks" are not bzip2 blocks :/
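Since the level digit lives in the file header, the block size can be read without decompressing anything. A minimal sketch of that (my own illustration; bzip2BlockSize is not part of compressjs):

```javascript
// Reads the block size from a bzip2 header: "BZh" + level digit '1'..'9'.
// Block size is level * 100000 bytes (the *uncompressed* block size).
function bzip2BlockSize(buf) {
  if (buf.length < 4 || buf.toString('ascii', 0, 3) !== 'BZh') {
    throw new Error('not a bzip2 stream');
  }
  var level = buf[3] - 0x30; // ASCII digit to number
  if (level < 1 || level > 9) {
    throw new Error('bad level digit');
  }
  return level * 100000;
}

// A file compressed with -9 starts with "BZh9":
console.log(bzip2BlockSize(Buffer.from('BZh91AY&SY'))); // 900000
```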
I found a nice streaming example in your seek-bzip module (https://github.com/cscott/seek-bzip/blob/master/test/stream.js), which I adapted to the following:
var Bunzip = require('seek-bzip');
var stream = require('stream');
var Fiber = require('fibers');
var util = require('util');

var BunzipStream = function() {
  var trans = this;
  stream.Transform.call(trans); // initialize superclass.
  this._fiber = new Fiber(function() {
    var buffer = [], pos = 0;
    var inputStream = new Bunzip.Stream();
    inputStream.readByte = function() {
      if (pos >= buffer.length) {
        buffer = Fiber.yield(); pos = 0;
      }
      return buffer[pos++];
    };
    var outputStream = new Bunzip.Stream();
    outputStream.writeByte = function(_byte) {
      this.write(new Buffer([_byte]), 0, 1);
    };
    outputStream.write = function(buffer, bufOffset, length) {
      if (bufOffset !== 0 || length !== buffer.length) {
        buffer = buffer.slice(bufOffset, bufOffset + length);
      }
      Fiber.yield(buffer);
    };
    Bunzip.decode(inputStream, outputStream);
  });
  this._fiber.run();
};
util.inherits(BunzipStream, stream.Transform);

BunzipStream.prototype._transform = function(chunk, encoding, callback) {
  const result = this._fiber.run(chunk);
  return callback(null, result);
};

module.exports = BunzipStream;
I don't have much experience with Fibers, but as far as I understand, if I yield the buffer in outputStream.write with Fiber.yield(buffer), then this._fiber.run(chunk) should return that buffer; however, in my case it's always a buffer containing a null value. Am I missing something?
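One thing that stands out in the code above (my diagnosis, so take it with a grain of salt): Fiber.yield is doing double duty. readByte yields to *request* more input, and outputStream.write yields to *emit* output, so this._fiber.run(chunk) returns whichever value happened to be yielded last, and a chunk passed into a yield that was suspended inside write never reaches readByte. One way to untangle it is to reserve yield for input and emit output through an explicit callback (in the real stream, trans.push() would play that role). A stdlib-only sketch of the pattern using a generator instead of Fibers, with a toy byte transform standing in for Bunzip.decode:

```javascript
// Coroutine pattern with yield reserved for *input* only; output goes
// through an explicit callback, so the two directions can't be conflated.
// The decode loop here just upper-cases text -- a stand-in for the real
// Bunzip.decode(inputStream, outputStream) call.
function makeDecoder(onOutput) {
  function* decode() {
    for (;;) {
      var chunk = yield;                       // suspend until the next input chunk
      var out = Buffer.from(chunk.toString('utf8').toUpperCase());
      onOutput(out);                           // emit output without yielding
    }
  }
  var gen = decode();
  gen.next();                                  // prime: run up to the first yield
  return gen;
}

var pieces = [];
var decoder = makeDecoder(function(buf) { pieces.push(buf); });
decoder.next(Buffer.from('hello '));           // what _transform would do per chunk
decoder.next(Buffer.from('world'));
console.log(Buffer.concat(pieces).toString()); // "HELLO WORLD"
```

With this shape, _transform would run the coroutine with the incoming chunk and rely on the output callback to push decoded data, rather than interpreting the coroutine's return value.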
Closing stale issues