Gzip with a large buffer source doesn't raise an error

Open junbao00 opened this issue 2 months ago • 2 comments

Version

v24.11.1

Platform

Linux cc96c567cc46 6.12.53 #1 SMP PREEMPT_DYNAMIC Tue Oct 28 18:05:29 UTC 2025 x86_64 GNU/Linux

Subsystem

No response

What steps will reproduce the bug?

import { createReadStream, createWriteStream } from 'node:fs';
import { Readable } from 'node:stream';
import { pipeline } from 'node:stream/promises';
import { createGunzip, createGzip } from 'node:zlib';
import { buffer } from 'node:stream/consumers';

const buf = Buffer.allocUnsafe(4646837723);
await pipeline(Readable.from([buf]), createGzip(), createWriteStream('test.gz', { flags: 'w' }));
const out = await buffer(createReadStream('test.gz').pipe(createGunzip()));
console.log(out.length);

How often does it reproduce? Is there a required condition?

Always

What is the expected behavior? Why is that the expected behavior?

Either throw an error or output 4646837723 (the full input length).

What do you see instead?

The output is 351870427.

Additional information

No response

junbao00, Nov 22 '25 06:11

Here is what I found from researching this issue.


Gzip uses a 32-bit field (ISIZE) to store the original uncompressed size. This field can only represent values below:

2^32 = 4,294,967,296 bytes

Your input buffer is:

4,646,837,723 bytes

Since this exceeds the 4 GiB limit, gzip wraps the number modulo 2^32, as defined in RFC 1952 (“This contains the size of the original (uncompressed) input data modulo 2^32”).

The modulo calculation is:

4646837723 mod 2^32 = 4646837723 - 4294967296 = 351870427

So the value stored in the gzip trailer becomes:

351,870,427 bytes

This is why the gunzip stream returns an output of length 351,870,427 rather than the real size of about 4.6 GB. This is expected behavior per the gzip spec: the ISIZE field cannot represent uncompressed sizes of 4 GiB or more.
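The wraparound arithmetic above can be checked with a few lines of JavaScript. This is only a sketch of the modulo step; `ISIZE_MODULUS` and `isizeFor` are hypothetical names chosen for illustration, not part of Node.js or zlib.

```javascript
// Number of distinct values a 32-bit ISIZE field can hold.
const ISIZE_MODULUS = 2 ** 32;

// Hypothetical helper: the value RFC 1952 says ISIZE holds for an input
// of `byteLength` bytes (the size modulo 2^32).
function isizeFor(byteLength) {
  return byteLength % ISIZE_MODULUS;
}

console.log(isizeFor(4646837723)); // 351870427
console.log(isizeFor(1024));       // 1024 (sizes below 2^32 are stored as-is)
```

Plain numbers are enough here because both values are far below `Number.MAX_SAFE_INTEGER`.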

If the original size must be preserved, gzip cannot be used; ZIP64 or Zstandard are required.


Related links: https://unix.stackexchange.com/questions/612905/how-portable-is-a-gzip-file-over-4-gb-in-size

So, from my understanding, this is not a Node.js bug. The behavior is due to a limitation in the gzip file format specification (RFC 1952). Gzip stores the uncompressed size in a 32-bit field (ISIZE), which can only represent values below 4 GiB. The size of any larger input wraps around modulo 2^32, which is why the output is smaller than the original size.
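To look at the ISIZE field directly, one option is to read the last four bytes of a gzip stream as a little-endian unsigned integer, which is where RFC 1952 places it. The sketch below assumes a single-member gzip buffer held fully in memory; `readIsize` is a hypothetical helper name, not a zlib API.

```javascript
import { gzipSync } from 'node:zlib';

// Hypothetical helper: read ISIZE from the trailer of a single-member gzip
// buffer. The trailer is CRC32 (4 bytes) followed by ISIZE (4 bytes, LE).
function readIsize(gz) {
  return gz.readUInt32LE(gz.length - 4);
}

const input = Buffer.from('hello gzip');
const gz = gzipSync(input);

// For inputs smaller than 2^32 bytes, ISIZE equals the input length.
console.log(readIsize(gz), input.length); // 10 10
```

Comparing this value with the length actually produced by decompression is one way to tell a trailer that merely wrapped from data that was really lost.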

You can throw an error manually, without changing internals, with something like this:

import { createReadStream, createWriteStream } from 'node:fs';
import { Readable } from 'node:stream';
import { pipeline } from 'node:stream/promises';
import { createGunzip, createGzip } from 'node:zlib';
import { buffer } from 'node:stream/consumers';
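const MAX_GZIP_SIZE = 2 ** 32; // ISIZE wraps at this many bytes
const buf = Buffer.allocUnsafe(4646837723);

// A length of exactly 2^32 already wraps to 0, so compare with >=.
if (buf.length >= MAX_GZIP_SIZE) {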
  throw new Error(`Input too large for gzip: ${buf.length} bytes`);
}
await pipeline(Readable.from([buf]), createGzip(), createWriteStream('test.gz', { flags: 'w' }));
const out = await buffer(createReadStream('test.gz').pipe(createGunzip()));
console.log(out.length);

pckrishnadas88, Dec 03 '25 18:12

Not true. If I split the buffer manually and write all of the parts to the stream, everything works fine, like this:

// entire test code
import { createReadStream, createWriteStream } from 'node:fs';
import { unlink } from 'node:fs/promises';
import { Readable } from 'node:stream';
import { buffer } from 'node:stream/consumers';
import { pipeline } from 'node:stream/promises';
import { createGunzip, createGzip } from 'node:zlib';

const buf = Buffer.allocUnsafe(4646837723);

console.time('test');
await pipeline(Readable.from([buf]), createGzip(), createWriteStream('test.gz', { flags: 'w' }));
const out = await buffer(createReadStream('test.gz').pipe(createGunzip()));
await unlink('test.gz');
console.timeLog('test', 'Write with single buffer: ' + out.length);

const kChunkSize = 2 ** 31 - 1;
const parts = [];
for (let i = 0; i < buf.length; i += kChunkSize) {
  parts.push(buf.subarray(i, i + kChunkSize));
}

await pipeline(Readable.from(parts), createGzip(), createWriteStream('test-2.gz', { flags: 'w' }));
const out2 = await buffer(createReadStream('test-2.gz').pipe(createGunzip()));
await unlink('test-2.gz');
console.timeLog('test', 'Write with multiple buffer chunks: ' + out2.length + ', is same: ' + buf.equals(out2));

The output is:

test: 831.56ms Write with single buffer: 351870427
test: 19.653s Write with multiple buffer chunks: 4646837723, is same: true
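The splitting loop in the test above could also be written as a small reusable generator. This is a sketch of the same workaround, not a fix for the underlying behavior; `chunksOf` is a hypothetical helper name, and the demo uses a tiny buffer so it runs instantly.

```javascript
// Hypothetical helper: yield zero-copy views of `buf`, each at most
// `chunkSize` bytes long, in order.
function* chunksOf(buf, chunkSize) {
  if (!Number.isInteger(chunkSize) || chunkSize <= 0) {
    throw new RangeError('chunkSize must be a positive integer');
  }
  for (let i = 0; i < buf.length; i += chunkSize) {
    // subarray() shares memory with `buf`, so no data is copied.
    yield buf.subarray(i, i + chunkSize);
  }
}

const demo = Buffer.from('abcdefghij');
const demoParts = [...chunksOf(demo, 4)];
console.log(demoParts.map((p) => p.length));        // [ 4, 4, 2 ]
console.log(Buffer.concat(demoParts).equals(demo)); // true
```

Since `Readable.from()` accepts any iterable, passing `chunksOf(buf, 2 ** 31 - 1)` to it should behave like the `parts` array in the test, without building the array first.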

junbao00, Dec 04 '25 15:12