Crash in JSStreamSocket.finishShutdown
Version
v18.16.0
Platform
Both Windows + Linux, many different devices
Subsystem
net
What steps will reproduce the bug?
Unfortunately, I can't reproduce this myself. It's happening at frequent intervals to many of my users though, crashing Node on their machines when running httptoolkit-server (an HTTP debugging proxy) locally.
This issue has just appeared (10s of times a day now) after deploying an update from Node v16.20.1 to v18.16.0.
Internally, this code uses https://github.com/httptoolkit/mockttp/ which does plenty of interesting things with networking & streams that could be related, including running TLS, HTTP and WebSockets on top of other HTTP/1 connections & HTTP/2 streams for traffic proxying purposes. The relevant code for that is here: https://github.com/httptoolkit/mockttp/blob/main/src/server/http-combo-server.ts.
How often does it reproduce? Is there a required condition?
From the direct reports I've had, it seems this happens intermittently after handling a few minutes of HTTP traffic. The client is Chrome in every case I'm aware of, configured to proxy traffic via Node.
There's no obvious trigger on the server side - each user seems to be doing entirely different things there - but it's likely that a good amount of live web traffic is travelling through this stream so some kind of race condition seems very plausible.
What is the expected behavior? Why is that the expected behavior?
It shouldn't crash! :smile:
What do you see instead?
It crashes. The full stack trace I have is:
```
TypeError: Cannot read properties of null (reading 'finishShutdown')
File "node:internal/js_stream_socket", line 160, col 12, in JSStreamSocket.finishShutdown
File "node:internal/js_stream_socket", line 147, col 14, in null.<anonymous>
File "node:internal/process/task_queues", line 81, col 21, in process.processTicksAndRejections
```
Additional information
Obviously this isn't the most helpful bug report - there's no repro, and I have no easy way to quickly test possible solutions.
That said, I'm hoping somebody might be aware of what changes could be related and triggering this that I can investigate, or might know more about possible race conditions in the way that `finishShutdown` is used that could be tightened up to avoid issues like this.
From a quick skim of the JSStreamSocket code, this looks suspicious to me:
- The error is here, which means `finishShutdown` is being called with `null` as the first argument.
- That was called in `doShutdown` here, which means `this._handle` was null when `doShutdown` was called.
- We can't tell in this setup what triggered that, but `finishWrite` would do this if a shutdown was deferred to wait for a write to complete (there's a TODO in `doShutdown` from @addaleax specifically pointing out that this is a bit messy).
- In `doClose`, `this._handle` will be null here immediately before a call to `finishWrite`.
So, AFAICT: if something calls doShutdown with a pending write (deferring the shutdown) and then calls doClose before the shutdown completes, doClose will clear _handle and then call finishWrite, and that will then call doShutdown to finish the pending shutdown, which will crash.
Does that make sense? Is there a reason that can't happen? There may be other cases but from the outside that seems like the clearest series that would lead to a crash here.
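To make that suspected sequence concrete, here's a stripped-down toy model of the interaction (my own simplification for illustration - the method names mirror `lib/internal/js_stream_socket.js`, but the bodies are reduced to just the steps described above):

```js
// Simplified model of the suspected race - NOT the actual JSStreamSocket code.
class ModelJSStreamSocket {
  constructor() {
    this._handle = { finishShutdown() { /* notify the wrap */ } };
    this.pendingShutdown = null;
    this.writeInProgress = false;
  }

  doShutdown(req) {
    const handle = this._handle;          // may already be null - the suspected bug
    if (this.writeInProgress) {
      this.pendingShutdown = req;         // defer until the pending write completes
      return;
    }
    this.finishShutdown(handle, req, 0);
  }

  finishWrite() {
    this.writeInProgress = false;
    if (this.pendingShutdown !== null) {
      const req = this.pendingShutdown;
      this.pendingShutdown = null;
      this.doShutdown(req);               // re-enters doShutdown with _handle === null
    }
  }

  doClose() {
    this._handle = null;                  // handle cleared first...
    this.finishWrite();                   // ...then the deferred shutdown resumes
  }

  finishShutdown(handle, req, errCode) {
    handle.finishShutdown(req, errCode);  // TypeError if handle is null
  }
}

// The suspected ordering:
const s = new ModelJSStreamSocket();
s.writeInProgress = true;
s.doShutdown({});  // 1. shutdown requested while a write is pending -> deferred
s.doClose();       // 2. handle cleared, deferred shutdown resumes -> crash
// => TypeError: Cannot read properties of null (reading 'finishShutdown')
```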
If that's the cause, some possible fixes:
- It would seem easy to just pass `handle` into `doShutdown` (everywhere we call it here, we know we have the handle still in scope), but I don't know if that has larger implications - it looks like `doShutdown` is part of the interface used elsewhere, so we can't just change it.
- Maybe we can just return out of `doShutdown` if `this._handle` is already `null` (see the sketch after this list) - that means something else is already shutting everything down, so we shouldn't get involved (in this case, `doClose` calls `finishShutdown` for us straight after calling `finishWrite`, so there is indeed no need for this).
- Maybe `doClose` should check for this - if you call `finishWrite` with a shutdown pending after destroying the stream, you're in trouble. `doClose` calls `finishShutdown` itself anyway, so it would seem reasonable for it to clear pending shutdowns, because it's going to handle the shutdown itself.
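Applied to the toy model above, the second option is just an early return at the top of `doShutdown` - a sketch of the idea only, not the actual `lib/internal/js_stream_socket.js` change:

```js
// Sketch of the early-return idea, applied to the model above (not a tested
// patch against the real file): bail out of doShutdown() once doClose() has
// cleared the handle.
ModelJSStreamSocket.prototype.doShutdown = function doShutdown(req) {
  if (this._handle === null)
    return;                               // doClose() owns the shutdown now
                                          // (in the real code doClose calls
                                          // finishShutdown itself; the model
                                          // omits that step)

  const handle = this._handle;
  if (this.writeInProgress) {
    this.pendingShutdown = req;
    return;
  }
  this.finishShutdown(handle, req, 0);
};

// With the guard in place, the same sequence no longer throws:
//   s.doShutdown({}); s.doClose();
```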
The only change in this file between these versions is https://github.com/nodejs/node/commit/9ccf8b2ccc1b650ba0f9749e2ffe2780e51d1000 from @lpinca, which did change the logic directly around the failing code! Plausibly switching from `setImmediate` to `process.nextTick` meant that the IO events triggering this now fire in a different order, causing the above race? It seems we should protect against the race regardless, but it's a plausible trigger imo.
downgrade for now
I've found a way to reproduce this and I'm going to investigate and open a PR for a fix myself.
Just some quick notes for reference, in case anybody else is interested:
- I'm seeing this happening specifically when proxying HTTPS over HTTP/2 - i.e. using an HTTP/2 stream as a base for a TLS connection.
- When proxying traffic from Chrome, within 30 seconds of mild browser usage, Node crashes with this error whilst closing connections, but there's some kind of race as it doesn't happen exactly consistently and most connections do still close down fine.
- This is indeed triggered by https://github.com/nodejs/node/commit/9ccf8b2ccc1b650ba0f9749e2ffe2780e51d1000 - with that change reverted this doesn't happen (but I think that change is still sensible, and it's just that this has exposed an existing bug).
I'll keep digging; my best guess is that there's some race between parallel TLS & HTTP/2 shutdown events that causes this, in which case I'd imagine it's a general issue for any JS stream sockets. Any clues from anybody who might know more would be very welcome :smile:.
I got back to this, and I can reproduce the issue reliably in a test, like so:
```js
const common = require('../common');
const fixtures = require('../common/fixtures');
if (!common.hasCrypto)
common.skip('missing crypto');
const assert = require('assert');
const net = require('net');
const tls = require('tls');
const h2 = require('http2');
const tlsOptions = {
key: fixtures.readKey('agent1-key.pem'),
cert: fixtures.readKey('agent1-cert.pem'),
ALPNProtocols: ['h2']
};
const netServer = net.createServer((socket) => {
socket.allowHalfOpen = false;
// ^ This allows us to trigger this reliably, but it's not strictly required for the bug.
// In my production case, skipping this just fails elsewhere later.
h2Server.emit('connection', socket);
});
const h2Server = h2.createSecureServer(tlsOptions, (req, res) => {
res.writeHead(200);
res.end();
});
h2Server.on('connect', (req, res) => {
res.writeHead(200, {});
netServer.emit('connection', res.stream);
});
netServer.listen(0, common.mustCall(() => {
const proxyClient = h2.connect(`https://localhost:${netServer.address().port}`, {
rejectUnauthorized: false
});
const proxyReq = proxyClient.request({
':method': 'CONNECT',
':authority': 'example.com:443'
});
proxyReq.on('response', common.mustCall(((response) => {
assert.strictEqual(response[':status'], 200);
console.log('Tunnel created');
// Create a TLS socket within the tunnel, and start sending a request:
const tlsSocket = tls.connect({
socket: proxyReq,
ALPNProtocols: ['h2'],
rejectUnauthorized: false
});
// Immediately kill the TLS socket:
tlsSocket.destroy();
// => async CRASH
})));
}));
```
This crashes uncatchably with Node v18+ but runs without errors in Node v16.
I'll open a PR with that soon; I need to think a little more about the fixes.
The error here is on the server side, when proxying a stream over HTTP/2, but I think this is a general race condition that could affect any use of JS Stream Socket. The stack of streams in this case is:
- net.Socket
- TLS socket
- H2 session
- H2 stream <-- This is an HTTP CONNECT stream, proxying traffic on top
- JSStreamSocket <-- This appears because TLS expects to run over a socket, not a stream
- TLS stream
A requirement to easily reproduce this is that the TLS stream has `allowHalfOpen = false` set (but that appears to be standard for TLS in many cases). If that's omitted, it still breaks, but with the error from https://github.com/nodejs/node/issues/46094 later on instead (presumably when the TLS stream ends the other half for other reasons). Hopefully fixing this might also fix that issue.
While the example ends up complicated, that's just because this setup ensures it fails 100% of the time - I think there's a fundamental race here, and most other uses of JSStreamSocket (JSS) can fail due to this with the right timing.
What's going wrong here is:
- A server is handling a stream (here TLS) on top of a JSS on top of another stream (here HTTP/2).
- The remote client side closes their streams.
- The first event on the server side is the H2 stream ending. This calls destroy on the H2 stream, and schedules a next-tick 'close'.
- The TLS stream readable also ends, and due to `allowHalfOpen = false` it schedules a next-tick `stream.end()` to close the writable side.
- The JSS then receives that 'close' event from the H2 stream it's wrapping, and calls `this.destroy()`.
  - This runs `doClose` on the JSS itself, which schedules a `finishWrite` & `finishShutdown`, and then (synchronously with the 'close' event) sets `this._handle = null` to clear the JSS's handle.
- Then the TLS stream writable 'end' fires.
  - This calls `Socket._final` on the TLS stream, which calls `this._handle.shutdown`, which calls `doShutdown` on the JSS.
  - The JSS has already had `doClose` called, been destroyed (so `this._handle` is null), and has a `finishShutdown` scheduled via `setImmediate`. `doShutdown` grabs `this._handle` (uh oh) and:
    - Waits for `process.nextTick`
    - Calls `.end(cb)` on the H2 stream it's wrapping (which has already closed)
    - That fires immediately, because it's already closed
    - Then it runs `finishShutdown`, using the handle, which is null :boom:
This didn't happen before https://github.com/nodejs/node/commit/9ccf8b2ccc1b650ba0f9749e2ffe2780e51d1000, because with `setImmediate` instead of `process.nextTick` in that last step, the pending `doClose` steps (`finishWrite` & `finishShutdown`) ran first and set `this[kCurrentShutdownRequest]` to null, which tells the 2nd `finishShutdown` with the null handle to do nothing (so it never actually uses the null handle).
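For reference, the ordering difference is just the usual `process.nextTick`-before-`setImmediate` behaviour - a tiny standalone demo (illustrative only, not Node internals):

```js
// process.nextTick callbacks run before setImmediate callbacks queued in the
// same tick. Previously the deferred shutdown used setImmediate, so it ran
// *after* the finishWrite/finishShutdown cleanup that doClose() had already
// queued with setImmediate; with nextTick it now jumps ahead of that cleanup
// and runs while this._handle is already null.
setImmediate(() => console.log('2: doClose cleanup (finishWrite + finishShutdown)'));
process.nextTick(() => console.log('1: deferred doShutdown work'));
// Prints:
//   1: deferred doShutdown work
//   2: doClose cleanup (finishWrite + finishShutdown)
```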
This is easiest to reproduce with `allowHalfOpen` but that's not strictly required - I think this could affect any case where the stream using a JSStreamSocket and the stream wrapped inside it both end in the same tick (i.e. any time you run `doClose` and `doShutdown` in the same tick, you're going to crash next tick).
I've got quite a few working solutions to this - the challenge is finding the 'right' one. I'll open a PR for that tomorrow.
Fixed by #49400