Destroy sessions when an error occurs
The http2-wrapper version I used was 2.2.0.
I created a BFF with got that sends requests to a backend server on GCP Cloud Run, and encountered repeated ETIMEDOUT errors that never recovered. (These ETIMEDOUT errors were raised by got and passed to http2-wrapper's clientRequest._destroy.)
While this error was occurring, none of the requests got sent reached the backend.
According to logs I embedded in a forked http2-wrapper, while the error was occurring, sessions were always retrieved from agent.sessions.
On the other hand, when an error occurred but the connection recovered, the sessions were newly created.
I could force this recovery behavior by adding code to clientRequest._destroy that destroys the sessions in clientRequest.agent.sessions, deletes the entries in clientRequest.agent.queue, and deletes the session in clientRequest.agent.tlsSessionCache for the origin (clientRequest[kOrigin]).
However, this modification is not perfect, because there were still times when unrecovered errors occurred. (At those times, sessions were again retrieved from agent.sessions.)
Isn't it necessary to add such a process to destroy sessions, remove them from the queue, and so on, when an error occurs?
P.S. These unrecovered errors never occurred when setting agent: false in the options passed to http2wrapper.auto.
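A minimal sketch of that workaround (the wrapper function here is just for illustration):

const http2wrapper = require('http2-wrapper');

// Force `agent: false` so http2-wrapper opens a fresh session for the request
// instead of reusing a possibly broken one from its shared agent.
const autoWithoutAgent = (url, options, callback) =>
    http2wrapper.auto(url, {...options, agent: false}, callback);

const gotOptsNoAgent = {
    http2: true,
    request: autoWithoutAgent,
    // ...the rest of the options shown below
};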
The code using got:
const gotOpts = {
    method: 'post',
    headers: headers,
    body: body,
    https: { rejectUnauthorized: false },
    http2: true,
    request: http2wrapper.auto,
    timeout: { lookup: 100, connect: 60000, secureConnect: 1000, send: 10000, response: 10000 },
};
const response = await got(url, gotOpts);
...
And the code I added to http2-wrapper to force a connection to recover when an error occurred:
(client-request.js)
...
class ClientRequest extends Writable {
    ...
    async _destroy(error, callback) {
        // --- start of added code ---
        // When the request is destroyed with an error, tear down every session
        // (and the related queue/TLS-cache entries) that belongs to this origin.
        if (this.agent && error) {
            const origin = this[kOrigin];
            this.agent.destroySessionsByOrigin(origin);
        }
        // --- end of added code ---
        ...
    }
(agent.js)
...
class Agent extends EventEmitter {
    ...
    destroySessionsByOrigin(origin) {
        const {sessions, queue} = this;

        // Destroy every cached session whose origin set contains this origin,
        // and drop the matching TLS session ticket from the cache.
        for (const normalizedOptions in sessions) {
            const theseSessions = sessions[normalizedOptions];
            for (let index = 0; index < theseSessions.length; index++) {
                const session = theseSessions[index];
                if (session[kOriginSet].includes(origin)) {
                    session.destroy();
                }
            }
            const name = `${normalizedOptions}:${origin}`;
            this.tlsSessionCache.delete(name);
        }

        // Mark any queued connection attempts for this origin as destroyed.
        for (const normalizedOptions in queue) {
            const entries = queue[normalizedOptions];
            for (const normalizedOrigin in entries) {
                if (normalizedOrigin === origin) {
                    entries[normalizedOrigin].destroyed = true;
                }
            }
            const name = `${normalizedOptions}:${origin}`;
            this.tlsSessionCache.delete(name);
        }
    }

    destroy() {
    ...
Hi @sttk ! We are experiencing an issue where got sometimes ends up in a state where none of the requests reach the backend. Instead, requests time out with timings that look like this:
timings: {
  "start": 1698668866493,
  "socket": 1698668866494,
  "upload": 1698668866494,
  "error": 1698668896495,
  "phases": {
    "wait": 1,
    "request": null,
    "total": 30002
  }
}
This looks fishy to us, as "request" might be undefined, but to our understanding it should never be null.
We suspect that this has something to do with HTTP/2, and it sounds similar to what you're describing here:
While this error was occurring, none of the requests got sent reached the backend.
However, we haven't been able to reproduce the issue locally yet. Did you find a way to reliably reproduce this issue? If so, how did you do that?
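For reference, timings like these can be read from got's error objects. A minimal sketch, assuming error.timings is populated the way recent got versions populate it:

try {
    const response = await got(url, gotOpts);
    console.log(response.timings.phases);
} catch (error) {
    // got errors (RequestError and subclasses such as TimeoutError) carry the
    // timings recorded for the underlying request, in the shape shown above.
    console.error(error.code, JSON.stringify(error.timings, null, 2));
    throw error;
}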
Hi @sttk and @szmarczak !
I was able to reproduce the issue now, please find an example in this repository: https://github.com/thesauri/http2-wrapper-timeout-bug
In the example, the got client ends up in a state where requests stop working completely:
- First, the example code fires a large number of requests to an endpoint that doesn't respond
- Then, the example code tries requesting an endpoint that should respond normally
- These requests now fail with timeout errors too
To verify that the issue is in http2-wrapper, and not got, I tried removing the http2: true flag and request: http2wrapper.auto to ensure that HTTP/1.1 is used. With HTTP/1.1, the second set of requests works just fine.
Also, as a further observation, reducing the number of non-responding requests to a smaller amount (say n=100) does resolve the issue in the example.
Unfortunately, we do run into this issue every once in a while in our production environment, which runs with real-world flakiness. The only solutions we've found so far are either disabling HTTP/2 completely or restarting the server when this happens.
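The gist of the reproduction, as a minimal sketch (the endpoint paths and request counts are placeholders, not the exact code from the linked repository; it assumes a CommonJS-compatible got version):

const got = require('got');
const http2wrapper = require('http2-wrapper');

const client = got.extend({
    http2: true,
    request: http2wrapper.auto,
    https: { rejectUnauthorized: false },
    retry: { limit: 0 },
    timeout: { request: 10000 },
});

(async () => {
    // Step 1: flood an endpoint that never responds, so every request times out.
    await Promise.all(Array.from({ length: 1000 }, () =>
        client.get('https://localhost:9999/hang').catch(() => {})));

    // Step 2: a request to an endpoint that responds normally now times out as well.
    const response = await client.get('https://localhost:9999/ok');
    console.log(response.statusCode);
})();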
Interesting... Commenting out limit: 0 in @thesauri's example triggers a Node.js assertion:
$ node app.js
Waiting for timeout requests to finish...
(node:24727) [DEP0111] DeprecationWarning: Access to process.binding('http_parser') is deprecated.
(Use `node --trace-deprecation ...` to show where the warning was created)
Server listening on port 9999
node:internal/assert:14
throw new ERR_INTERNAL_ASSERTION(message);
^
Error [ERR_INTERNAL_ASSERTION]: This is caused by either a bug in Node.js or incorrect usage of Node.js internals.
Please open an issue with this stack trace at https://github.com/nodejs/node/issues
at assert (node:internal/assert:14:11)
at internalConnectMultiple (node:net:1118:3)
at Timeout.internalConnectMultipleTimeout (node:net:1687:3)
at listOnTimeout (node:internal/timers:575:11)
at process.processTimers (node:internal/timers:514:7) {
code: 'ERR_INTERNAL_ASSERTION'
}
Node.js v21.1.0
Hi @szmarczak,
I have created a reproduction of a similar issue here:
https://github.com/mantysalo/http2-issue
This only seems to manifest itself when using Got with retry enabled, so I opened an issue in the Got repository as well.
I managed to reproduce this by using just the http2-wrapper library, so I closed https://github.com/sindresorhus/got/issues/2310
I have changed the reproduction repository to only use http2-wrapper and node:http2 module.
There are some interesting things I've noticed while debugging this:
Issue only happens if the Send-Q of the socket is full
What I mean by full is that the value under Send-Q in netstat -anpo does not increase any further, even though the application is making requests.
See netstat logs below 👇
Netstat logs
Proto Recv-Q Send-Q Local Address Foreign Address State PID/Program name Timer
tcp 0 94506 172.21.0.4:59642 172.21.0.2:3001 ESTABLISHED 1/node probe (76.46/0/14)
tcp 0 94650 172.21.0.4:59642 172.21.0.2:3001 ESTABLISHED 1/node probe (75.45/0/14)
tcp 0 94772 172.21.0.4:59642 172.21.0.2:3001 ESTABLISHED 1/node probe (74.45/0/14)
tcp 0 94916 172.21.0.4:59642 172.21.0.2:3001 ESTABLISHED 1/node probe (73.45/0/14)
tcp 0 95060 172.21.0.4:59642 172.21.0.2:3001 ESTABLISHED 1/node probe (72.45/0/14)
tcp 0 95167 172.21.0.4:59642 172.21.0.2:3001 ESTABLISHED 1/node probe (71.44/0/14)
tcp 0 95239 172.21.0.4:59642 172.21.0.2:3001 ESTABLISHED 1/node probe (70.44/0/14)
tcp 0 95348 172.21.0.4:59642 172.21.0.2:3001 ESTABLISHED 1/node probe (69.43/0/14)
tcp 0 95492 172.21.0.4:59642 172.21.0.2:3001 ESTABLISHED 1/node probe (68.43/0/14)
tcp 0 95636 172.21.0.4:59642 172.21.0.2:3001 ESTABLISHED 1/node probe (67.42/0/14)
tcp 0 95780 172.21.0.4:59642 172.21.0.2:3001 ESTABLISHED 1/node probe (66.42/0/14)
tcp 0 95924 172.21.0.4:59642 172.21.0.2:3001 ESTABLISHED 1/node probe (65.41/0/14)
tcp 0 96068 172.21.0.4:59642 172.21.0.2:3001 ESTABLISHED 1/node probe (64.41/0/14)
tcp 0 96212 172.21.0.4:59642 172.21.0.2:3001 ESTABLISHED 1/node probe (63.41/0/14)
tcp 0 96356 172.21.0.4:59642 172.21.0.2:3001 ESTABLISHED 1/node probe (62.40/0/14)
tcp 0 96500 172.21.0.4:59642 172.21.0.2:3001 ESTABLISHED 1/node probe (61.40/0/14)
tcp 0 96607 172.21.0.4:59642 172.21.0.2:3001 ESTABLISHED 1/node probe (60.39/0/14)
tcp 0 96716 172.21.0.4:59642 172.21.0.2:3001 ESTABLISHED 1/node probe (59.39/0/14)
tcp 0 96823 172.21.0.4:59642 172.21.0.2:3001 ESTABLISHED 1/node probe (58.39/0/14)
tcp 0 96932 172.21.0.4:59642 172.21.0.2:3001 ESTABLISHED 1/node probe (57.38/0/14)
tcp 0 97076 172.21.0.4:59642 172.21.0.2:3001 ESTABLISHED 1/node probe (56.38/0/14)
tcp 0 97220 172.21.0.4:59642 172.21.0.2:3001 ESTABLISHED 1/node probe (55.38/0/14)
tcp 0 97364 172.21.0.4:59642 172.21.0.2:3001 ESTABLISHED 1/node probe (54.37/0/14)
tcp 0 97508 172.21.0.4:59642 172.21.0.2:3001 ESTABLISHED 1/node probe (53.37/0/14)
tcp 0 97652 172.21.0.4:59642 172.21.0.2:3001 ESTABLISHED 1/node probe (52.36/0/14)
## Send-Q is now full and does not increase even though requests are firing
tcp 0 97776 172.21.0.4:59642 172.21.0.2:3001 ESTABLISHED 1/node probe (51.36/0/14)
tcp 0 97776 172.21.0.4:59642 172.21.0.2:3001 ESTABLISHED 1/node probe (50.35/0/14)
tcp 0 97776 172.21.0.4:59642 172.21.0.2:3001 ESTABLISHED 1/node probe (49.35/0/14)
tcp 0 97776 172.21.0.4:59642 172.21.0.2:3001 ESTABLISHED 1/node probe (48.34/0/14)
tcp 0 97776 172.21.0.4:59642 172.21.0.2:3001 ESTABLISHED 1/node probe (47.33/0/14)
tcp 0 97776 172.21.0.4:59642 172.21.0.2:3001 ESTABLISHED 1/node probe (46.33/0/14)
tcp 0 97776 172.21.0.4:59642 172.21.0.2:3001 ESTABLISHED 1/node probe (45.32/0/14)
tcp 0 97776 172.21.0.4:59642 172.21.0.2:3001 ESTABLISHED 1/node probe (44.32/0/14)
tcp 0 97776 172.21.0.4:59642 172.21.0.2:3001 ESTABLISHED 1/node probe (43.32/0/14)
tcp 0 97776 172.21.0.4:59642 172.21.0.2:3001 ESTABLISHED 1/node probe (42.31/0/14)
tcp 0 97776 172.21.0.4:59642 172.21.0.2:3001 ESTABLISHED 1/node probe (41.31/0/14)
tcp 0 97776 172.21.0.4:59642 172.21.0.2:3001 ESTABLISHED 1/node probe (40.31/0/14)
tcp 0 97776 172.21.0.4:59642 172.21.0.2:3001 ESTABLISHED 1/node probe (39.30/0/14)
tcp 0 97776 172.21.0.4:59642 172.21.0.2:3001 ESTABLISHED 1/node probe (38.30/0/14)
tcp 0 97776 172.21.0.4:59642 172.21.0.2:3001 ESTABLISHED 1/node probe (37.29/0/14)
tcp 0 97776 172.21.0.4:59642 172.21.0.2:3001 ESTABLISHED 1/node probe (36.29/0/14)
tcp 0 97776 172.21.0.4:59642 172.21.0.2:3001 ESTABLISHED 1/node probe (35.28/0/14)
tcp 0 97776 172.21.0.4:59642 172.21.0.2:3001 ESTABLISHED 1/node probe (34.27/0/14)
tcp 0 97776 172.21.0.4:59642 172.21.0.2:3001 ESTABLISHED 1/node probe (33.27/0/14)
tcp 0 97776 172.21.0.4:59642 172.21.0.2:3001 ESTABLISHED 1/node probe (32.26/0/14)
tcp 0 97776 172.21.0.4:59642 172.21.0.2:3001 ESTABLISHED 1/node probe (31.25/0/14)
tcp 0 97776 172.21.0.4:59642 172.21.0.2:3001 ESTABLISHED 1/node probe (30.24/0/14)
tcp 0 97776 172.21.0.4:59642 172.21.0.2:3001 ESTABLISHED 1/node probe (29.24/0/14)
tcp 0 97776 172.21.0.4:59642 172.21.0.2:3001 ESTABLISHED 1/node probe (28.24/0/14)
tcp 0 97776 172.21.0.4:59642 172.21.0.2:3001 ESTABLISHED 1/node probe (27.23/0/14)
tcp 0 97776 172.21.0.4:59642 172.21.0.2:3001 ESTABLISHED 1/node probe (26.23/0/14)
tcp 0 97776 172.21.0.4:59642 172.21.0.2:3001 ESTABLISHED 1/node probe (25.23/0/14)
tcp 0 97776 172.21.0.4:59642 172.21.0.2:3001 ESTABLISHED 1/node probe (24.22/0/14)
tcp 0 97776 172.21.0.4:59642 172.21.0.2:3001 ESTABLISHED 1/node probe (23.21/0/14)
tcp 0 97776 172.21.0.4:59642 172.21.0.2:3001 ESTABLISHED 1/node probe (22.21/0/14)
tcp 0 97776 172.21.0.4:59642 172.21.0.2:3001 ESTABLISHED 1/node probe (21.20/0/14)
tcp 0 97776 172.21.0.4:59642 172.21.0.2:3001 ESTABLISHED 1/node probe (20.20/0/14)
tcp 0 97776 172.21.0.4:59642 172.21.0.2:3001 ESTABLISHED 1/node probe (19.19/0/14)
tcp 0 97776 172.21.0.4:59642 172.21.0.2:3001 ESTABLISHED 1/node probe (18.18/0/14)
tcp 0 97776 172.21.0.4:59642 172.21.0.2:3001 ESTABLISHED 1/node probe (17.18/0/14)
tcp 0 97776 172.21.0.4:59642 172.21.0.2:3001 ESTABLISHED 1/node probe (16.18/0/14)
tcp 0 97776 172.21.0.4:59642 172.21.0.2:3001 ESTABLISHED 1/node probe (15.17/0/14)
tcp 0 97776 172.21.0.4:59642 172.21.0.2:3001 ESTABLISHED 1/node probe (14.17/0/14)
tcp 0 97776 172.21.0.4:59642 172.21.0.2:3001 ESTABLISHED 1/node probe (13.16/0/14)
tcp 0 97776 172.21.0.4:59642 172.21.0.2:3001 ESTABLISHED 1/node probe (12.15/0/14)
tcp 0 97776 172.21.0.4:59642 172.21.0.2:3001 ESTABLISHED 1/node probe (11.15/0/14)
tcp 0 97776 172.21.0.4:59642 172.21.0.2:3001 ESTABLISHED 1/node probe (10.14/0/14)
tcp 0 97776 172.21.0.4:59642 172.21.0.2:3001 ESTABLISHED 1/node probe (9.14/0/14)
tcp 0 97776 172.21.0.4:59642 172.21.0.2:3001 ESTABLISHED 1/node probe (8.13/0/14)
tcp 0 97776 172.21.0.4:59642 172.21.0.2:3001 ESTABLISHED 1/node probe (7.13/0/14)
tcp 0 97776 172.21.0.4:59642 172.21.0.2:3001 ESTABLISHED 1/node probe (6.12/0/14)
tcp 0 97776 172.21.0.4:59642 172.21.0.2:3001 ESTABLISHED 1/node probe (5.11/0/14)
tcp 0 97776 172.21.0.4:59642 172.21.0.2:3001 ESTABLISHED 1/node probe (4.10/0/14)
tcp 0 97776 172.21.0.4:59642 172.21.0.2:3001 ESTABLISHED 1/node probe (3.10/0/14)
tcp 0 97776 172.21.0.4:59642 172.21.0.2:3001 ESTABLISHED 1/node probe (2.09/0/14)
tcp 0 97776 172.21.0.4:59642 172.21.0.2:3001 ESTABLISHED 1/node probe (1.09/0/14)
tcp 0 97776 172.21.0.4:59642 172.21.0.2:3001 ESTABLISHED 1/node probe (0.08/0/14)
tcp 0 97776 172.21.0.4:59642 172.21.0.2:3001 ESTABLISHED 1/node probe (0.00/0/14)
tcp 0 97776 172.21.0.4:59642 172.21.0.2:3001 ESTABLISHED 1/node probe (0.00/0/14)
This explains why I saw the issue with got, as got adds extra headers to the request, thus making the Send-Q fill up quicker.
If the Send-Q is not full and the socket is closed by the OS, the following debug logs are printed:
http2-issue-client-1 | NET 1: destroy
http2-issue-client-1 | NET 1: close
http2-issue-client-1 | NET 1: close handle
http2-issue-client-1 | HTTP2 1: Http2Session <invalid>: socket error [read ETIMEDOUT]
http2-issue-client-1 | HTTP2 1: Http2Session client: destroying
http2-issue-client-1 | HTTP2 1: Http2Session client: start closing/destroying Error: read ETIMEDOUT
http2-issue-client-1 | at TLSWrap.onStreamRead (node:internal/stream_base_commons:217:20) {
http2-issue-client-1 | errno: -110,
http2-issue-client-1 | code: 'ETIMEDOUT',
http2-issue-client-1 | syscall: 'read'
http2-issue-client-1 | }
However, when the Send-Q is full, no such logs are printed, and the Http2Session keeps using the socket even though it has been closed by the OS. (Weird that this does not produce any errors?)
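One way to make the dead connection visible (a hypothetical probe, not something the reproduction code does) is to race an HTTP/2 PING against a timer:

// Resolves true if the PING is acked in time; if the ack never arrives, the
// session is almost certainly still bound to a socket the OS has torn down.
function isSessionAlive(session, timeoutMs = 5000) {
    return new Promise((resolve) => {
        const timer = setTimeout(() => resolve(false), timeoutMs);
        const sent = session.ping((error) => {
            clearTimeout(timer);
            resolve(!error);
        });
        if (!sent) { // ping() returns false when the session is already destroyed
            clearTimeout(timer);
            resolve(false);
        }
    });
}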
Issue happens with node:http2 module as well
I've included the code in the reproduction repository. There might very well be an error in my code, as I'm not very familiar with the http2 module, but I find it weird that it only happens when the Send-Q is full; when it's not full, there are clear errors printed in the debug logs.
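For illustration, a client along these lines (a sketch only; the URL, path, and request count are placeholders, not the code from the repository):

const http2 = require('node:http2');

// Open one session and fire many requests at an endpoint that never answers.
const session = http2.connect('https://localhost:9999', { rejectUnauthorized: false });
session.on('error', (error) => console.error('session error', error));

for (let i = 0; i < 1000; i++) {
    const stream = session.request({ ':method': 'GET', ':path': '/hang' });
    stream.setTimeout(10000, () => stream.close(http2.constants.NGHTTP2_CANCEL));
    stream.on('error', () => {}); // each of these is expected to time out
    stream.on('response', (headers) => console.log('unexpected response', headers[':status']));
    stream.end();
}

// Once the socket's Send-Q fills up and the OS drops the connection, later
// requests on the same session hang silently instead of erroring out.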
EDIT:
Looks like it may be a very old bug in Node https://github.com/nodejs/node/issues/4417 this issue has basically all the same symptoms.