node icon indicating copy to clipboard operation
node copied to clipboard

Since v20.13.0 fetch() ing small files without reading their body delays process termination for 8 seconds

Open unilynx opened this issue 1 year ago • 7 comments

Version

v20.13.0

Platform

Linux 33f5daf61896 6.4.16-linuxkit #1 SMP PREEMPT Thu Nov 16 10:49:20 UTC 2023 aarch64 GNU/Linux

Subsystem

No response

What steps will reproduce the bug?

docker run --rm -ti node:20.13.0 /bin/bash

 cat > test.js
async function main() {
  const x = await fetch ('http://cms6-sites.webhare.com/test-arnold/empty.docx');
  console.log(x);
}
main()

^D

node test.js

The node process will not terminate after downloading the document

How often does it reproduce? Is there a required condition?

100%

What is the expected behavior? Why is that the expected behavior?

Process should terminate

What do you see instead?

Process does not terminate

Additional information

  • downgrading to 20.12.2 fixes the issue
  • the reported uname above is on a Mac M3 with Linux inside docker, but I can reproduce it on native x86_64 linux hosts too
  • it does not fail when the document is served under a different mimetype (eg try with http://cms6-sites.webhare.com/test-arnold/empty.docx)
  • it also hangs when the URL is changed to https
  • I can also reproduce it with other docx files

Edit: fixed version number where it still works, there is no 20.12.3, oops.

unilynx avatar May 08 '24 21:05 unilynx

I see connection timeout instead of hanging with 20.13.0 and 22.1.0. The problem is not related to fetch but more fundamental network connection.

You can see the same happen on http.request.

import { request } from 'http'

request('http://cms6-sites.webhare.com/test-arnold/empty.docx', (response) => {
  console.log(response)
  response.resume()
}).end()

climba03003 avatar May 09 '24 03:05 climba03003

more fundamental network connection

Possibly. To exclude the issue with any servers I control (although I originally found the issue on localhost with a different server setup) I've also tested against external servers.

It seems file size related. now with this code

async function main() {
  const x = await fetch (process.argv[2]);
  console.log(x);
}
main()

These are the results I see now, where Fails means 'it downloads but the process never terminates'

To confirm it's not my server:

# FAILS: 20034 bytes, completely different server I know nothing about
node test.js https://www.mhlw.go.jp/content/11120000/000908344.docx
# FAILS: 20306 bytes
node test.js https://www.careers.govt.nz/assets/pages/cv-and-cover-letter-templates/Cover-letter-template.docx
# Works: 92682 bytes
node test.js https://www.euro-divers.com/wp-content/uploads/2020/05/scuba-diving-instructor-resume.docx

To demonstrate it appears to be size related (all these docs are random data)

# Works: 164864 bytes
node test.js http://cms6-sites.webhare.com/test-arnold/empty-160kb.docx
# Fails: 81920 bytes
node test.js http://cms6-sites.webhare.com/test-arnold/empty-80kb.docx
# Fails: 93184 bytes
node test.js http://cms6-sites.webhare.com/test-arnold/empty-91kb.docx
# Works: 94208 bytes
node test.js http://cms6-sites.webhare.com/test-arnold/empty-92kb.docx

And it's not the file size per se but seems to have to do with the request decoding

# Works: 20608 but this file got gzipped by nginx as it didn't recgnoze the extension, and transferred as chunked
node test.js http://cms6-sites.webhare.com/test-arnold/empty.stuff

As far as I can tell, docs < 80KB fail, > 160 KB work, and I can confirm this with a few random servers found in google. For me the exact cutoff point is somewhere between 91KB and 92KB but if it's network related, the exact cutoff might differ for others

EDIT: now also tested from a Rocky 9 host at Linode, ie in a DC, which seems to cut out somewhere between 120KB and 160KB

unilynx avatar May 09 '24 08:05 unilynx

It is a bit different from my environment. All of the link from http://cms6-sites.webhare.com result in network timeout. But all the others works fine.

The address I connected to for cms6-sites.webhare.com are 104.248.92.95 and 2a03:b0c0:2:f0::a4:5001

import { request as http } from 'http';
import { request as https } from 'https';
import { fetch } from 'undici';

const sites = [
  'http://cms6-sites.webhare.com/test-arnold/empty.docx',
  'http://cms6-sites.webhare.com/test-arnold/empty-80kb.docx',
  'http://cms6-sites.webhare.com/test-arnold/empty-91kb.docx',
  'http://cms6-sites.webhare.com/test-arnold/empty-92kb.docx',
  'http://cms6-sites.webhare.com/test-arnold/empty-160kb.docx',
  'https://www.mhlw.go.jp/content/11120000/000908344.docx',
  'https://www.careers.govt.nz/assets/pages/cv-and-cover-letter-templates/Cover-letter-template.docx',
  'https://www.euro-divers.com/wp-content/uploads/2020/05/scuba-diving-instructor-resume.docx'
]

for(const site of sites) {
  _fetch(site)
    .then(() => { console.log('fetch done: %s', site) })
    .catch((err) => { console.log('fetch fail: %s', site) })
  _request(site)
    .then(() => { console.log('request done: %s', site) })
    .catch((err) => { console.log('request fail: %s', site) })
}

function _fetch(href) {
  return fetch (href)
}

function _request(href) {
  return new Promise((resolve, reject) => {
    const request = href.startsWith('https') ? https : http
    request(href, (response) => {
      resolve(response)
      response.resume()
    }).on('error', (err) => {
      reject(err)
    }).end()
  })
}
request done: https://www.mhlw.go.jp/content/11120000/000908344.docx
fetch done: https://www.mhlw.go.jp/content/11120000/000908344.docx
request done: https://www.euro-divers.com/wp-content/uploads/2020/05/scuba-diving-instructor-resume.docx
fetch done: https://www.euro-divers.com/wp-content/uploads/2020/05/scuba-diving-instructor-resume.docx
request done: https://www.careers.govt.nz/assets/pages/cv-and-cover-letter-templates/Cover-letter-template.docx
fetch done: https://www.careers.govt.nz/assets/pages/cv-and-cover-letter-templates/Cover-letter-template.docx
fetch fail: http://cms6-sites.webhare.com/test-arnold/empty.docx
fetch fail: http://cms6-sites.webhare.com/test-arnold/empty-80kb.docx
fetch fail: http://cms6-sites.webhare.com/test-arnold/empty-91kb.docx
fetch fail: http://cms6-sites.webhare.com/test-arnold/empty-92kb.docx
fetch fail: http://cms6-sites.webhare.com/test-arnold/empty-160kb.docx
request fail: http://cms6-sites.webhare.com/test-arnold/empty.docx
request fail: http://cms6-sites.webhare.com/test-arnold/empty-80kb.docx
request fail: http://cms6-sites.webhare.com/test-arnold/empty-92kb.docx
request fail: http://cms6-sites.webhare.com/test-arnold/empty-160kb.docx
request fail: http://cms6-sites.webhare.com/test-arnold/empty-91kb.docx

climba03003 avatar May 09 '24 09:05 climba03003

The problem related to the auto select family attempt timeout. You can increase the timeout to a more reasonable value if your network is slow.

import { setDefaultAutoSelectFamilyAttemptTimeout } from 'net';
setDefaultAutoSelectFamilyAttemptTimeout(1000)

cc @ShogunPanda I see you added the command-line option --network-family-autoselection-attempt-timeout but it is actually not taking effect while I am testing. Only the setDefaultAutoSelectFamilyAttemptTimeout does affect the timeout.

climba03003 avatar May 09 '24 10:05 climba03003

The problem related to the auto select family attempt timeout. You can increase the timeout to a more reasonable value if your network is slow.

I'm not seeing network timeouts. I always see the requests complete immediately (well at least < 300ms, I haven't properly measured it).

My problem is the process is not terminating (apprently because of a lingering/leaking TCPSocketWrap object because that's what my CI reported, but I haven't yet built a testcase that would also pinpoint it)

If you're seeing network timeouts that's a pity, because then you're not reproducing the issue I'm seeing. I see you're in Hong Kong and all the testserves I have easy acces to are in Western Europe, so latency may be a factor.

If noone else chimes in to confirm or deny being able to reproduce the issue, I'll try to set up the files elsewhere

unilynx avatar May 09 '24 10:05 unilynx

My problem is the process is not terminating (apprently because of a lingering/leaking TCPSocketWrap object because that's what my CI reported, but I haven't yet built a testcase that would also pinpoint it)

~~Yes, it is not terminating with the built-in one. I am checking along with the timeout problem and using the undici one. If you are using the undici one, it is actually exit properly with keep-alive timeout is over. I assume the problem is fixed somehow in the upstream and will eventually be landed in v20 release line.~~

Said too early, it will close in both case but need to wait for a bit long if using the built-in. Built-in wait until the keep-alive timeout 60 seconds, undici exit early when no more connection after the keep-alive checking threshold. I assume the problem is fixed somehow in the upstream and will eventually be landed in v20 release line

Built-in ends in 65839.6259ms
Undici ends in 8334.4074ms

climba03003 avatar May 09 '24 10:05 climba03003

keep-alive might indeed be part of the issue, but that doesn't directly explain why the downloaded file size (or more likely: timing differences caused by file size) changes the behavior. The server doesn't change its keep-alive behavior on the size so I'd expect the process to either always terminate or always hang, but not depend on the downloaded size.

Thanks for the pointer. If I go straight for undici: { const fetch = require("undici") }

npm i [email protected]
node test.js http://cms6-sites.webhare.com/test-arnold/empty-40kb.docx  # works
node test.js http://cms6-sites.webhare.com/test-arnold/empty-160kb.docx. # works

npm i [email protected]
node test.js http://cms6-sites.webhare.com/test-arnold/empty-40kb.docx  # FAILS: process hangs
node test.js http://cms6-sites.webhare.com/test-arnold/empty-160kb.docx. # works

the change: https://github.com/nodejs/undici/security/advisories/GHSA-9f24-jqhm-jfcw

aha. backpressure is probably the keyword here, as that would be related to file sizes. The workaround is to consume the incoming body, which I indeed don't, but that being the workaround would suggest that I don't need to do that in 6.6.1.

if I consume the body I don't see the timeout anymore, but I didn't care about the body. As undici 6.16.0 still show the same behavior, I'd probably better file a bug there first if this behavior is indeed intentional.

thanks for your help and pointing me in the right direction

PS: the process does terminate after 8 seconds, but for my use case that's an eternity :-)

unilynx avatar May 09 '24 10:05 unilynx

I posted the issue to undici here - https://github.com/nodejs/undici/issues/3230 - but without them changing anything, it's now fixed in node 22 (And still broken in 20 and 21)

I've run a bisect and as far as I can tell this commit to node 22 fixed the issue: https://github.com/nodejs/node/commit/1abff073921bcb0631602032aef0135bccfaee0d

unilynx avatar Aug 20 '24 13:08 unilynx