node-pty icon indicating copy to clipboard operation
node-pty copied to clipboard

ptys unexpectedly interrupted and exit early in GitHub Actions, Linux with io_uring enabled

Open lydell opened this issue 2 years ago • 8 comments

Environment details

  • OS: ubuntu-latest on GitHub Actions
  • node-pty version: 1.0.0, 1.1.0-beta4
  • Node.js version: 20

Summary of the below updates: If Node.js has io_uring enabled, the bugs occurs. Update: It works with Node.js 18! https://github.com/lydell/node-pty-bug/actions/runs/6283383630/job/17063792075 And 19: https://github.com/lydell/node-pty-bug/actions/runs/6283390976/job/17063807172 And 20.0.0: https://github.com/lydell/node-pty-bug/actions/runs/6283396940/job/17063819451 So it must have been introduced in some 20.x release … the search continues. Aha! It’s Node.js 20.3.0 that introduced it: https://github.com/lydell/node-pty-bug/actions/runs/6283409434 Switched the child process from node to bash and it got a bit flaky: https://github.com/lydell/node-pty-bug/actions/runs/6283443471/attempts/1 vs https://github.com/lydell/node-pty-bug/actions/runs/6283443471/attempts/2 Update: Can happen on Node.js 18.18.0 too: https://github.com/lydell/node-pty-bug/actions/runs/6415765977

Issue description

When running ptys in parallel, some of them unexpectedly receive “signal 1” (SIGHUP?) and exit early.

This happens in ubuntu-latest on GitHub Actions. But not in macOS-latest. And not my own Ubuntu machine (and not on my own macOS machine either).

I made a minimal repo to show it happening: https://github.com/lydell/node-pty-bug

Example failing output (ubuntu-latest): https://github.com/lydell/node-pty-bug/actions/runs/6283275300/job/17063579148

Example successful output (macOS-latest): https://github.com/lydell/node-pty-bug/actions/runs/6283275300/job/17063579184

This is so strange! I wonder what’s special about Linux on GitHub Actions that makes this happen. Any ideas?

lydell avatar Sep 23 '23 11:09 lydell

node-pty is only really tested currently with node up to v18 as it's what vscode ships as part of its Electron builds.

Tyriar avatar Sep 25 '23 16:09 Tyriar

This bug has also been significantly impacting us on cocalc.com, and I made an issue in our repo about it here with some more details, including some strace output: https://github.com/sagemathinc/cocalc/issues/6963

In particular, node-pty has this bug on Node.js 18.18.0 (released about 2 weeks ago), but not with Node 18.17.1 (the previous version). My guess is you'll hit this with vscode in December (?), when they might upgrade to Node v18.18. Right now electron is on v18.16, I think.

williamstein avatar Oct 05 '23 00:10 williamstein

I can confirm it happens on 18.18.0: https://github.com/lydell/node-pty-bug/actions/runs/6415765977

(As you can see, that was on attempt 2 of that run – first time it didn’t happen. Seems to not be 100 % reproducible. Also note how 20.8.0 succeeded there, but fails here: https://github.com/lydell/node-pty-bug/actions/runs/6415782042/job/17418282266.)

lydell avatar Oct 05 '23 07:10 lydell

Thanks for testing on 18.18.0.

Seems to not be 100 % reproducible.

For me at least this problem really is 100% reproducible via the exact same steps. What it takes to reproduce it is extremely weird, since it involves doing something once, then restarting a server, then doing the exact same thing again and then it fails 100% after that. It should be the other way around, where restarting the server should clear any state and make the problem go away, so I'm extremely puzzled by this bug.

I also have no idea if this is a bug in nodejs or in node-pty. It seems more likely to be a nodejs bug, but I didn't find anything clearly relevant when searching their issue tracker. I tried reverting to older versions of node-pty and they also exhibit this behavior. The changelog from 18.17.x --> 18.18.0 includes many updates of dependencies, and also three changes to child_process. This change looks potentially relevant:

https://github.com/nodejs/node/commit/b5df084e1e

They rewrote bits of code specifically related to how they handle subprocess termination in these changes, and when I reproduce this bug, I basically see:

  • terminate the subprocess by hitting ctrl+d to exit,
  • suddenly cpu spikes to 100% as io_uring_enter, etc., get called in an infinite loop (according to strace)

So I bet one of those child_process commits related to subrocesses exiting introduced a bug in node.js. ~~They are all Javascript code (not C code), so it might be easy to manually revert them and see if the bug goes away.~~

williamstein avatar Oct 05 '23 11:10 williamstein

Please set export UV_USE_IO_URING=0 as workaround and waiting for a fix (from upsteam maybe).

  • https://github.com/nodejs/node/issues/49911

I'm debugging for this, yeah it's 100% reproducible, the core path is: enable io_uring in libuv, create pty1, create pty2, destory pty1. The order is important, destory by stack order will not hit this bug.

kkocdko avatar Dec 07 '23 16:12 kkocdko

Thanks for finding that!

UV_USE_IO_URING=1 ➡️ 🚨 failure (https://github.com/lydell/node-pty-bug/actions/runs/7132159128/job/19422149709) (Node.js 20.10.0)

UV_USE_IO_URING=0 ➡️ ✅ expected output (https://github.com/lydell/node-pty-bug/actions/runs/7132168279/job/19422177649) (Node.js 20.10.0)

lydell avatar Dec 07 '23 18:12 lydell

Note for people trying to reproduce: Node.js 20.11.1 and 21.6.2 disabled io_uring by default due to a security issue. So right now, the latest Node.js versions don’t have the bug by default, but it can be enabled by setting UV_USE_IO_URING=1.

lydell avatar Mar 10 '24 12:03 lydell

Is there anyone who has managed to make a shorter reproduction than mine, that also reproduces outside of GitHub Actions? I think having such a reproduction here would increase the chances of fixing this.

lydell avatar Mar 10 '24 12:03 lydell