node icon indicating copy to clipboard operation
node copied to clipboard

Do not drain worker tasks on main thread

Open dinfuehr opened this issue 2 years ago • 11 comments

Hi!

This PR stops draining worker tasks on the main thread. I believe it's not necessary since V8 will join its own worker tasks as part of v8::Isolate::Dispose. It may also cause deadlocks now that those worker tasks may ask the main thread to perform a GC, see the discussion in this V8 issue here. This should help enable concurrent sparkplug again, which will get/is disabled with this PR.

Do you know why Node is draining worker tasks here? Am I seeing right that NodePlatform::DrainTasks is also invoked from the event loop here? If so, this might be a problem for performance as we block the main thread until those worker tasks are finished when the event loop is empty.

I have to admit that I didn't run tests yet for that PR and it may require V8 patches as well (see the V8 issue linked above) but I opened the PR already to discuss this.

dinfuehr avatar Apr 06 '23 13:04 dinfuehr

Are https://chromium-review.googlesource.com/c/v8/v8/+/4403395 and https://chromium-review.googlesource.com/c/v8/v8/+/4405689 necessary for this to work?

targos avatar Apr 07 '23 07:04 targos

Yes, they are necessary for this.

dinfuehr avatar Apr 08 '23 19:04 dinfuehr

I think @addaleax would be the best person to answer the questions since she wrote some of the initial patches of the migration to v8 platform, but my 2 cents:

Am I seeing right that NodePlatform::DrainTasks is also invoked from the event loop here?

Yes.

Do you know why Node is draining worker tasks here?

This seems to date back to the initial patches of V8 platform integration, but the same question was asked in https://github.com/v8/node/pull/69/files#r187129604 too. From some local testing it seems if we don't drain the worker tasks here (specifically, in the event loop), some foreground task (e.g. a wasm async compile task, which probably comes from the cjs-module-lexer) could get posted after the event loop already finishes, and the general processing order can be messed up - which is why there are a lot of exit code 13 failures in the test here, that means we have some leftover microtasks not yet cleared from the queue during shutdown. We run the microtasks after running foreground tasks here https://github.com/nodejs/node/blob/284e6ac20cc5afe39a0ffeea90b8684f6803d232/src/node_platform.cc#L423 (in the InternalCallbackScope destructor) when the Node.js environment is still around e.g. when we are still running the event loop. (Personally I am a bit puzzled why we are using InternalCallbackScope for this because its destructor does..a lot of things, probably way more than what PerIsolatePlatformData::RunForegroundTask needs).

joyeecheung avatar Apr 08 '23 20:04 joyeecheung

It looks like there are linter and test failures, could you please reabase and address those?

aduh95 avatar May 12 '24 19:05 aduh95

This issue/PR was marked as stalled, it will be automatically closed in 30 days. If it should remain open, please leave a comment explaining why it should remain open.

github-actions[bot] avatar May 12 '24 19:05 github-actions[bot]

Closing this because it has stalled. Feel free to reopen if this issue/PR is still relevant, or to ping the collaborator who labelled it stalled if you have any questions.

github-actions[bot] avatar Jun 12 '24 00:06 github-actions[bot]

I checked out the PR, rebased it and fixed the linter issues but for some reason I'm unable to push it back:

$ git push [email protected]:dinfuehr/node di/do-not-drain-worker-tasks -f
ERROR: Permission to dinfuehr/node.git denied to targos.
fatal: Could not read from remote repository.

Please make sure you have the correct access rights
and the repository exists.

In any case, here's a copy on my fork: https://github.com/targos/node/tree/do-not-drain-worker-tasks

targos avatar Sep 12 '24 14:09 targos

@aduh95 Thanks! Now I'm curious: how did you do it?

targos avatar Sep 13 '24 09:09 targos

Codecov Report

All modified and coverable lines are covered by tests :white_check_mark:

Project coverage is 88.06%. Comparing base (e21984b) to head (ab23463). Report is 44 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main   #47452      +/-   ##
==========================================
+ Coverage   88.04%   88.06%   +0.01%     
==========================================
  Files         651      652       +1     
  Lines      183409   183543     +134     
  Branches    35828    35864      +36     
==========================================
+ Hits       161487   161635     +148     
+ Misses      15174    15160      -14     
  Partials     6748     6748              
Files with missing lines Coverage Δ
src/node_platform.cc 87.80% <100.00%> (-0.03%) :arrow_down:

... and 55 files with indirect coverage changes

codecov[bot] avatar Sep 13 '24 11:09 codecov[bot]

Now I'm curious: how did you do it?

I rebased using the GH "Update branch" API – I initially tried pushing your branch using git, and saw the same error as you.

aduh95 avatar Sep 13 '24 13:09 aduh95

This might be already known, but from my testing, it's worker_thread_task_runner_->BlockingDrain(); that is hanging. Regardless of how it's executed (main thread or not), it shouldn't hang, no?

More specifically, it's the while loop at

https://github.com/nodejs/node/blob/2545b9e02cbdb208d531d1a04149f8b35a82ca5c/src/node_platform.cc#L639-L641

avivkeller avatar Oct 15 '24 20:10 avivkeller

I think the tests here are failing because the user blocking tasks still need to be blocked in the main thread (or otherwise e.g. wasm compilation tasks might get dropped even though the user script needs them to complete). Opened https://github.com/nodejs/node/pull/58047 to try that idea.

joyeecheung avatar Apr 27 '25 01:04 joyeecheung