node module: resolve format for all situations with auto module detection on

triggered by #53015 solves: #53016

this should be a consistent fix to always resolve the module format correctly. Enabling module detection by default made a few other tests need some adjustments because in this case they don't generate errors anymore. e.g. test-esm-cjs-exports.js instead of error becasue a .mjs imports a .js with ESM syntax it now successfully imports it and generates the warning that this should be fixed to avoid the performance penalty.

Kindly please review and let me know what you think (if changes are necessary).

make test && make lint => green

Co-authored-by: @GeoffreyBooth

@nodejs/loaders

May 18 '24 12:05 dygabo

Review requested:

[ ] @nodejs/loaders

May 18 '24 12:05 nodejs-github-bot

if the solution is feasible and finds approval, the --experimental-detect-module could already be removed in this PR. wdyt?

May 18 '24 12:05 dygabo

If you set detect_module to true, do all the tests (except the expected failures) pass?

330984554-a8fbc807-8c45-408c-a62e-3f87cb6ee3b8

May 18 '24 13:05 avivkeller

Also, 🎉 Thank you for tackling this :-)

May 18 '24 13:05 avivkeller

Great work! Thanks for getting to the bottom of this.

Will check your comments and split it into two PRs, will happen maybe beginning of next week.

May 18 '24 14:05 dygabo

Great work! Thanks for getting to the bottom of this.

Will check your comments and split it into two PRs, will happen maybe beginning of next week.

Lovely!

May 18 '24 14:05 avivkeller

split done, this fixes the --experimental-detect-module edge cases, no unflagging of the option. The unflagging is prepared and I can submit a PR after this one lands

May 20 '24 16:05 dygabo

RSLGTM

(I'm not a core collaborator, so this review doesn't have any power)

Why would you rubber-stamp review a PR if you're not a core collaborator? What does that achieve? Is it just to send a notification to folks subscribed to the PR? If so, simply add a comment saying you're doing that. Otherwise, please abstain, it's annoying (either review the PR, and in that case you're welcome to add an approval, either you don't, and in that case don't bother commenting).

May 27 '24 13:05 aduh95

The C++ linter is failing

May 27 '24 13:05 aduh95

RSLGTM

(I'm not a core collaborator, so this review doesn't have any power)

Why would you rubber-stamp review a PR if you're not a core collaborator? What does that achieve? Is it just to send a notification to folks subscribed to the PR? If so, simply add a comment saying you're doing that. Otherwise, please abstain, it's annoying (either review the PR, and in that case you're welcome to add an approval, either you don't, and in that case don't bother commenting).

I'm sorry, I reviewed the code and it looked good to me. I didn't do any testing, so I rubber stamp reviewed. I'm sorry if that's annoying, I'll refrain from reviewing in the future.

May 27 '24 13:05 avivkeller

I reviewed the code and it looked good to me. I didn't do any testing, so I rubber stamp reviewed

Rubber-stamp means the reviewer didn't (or barely) look at the code – at least that's how I've seen folks use it in this repo. If you reviewed the code, it's not rubber-stampy. Anyway, as someone who is not a collaborator, I would agree that your review is useful only if you're making a suggestion, otherwise it's only adding noise IMO.

May 27 '24 14:05 aduh95

So in 2667a81 (#53044) I added the test I suggested in https://github.com/nodejs/node/pull/53044#discussion_r1643821826. I did some experiments with adding some print lines within the load hook, to see what I was getting from await nextLoad(url, context) for an ambiguous file that contains ESM syntax. Without --experimental-detect-module, it’s this:

{
  "format": "commonjs",
  "responseURL": "file:///Users/geoffrey/Sites/node/test/fixtures/es-modules/package-without-type/esm-with-check.js"
}

With --experimental-detect-module, it’s this:

{
  "format": "module",
  "responseURL": "file:///Users/geoffrey/Sites/node/test/fixtures/es-modules/package-without-type/esm-with-check.js",
  "source": {
    "type": "Buffer",
    "data": [105,109,112,111,114,116,32,123,32,118,101,114,115,105,111,110,32,125,32,102,114,111,109,32,39,110,111,100,101,58,112,114,111,99,101,115,115,39,59,10,99,111,110,115,111,108,101,46,108,111,103,40,116,104,105,115,32,61,61,61,32,117,110,100,101,102,105,110,101,100,32,63,32,39,109,111,100,117,108,101,39,32,58,32,39,99,111,109,109,111,110,106,115,39,41,59,10]
  }
}

The responseURL property is not really documented; it’s mentioned in an example in https://nodejs.org/api/module.html#loadurl-context-nextload but not described in the hook signature, which is its own issue. (If we’re going to make significant changes to the hooks per https://github.com/nodejs/loaders/issues/201, though, this won’t matter.) Besides that, though, we can see here the effect of detect-module: in the first example, this ambiguous file has no source defined because of the quirks mentioned in the docs about how sources are undefined for files handled by the CommonJS loader, and this file is treated as CommonJS because it’s a .js file in a typeless scope. In the second example, the source has been read from disk because we read it in order to do detection, and we detected the ESM syntax, hence the "format": "module". This all looks good.

My concern above about detection not running on the result of a user-defined custom load hook was misplaced, I think, because per the docs the format property is required to be returned from a the load hook. Therefore there’s no reason to run detection again: we already know the format. The test I added is a faux ESM-to-CommonJS transpiler, that replaces some ESM syntax with its CommonJS equivalent, and returns type: 'commonjs'; and this gets run as CommonJS per the return value, overriding what had been set earlier by detection. I think this is all correct.

Therefore I think this PR should be ready to land, unless I’m missing something. I opened https://github.com/nodejs/node/pull/53558 to land the unrelated change first; @aduh95 and @nodejs/loaders, can you please review these two PRs? I think once these land, --experimental-detect-module should be (almost) ready to unflag.

Jun 24 '24 05:06 GeoffreyBooth

CI: https://ci.nodejs.org/job/node-test-pull-request/59945/

Jun 24 '24 05:06 nodejs-github-bot

CI: https://ci.nodejs.org/job/node-test-pull-request/59953/

Jun 24 '24 17:06 nodejs-github-bot

CI: https://ci.nodejs.org/job/node-test-pull-request/59958/

Jun 25 '24 05:06 nodejs-github-bot

CI: https://ci.nodejs.org/job/node-test-pull-request/59959/

Jun 25 '24 05:06 nodejs-github-bot

IMO it doesn’t make sense to eagerly detect the format at resolve, the format should only be confirmed during loading, when the source becomes available. Until then it should be allowed to be undefined. For one it’s not guaranteed that the filename is actually readable from the fs at resolve time. Also for libraries like tsx that uses the defaultResolve to probe possible default entry points, doing the detection at resolve incurs unnecessary overhead on formats that are not recognizable by Node.js. That should be left to default load when the final source code is available.

Jun 25 '24 06:06 joyeecheung

Until then it should be allowed to be undefined

With current version it still is allowed and it might happen that it is undefined (if one of the cases occur where it cannot be determined during resolve, like the typescript case, or file not accessible). The more confusing part is IMO where like with the last test being added, it can be determined but the load changes that in the scenario that transforms ESM to cjs during a load hook execution. But the current version on this branch covers correctly more cases than before (probably most of the cases except the esm/cjs conversion or the other way around).

Jun 25 '24 09:06 dygabo

@joyeecheung I tried today to modfiy this part according to your suggestion to have resolve in detect-module mode not setting anything in format if source is undefined (i.e. not try to read from file contained in url). Generally speaking, eliminating the resolve time "hint" detection everywhere would be a breaking change and it is IMO out of scope for this PR.

With current implementation the hint quality is better than it was before because of the detectModuleType call. There are still cases where it cannot be determined, like mentioned before, but no currently existing tests are affeced by those.

One example: tests like this one are not passing because they are based on loaders like this that expect the format to be determined by resolve. I did not find a quick solution for this. Do you or @GeoffreyBooth, @aduh95 have an idea on how to solve it?

Jun 26 '24 12:06 dygabo

Generally speaking, eliminating the resolve time "hint" detection everywhere would be a breaking change and it is IMO out of scope for this PR.

Doesn't this PR "improves" the format detection in resolve by trying harder (even though reading from the file system at resolve phase might be the wrong way to try it), and hence make the breaking change necessary for correcting this design flaw even more breaking than it already is? Format detection in resolve should be a best effort (tentative based on file extensions and package.json info) IMO, and we shouldn't go out of our ways to detect the format by reading the file and even parsing it at resolve phase. That leads to unnecessary overhead to the overall design especially if user hooks point to a non-existent URL or a file with unrecognizable format at resolve phase (for better error stack traces) and only override the format detection at load.

Jun 26 '24 14:06 joyeecheung

Format detection in resolve should be a best effort

I agree with this and also that the file read has an impact on the running time. But I am slightly out of ideas on how to solve the one failing test that I mentioned. I will give it one more try, see what solution I can find. Any suggestions on how to solve it are appreciated.

Jun 26 '24 14:06 dygabo

Is the failing test something like https://github.com/dygabo/node/blob/fix-for-unflagging-module-format-detection/test/fixtures/es-module-loaders/builtin-named-exports-loader.mjs#L31 ? In that case I think it's the test that's making wrong assumptions - the loader hook should not count on the context.format being ready - it could count on the result of nextLoad() containing a valid format, but not in the context that's provided before nextLoad is called. The doc also says that context.format can be undefined in load too.

Jun 26 '24 14:06 joyeecheung

In terms of format, I think there are actually two concepts:

Assumed format
Confirmed format

e.g. if it's a .js file not bound by a type field in package.json, it's assumed format is "commonjs" while confirmed format is "undefined" after resolve. Internally, the format that gets passed around is the confirmed format and is also what gets surfaced to user hooks. The problem is that this PR is going along with the direction that mixes the two into one in the load hook, then it's going to lead to detection with too many assumptions.

Jun 26 '24 14:06 joyeecheung

e.g. if it’s a .js file not bound by a type field in package.json, it’s assumed format is “commonjs” while confirmed format is “undefined” after resolve.

I’ve been referring to “a .js file not bound by a type field in package.json” as an “ambiguous” file. I think what we have now is that resolve sometimes returns a format hint of commonjs for ambiguous files, whereas with detection enabled it should probably be returning no hint at all; and then we determine the format during load. I don’t think certain cases changing from resolve returning format: 'commonjs' to format: undefined should be considered a breaking change, because per our docs format is optional and hooks shouldn’t rely on it being present or set to a particular value. Do others agree with that? (And the hooks are still experimental, after all.)

Because it does feel like the right approach here is to update the test so that it no longer expects format: 'commonjs' for ambiguous files, and we do the detection during load rather than potentially reading the source twice or reading it too early.

Jun 26 '24 22:06 GeoffreyBooth

Thinking about this a bit more over the past hours, I understand the concerns and am looking into alternative solution to propose.

"Module type detection" has somehow performance penalty in the title. Because it's an additional detection that needs to be done, that must cost something, otherwise I would not need it in the first place. We also have a warning in place that tells the user: ${url} parsed as an ES module because module syntax was detected; to avoid the performance penalty of syntax detection, add "type": "module" to ${pjsonPath}
additionally, I would expect projects that make such transformations during module loading that will change the module type detected during resolve tend to care less about the loading time, otherwise most of these could be done as a build/packaging step (but that is just an assumption, I'm sure that there are project that might choose that for convenience)
if we end up having less reliable "module type detection" during resolve with "module type detection" turned on than we had without "module type detection", seems to me as somehow missed goal of the feature.
not being allowed to look into the file even if we have the file URL reduces the real content of this PR to almost zero. The goal was to find a way to minimize the number of failed tests when detect_module is true. If we have to change the tests, it means the enablement of the flag is a breaking change. I am not in the position to implement breaking changes because I have no power of defending these decisions. I think this is breaking as soon as we have to modify the tests, independent of what is written in the documentation. Because probably examples from the node.js tests are copied just as much as official documentation.

I hope this makes sense. Having said that, I will continue to look for alternatives but this might take some time.

Jun 27 '24 04:06 dygabo

Talking to @dygabo and @joyeecheung, I think we need to explore a solution that doesn’t involve reading from disk during the resolve step. I’m going to mark this as draft for now so that it doesn’t get merged in the meantime.

It could be that we simply change the hinted format returned by resolve when detection is enabled: for ambiguous files, where detection will eventually determine the format, instead of resolve returning commonjs we return undefined, and this is just a change (possibly breaking, depending on your point of view). If enabling detection in general is considered semver-major, because things that used to error would no longer do so, then we can add this to the list of semver-major changes that happen as part of unflagging detection. I think in a detection-enabled world, we would want a hint of commonjs only for unambiguous CommonJS—.cjs extension or "type": "commonjs"—just as I think the module hint is currently only for unambiguous ESM. It would feel wrong otherwise.

I think the next step is to open a PR that unflags detection and updates all the tests accordingly, including making this format change. In that PR we can analyze the tests that needed changing and why, to gauge the scale of the breaking changes. We can mark it as don’t-backport so that it’s released in 23.0.0 only, and after we see the reaction to Node 23 we might consider backporting it if we ultimately decide that the changes aren’t truly semver-major and that the changes to experimental features aren’t too drastic.

Jun 28 '24 00:06 GeoffreyBooth

superseded and solved by https://github.com/nodejs/node/pull/53619.

Aug 01 '24 09:08 dygabo