node icon indicating copy to clipboard operation
node copied to clipboard

fs: improve `cpSync` performance

Open anonrig opened this issue 1 year ago • 16 comments

By moving the function to C++, we should get a significant improvement in performance.

I'm extremely open to suggestions on improving the benchmarks, but with the current state, here are the results:

local benchmarks

macOS M1 Max

                           confidence improvement accuracy (*)    (**)   (***)
fs/bench-cpSync.js n=1            ***     63.75 %      ±10.38% ±13.83% ±18.03%
fs/bench-cpSync.js n=100          ***    607.18 %      ±46.07% ±62.03% ±82.22%
fs/bench-cpSync.js n=10000        ***    690.84 %      ±30.74% ±41.37% ±54.83%

Be aware that when doing many comparisons the risk of a false-positive result increases.
In this case, there are 3 comparisons, you can thus expect the following amount of false-positive results:
  0.15 false positives, when considering a   5% risk acceptance (*, **, ***),
  0.03 false positives, when considering a   1% risk acceptance (**, ***),
  0.00 false positives, when considering a 0.1% risk acceptance (***)

benchmark ci

19:04:42                            confidence improvement accuracy (*)    (**)   (***)
19:04:42 fs/bench-cpSync.js n=1            ***    154.37 %       ±6.67%  ±8.88% ±11.57%
19:04:42 fs/bench-cpSync.js n=100          ***    163.66 %       ±9.09% ±12.22% ±16.16%
19:04:42 fs/bench-cpSync.js n=10000        ***    103.21 %       ±0.96%  ±1.28%  ±1.67%

benchmark ci: https://ci.nodejs.org/view/Node.js%20benchmark/job/benchmark-node-micro-benchmarks/1572/

anonrig avatar Jun 22 '24 00:06 anonrig

cc @nodejs/performance @nodejs/fs @nodejs/cpp-reviewers

anonrig avatar Jun 25 '24 18:06 anonrig

cc @nodejs/tsc

this pull-request is now a semver-major pull-request due to the following changes, and will not be ported to previous versions if we don't make an exception:

  • For the test case of It throws error if symlink in src points to location in dest., we were throwing error code ERR_FS_CP_EINVAL. With std::filesystem it's not possible to get this edge case without a significant performance penalty. We are now throwing ERR_FS_CP_EEXIST which is also correct. (File exists)
  • For the test case of It throws error if symlink in dest points to location in src., we were throwing error code ERR_FS_CP_SYMLINK_TO_SUBDIRECTORY, which checks for similar behavior in previous bullet point but unfortunately it was returning a different error code. Now, we are returning ERR_FS_CP_EEXIST.
  • For the test case of It throws EEXIST error if attempt is made to copy symlink over file. we were throwing error EEXIST and right now we are returning ERR_FS_CP_EEXIST.
  • For the test case of It copies link if it does not point to folder in src., previously it succeeded. Right now, it throws an error. I think the new behavior is more correct. If there is agreement, we have to ditch this pull-request, or go with copying files one by one.

I appreciate if you could share your comments and thoughts on this.

anonrig avatar Jun 25 '24 20:06 anonrig

I'm okay with those changes as semver-major. I would be -1 on any sort of exception.

richardlau avatar Jun 25 '24 20:06 richardlau

I'm not sure, can't we re-map the thrown errors to the existing codes (e.g. in a catch, even if it's hard/degrates performance in C++ land (why?))?

Wouldn't that only hurt the performance in the error case which isn't the interesting one?

I'm wondering if our users for APIs like cpSync that are less performance sensitive (since they're blocking) don't care more breaking changes than a perf gain.

(Just to be extra explicit - I'm not blocking this PR)

benjamingr avatar Jun 25 '24 20:06 benjamingr

Wait actually reading the code I think I understand since std::filesystem doesn't distinguish between those cases.

benjamingr avatar Jun 25 '24 20:06 benjamingr

I'm not sure, can't we re-map the thrown errors to the existing codes (e.g. in a catch, even if it's hard/degrates performance in C++ land (why?))?

These specific errors can only be caught if we traverse the filesystem one by one, and make the necessary copy operations. Since we don't do it, there is no way to know if the EEXIST was thrown from a symlink in src, or a symlink in destination that looks into a path in destination folder. The test cases are same, but they throw different errors.

Wouldn't that only hurt the performance in the error case which isn't the interesting one?

We can catch it, but the problem is, we don't know which file is causing this error.

anonrig avatar Jun 25 '24 20:06 anonrig

Thanks, I'm +1 regarding the change and the breakage since I doubt people using cpSync are checking error codes this granularly. We may want to be prudent and trigger a CITGM run but I don't expect it'll find much.

(It should still be semver major though)

benjamingr avatar Jun 25 '24 20:06 benjamingr

Wait actually reading the code I think I understand since std::filesystem doesn't distinguish between those cases.

@benjamingr Actually, there is a std::filesystem::filesystem_error which I'm not sure how to catch it without try/catch. https://en.cppreference.com/w/cpp/filesystem/filesystem_error

anonrig avatar Jun 25 '24 20:06 anonrig

Wait actually reading the code I think I understand since std::filesystem doesn't distinguish between those cases.

@benjamingr Actually, there is a std::filesystem::filesystem_error which I'm not sure how to catch it without try/catch. https://en.cppreference.com/w/cpp/filesystem/filesystem_error

I love systems engineering. We can't use try/catch because we are passing -no-exceptions flag to GCC.

../../src/node_file.cc:2264:3: error: cannot use 'try' with exceptions disabled
  try {
  ^
1 error generated.

anonrig avatar Jun 25 '24 20:06 anonrig

That's for the overload that doesn't contain the error_code - and has the same info as the error_code so if it isn't there it doesn't distinguish between those cases. (Also CI should tell us that std::filesystem::copy gives these errors in a cross-platform friendly way)

benjamingr avatar Jun 25 '24 20:06 benjamingr

That's for the overload that doesn't contain the error_code - and has the same info as the error_code so if it isn't there it doesn't distinguish between those cases. (Also CI should tell us that std::filesystem::copy gives these errors in a cross-platform friendly way)

Even with that overloaded function, the following code does not compile due to disabled exception handling @benjamingr

  try {
    std::filesystem::copy(src_path, dest_path, options);
  } catch (const std::filesystem::filesystem_error& error) {
    if (error.code() == std::errc::file_exists) {
      std::string message = "File already exists";
      return THROW_ERR_FS_CP_EEXIST(env, message.c_str());
    }

     std::string message = "Unhandled error " +
                          std::to_string(error_code.value()) + ": " +
                          error_code.message();
    return THROW_ERR_FS_CP_EINVAL(env, message.c_str());
  }

anonrig avatar Jun 25 '24 20:06 anonrig

@anonrig you misunderstand me, sorry - I mean filesystem::copy has several overloads. The one you used in this PR already takes an error code reference here).

If you do that, it doesn't throw and returns the error code through the parameter instead. If you remove that parameter it will throw.

(There are valid important reasons why exceptions are disabled in the codebase, but feel free to disable that for testing)

benjamingr avatar Jun 25 '24 20:06 benjamingr

@anonrig you misunderstand me, sorry - I mean filesystem::copy has several overloads. The one you used in this PR already takes an error code reference here).

If you do that, it doesn't throw and returns the error code through the parameter instead. If you remove that parameter it will throw.

(There are valid important reasons why exceptions are disabled in the codebase, but feel free to disable that for testing)

Yes, unfortunately std::error_code does not have path() property, but std::filesystem::filesystem_error does, and with exception handling disabled, it's not possible to get those paths, and make this a non-semver major pull-request.

GGYry8sXEAATdHB

anonrig avatar Jun 25 '24 21:06 anonrig

LGTM with green CI. Since the implementation was adjusted, benchmark rerun would be appreciated.

@LiviaMedeiros benchmark ci: https://ci.nodejs.org/view/Node.js%20benchmark/job/benchmark-node-micro-benchmarks/1572/

anonrig avatar Jun 25 '24 23:06 anonrig

CI: https://ci.nodejs.org/job/node-test-pull-request/59968/

nodejs-github-bot avatar Jun 25 '24 23:06 nodejs-github-bot

I'm a little afraid of having a branch with two completely separate implementations of the feature depending on the options passed by the user. This LGTM if the end goal is to get rid of the JS one.

targos avatar Jun 26 '24 09:06 targos

CI: https://ci.nodejs.org/job/node-test-pull-request/60176/

nodejs-github-bot avatar Jul 08 '24 18:07 nodejs-github-bot