
8370947: Mitigate Neoverse-N1 erratum 1542419 negative impact on GCs and JIT performance


Arm Neoverse N1 erratum 1542419: "The core might fetch a stale instruction from memory which violates the ordering of instruction fetches". It is fixed in Neoverse N1 r4p1.

Affected Neoverse-N1 implementations mitigate erratum 1542419 with the following workaround:

  • Disable coherent icache.
  • Trap IC IVAU instructions.
  • Execute, in the trap handler:
    • tlbi vae3is, xzr
    • dsb sy

tlbi vae3is, xzr invalidates translations for all address spaces (it is global for the address). It is not considered complete until all memory accesses using the in-scope old translation information have completed.

As this workaround has significant overhead, Arm Neoverse N1 (MP050) Software Developer Errata Notice version 29.0 suggests:

"Since one TLB inner-shareable invalidation is enough to avoid this erratum, the number of injected TLB invalidations should be minimized in the trap handler to mitigate the performance impact due to this workaround."

This PR introduces a mechanism to defer instruction cache (ICache) invalidation on AArch64 to address Arm Neoverse N1 erratum 1542419, whose workaround incurs significant performance overhead when ICache invalidation is performed too frequently. The implementation includes detection of affected Neoverse N1 CPUs and automatic enabling of the mitigation for the affected Neoverse N1 revisions.

Changes include:

  • Added a new diagnostic JVM flag NeoverseN1Errata1542419 to enable or disable the workaround for the erratum. The flag is automatically enabled for Neoverse N1 CPUs prior to r4p1, as detected during VM initialization.
  • Introduced the ICacheInvalidationContext class to manage deferred ICache invalidation, with platform-specific logic for AArch64. This context is used to batch ICache invalidations, reducing the performance impact. As the exact address passed to the ICache invalidation is not relevant, we use the nmethod's code start address. (A sketch of the intended usage pattern follows this list.)
  • Provided a default (no-op) implementation for ICacheInvalidationContext on platforms where the workaround is not needed, ensuring portability and minimal impact on other architectures.
  • Modified barrier patching and relocation logic (ZBarrierSetAssembler, ZNMethod, RelocIterator, and related code) to accept a defer_icache_invalidation parameter, allowing ICache invalidation to be deferred and later performed in bulk.
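
For illustration, a minimal sketch of the deferral pattern described above. This is not the code from this PR: the constructor signature, the use of the flag as a C++ global, and the call to ICache::invalidate_word are assumptions based on the description in this section.

// Sketch only: a scoped object that batches ICache maintenance.
// While the scope is active, per-patch invalidation is skipped; a single
// invalidation is issued when the scope ends.
class ICacheInvalidationContext : public StackObj {
private:
  address _addr;      // any address inside the patched nmethod; the exact
                      // address is irrelevant, as the erratum workaround turns
                      // the maintenance into one inner-shareable TLB invalidation
  bool    _deferred;  // only defer when the erratum workaround is enabled

public:
  explicit ICacheInvalidationContext(nmethod* nm)
    : _addr(nm->code_begin()),
      _deferred(NeoverseN1Errata1542419) {}  // the diagnostic flag added by this PR

  ~ICacheInvalidationContext() {
    if (_deferred) {
      // One batched maintenance operation per nmethod instead of one per
      // patched barrier or oop relocation.
      ICache::invalidate_word(_addr);
    }
  }
};

// Hypothetical usage:
//   ICacheInvalidationContext icic(nm);
//   ... patch barriers / fix oop relocations with invalidation deferred ...
//   // a single ICache invalidation happens when icic goes out of scope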

Testing results: Linux fastdebug build

  • Neoverse-N1 (Graviton 2)
    • [x] tier1: passed
    • [x] tier2: passed
    • [x] tier3: passed
    • [x] tier4: 3 failures
      • containers/docker/TestJcmdWithSideCar.java: JDK-8341518
      • com/sun/nio/sctp/SctpChannel/CloseDescriptors.java: JDK-8298466
      • java/awt/print/PrinterJob/PrintNullString.java: JDK-8333026
  • Neoverse-N1 (Graviton 2), -XX:-NeoverseN1Errata1542419
    • [x] tier1: passed
    • [x] tier2: passed
    • [x] tier3: passed
    • [x] tier4: 3 failures
      • containers/docker/TestJcmdWithSideCar.java: JDK-8341518
      • com/sun/nio/sctp/SctpChannel/CloseDescriptors.java: JDK-8298466
      • java/awt/print/PrinterJob/PrintNullString.java: JDK-8333026
  • x86_64
    • [x] tier1: passed
    • [x] tier2: passed
    • [x] tier3: passed
    • [x] tier4: 2 failures
      • com/sun/nio/sctp/SctpChannel/CloseDescriptors.java: JDK-8298466
      • java/awt/print/PrinterJob/PrintNullString.java: JDK-8333026

Benchmarking results: Neoverse-N1 r3p1 (Graviton 2), 5000 nmethods, 1 parallel thread, 1 concurrent thread, 3 forks, ms/op (lower is better)

  • ZGC
| Benchmark | accessedFieldCount | Baseline | Error (Baseline) | Fix     | Error (Fix) | Improvement |
|-----------|--------------------|----------|------------------|---------|-------------|-------------|
| fullGC    | 0                  | 68.02    | 35.221           | 59.448  | 28.674      | 12.6%       |
| fullGC    | 2                  | 648.395  | 26.417           | 142.566 | 90.82       | 78.0%       |
| fullGC    | 4                  | 1209.539 | 163.909          | 181.565 | 17.419      | 85.0%       |
| fullGC    | 8                  | 2313.042 | 40.243           | 258.854 | 14.707      | 88.8%       |
| systemGC  | 0                  | 69.862   | 16.74            | 63.433  | 12.976      | 9.2%        |
| systemGC  | 2                  | 653.807  | 36.922           | 142.675 | 13.086      | 78.2%       |
| systemGC  | 4                  | 1212.661 | 126.471          | 183.992 | 16.438      | 84.8%       |
| systemGC  | 8                  | 2305.342 | 377.801          | 262.416 | 61.388      | 88.6%       |
| youngGC   | 0                  | 7.89     | 4.68             | 4.881   | 0.793       | 38.1%       |
| youngGC   | 2                  | 158.655  | 3.468            | 17.675  | 2.377       | 88.9%       |
| youngGC   | 4                  | 314.051  | 89.393           | 25.555  | 7.368       | 91.9%       |
| youngGC   | 8                  | 614.067  | 205.207          | 38.715  | 3.085       | 93.7%       |
  • G1GC
| Benchmark | accessedFieldCount | Baseline | Error (Baseline) | Fix    | Error (Fix) | Improvement |
|-----------|--------------------|----------|------------------|--------|-------------|-------------|
| fullGC    | 0                  | 34.526   | 3.841            | 33.561 | 4.284       | 2.8%        |
| fullGC    | 2                  | 132.489  | 7.516            | 60.922 | 3.285       | 54.0%       |
| fullGC    | 4                  | 215.601  | 16.953           | 69.777 | 15.359      | 67.6%       |
| fullGC    | 8                  | 392.611  | 106.065          | 95.151 | 1.118       | 75.8%       |
| systemGC  | 0                  | 35.906   | 8.445            | 35.455 | 10.434      | 1.3%        |
| systemGC  | 2                  | 129.554  | 18.165           | 56.986 | 18.609      | 56.0%       |
| systemGC  | 4                  | 220.130  | 29.353           | 71.640 | 5.775       | 67.5%       |
| systemGC  | 8                  | 396.455  | 111.161          | 97.731 | 2.349       | 75.3%       |
| youngGC   | 0                  | 0.642    | 0.257            | 0.685  | 0.290       | -6.7%       |
| youngGC   | 2                  | 0.835    | 0.614            | 0.753  | 0.457       | 9.8%        |
| youngGC   | 4                  | 0.870    | 0.196            | 0.767  | 0.445       | 11.8%       |
| youngGC   | 8                  | 0.907    | 0.353            | 0.786  | 0.558       | 13.3%       |
  • ParallelGC
| Benchmark | accessedFieldCount | Baseline | Error (Baseline) | Fix    | Error (Fix) | Improvement |
|-----------|--------------------|----------|------------------|--------|-------------|-------------|
| fullGC    | 0                  | 24.158   | 5.731            | 23.700 | 2.172       | 1.9%        |
| fullGC    | 2                  | 116.949  | 18.407           | 45.582 | 2.400       | 61.0%       |
| fullGC    | 4                  | 205.175  | 24.970           | 59.216 | 3.524       | 71.1%       |
| fullGC    | 8                  | 380.122  | 51.809           | 83.765 | 7.697       | 78.0%       |
| systemGC  | 0                  | 24.518   | 3.683            | 24.100 | 4.200       | 1.7%        |
| systemGC  | 2                  | 118.792  | 31.924           | 46.404 | 1.894       | 60.9%       |
| systemGC  | 4                  | 205.749  | 52.843           | 59.857 | 8.125       | 70.9%       |
| systemGC  | 8                  | 382.396  | 97.584           | 84.494 | 3.228       | 77.9%       |
| youngGC   | 0                  | 0.325    | 0.153            | 0.329  | 0.039       | -1.2%       |
| youngGC   | 2                  | 0.407    | 0.071            | 0.405  | 0.027       | 0.5%        |
| youngGC   | 4                  | 0.405    | 0.156            | 0.398  | 0.030       | 1.7%        |
| youngGC   | 8                  | 0.451    | 0.095            | 0.405  | 0.046       | 10.2%       |
  • GenShenGC
| Benchmark | accessedFieldCount | Baseline | Error (Baseline) | Fix    | Error (Fix) | Improvement |
|-----------|--------------------|----------|------------------|--------|-------------|-------------|
| fullGC    | 0                  | 32.799   | 1.287            | 32.638 | 16.186      | 0.5%        |
| fullGC    | 2                  | 130.388  | 12.179           | 59.565 | 27.384      | 54.3%       |
| fullGC    | 4                  | 221.452  | 12.654           | 71.501 | 3.271       | 67.7%       |
| fullGC    | 8                  | 397.024  | 93.899           | 96.542 | 3.335       | 75.7%       |
| systemGC  | 0                  | 34.699   | 16.677           | 33.430 | 1.115       | 3.7%        |
| systemGC  | 2                  | 131.343  | 30.426           | 59.200 | 4.173       | 54.9%       |
| systemGC  | 4                  | 220.197  | 48.079           | 72.529 | 1.808       | 67.1%       |
| systemGC  | 8                  | 397.017  | 71.641           | 98.212 | 3.428       | 75.3%       |
| youngGC   | 0                  | 35.788   | 18.645           | 33.368 | 2.175       | 6.8%        |
| youngGC   | 2                  | 133.873  | 30.916           | 59.600 | 1.066       | 55.5%       |
| youngGC   | 4                  | 220.862  | 45.541           | 74.423 | 53.835      | 66.3%       |
| youngGC   | 8                  | 403.135  | 153.634          | 98.080 | 5.118       | 75.7%       |

Progress

  • [x] Change must not contain extraneous whitespace
  • [x] Commit message must refer to an issue
  • [ ] Change must be properly reviewed (2 reviews required, with at least 2 Reviewers)

Issue

  • JDK-8370947: Mitigate Neoverse-N1 erratum 1542419 negative impact on GCs and JIT performance (Enhancement - P2)

Reviewers

Contributors

Reviewing

Using git

Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk.git pull/28328/head:pull/28328
$ git checkout pull/28328

Update a local copy of the PR:
$ git checkout pull/28328
$ git pull https://git.openjdk.org/jdk.git pull/28328/head

Using Skara CLI tools

Checkout this PR locally:
$ git pr checkout 28328

View PR using the GUI difftool:
$ git pr show -t 28328

Using diff file

Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/28328.diff

Using Webrev

Link to Webrev Comment

eastig avatar Nov 14 '25 18:11 eastig

Based on https://github.com/openjdk/jdk/compare/master...xmas92:jdk:deferred_icache_invalidation

eastig avatar Nov 14 '25 18:11 eastig

:wave: Welcome back eastigeevich! A progress list of the required criteria for merging this PR into master will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.

bridgekeeper[bot] avatar Nov 14 '25 18:11 bridgekeeper[bot]

❗ This change is not yet ready to be integrated. See the Progress checklist in the description for automated requirements.

openjdk[bot] avatar Nov 14 '25 18:11 openjdk[bot]

/contributor add xmas92

eastig avatar Nov 14 '25 18:11 eastig

@eastig xmas92 was not found in the census.

Syntax: /contributor (add|remove) [@user | openjdk-user | Full Name <email@address>]. For example:

  • /contributor add @openjdk-bot
  • /contributor add duke
  • /contributor add J. Duke <[email protected]>

User names can only be used for users in the census associated with this repository. For other contributors you need to supply the full name and email address.

openjdk[bot] avatar Nov 14 '25 18:11 openjdk[bot]

@eastig The following label will be automatically applied to this pull request:

  • hotspot

When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing list. If you would like to change these labels, use the /label pull request command.

openjdk[bot] avatar Nov 14 '25 18:11 openjdk[bot]

/contributor add aboldtch

eastig avatar Nov 14 '25 18:11 eastig

@eastig Contributor Axel Boldt-Christmas <[email protected]> successfully added.

openjdk[bot] avatar Nov 14 '25 18:11 openjdk[bot]

Hi @fisk @theRealAph @xmas92 @shipilev

I created this draft PR based on @xmas92's work: https://github.com/openjdk/jdk/compare/master...xmas92:jdk:deferred_icache_invalidation

Alex wrote about his implementation in JDK-8370947:

The implementation I linked is very aarch64 centric. I would like to create a bit nicer abstraction for this to allow easier adaption for other platforms.

I see his changes touch other backends. I tried to minimize changes and avoid touching other backends. I don't think the concept of deferred icache invalidation will be used for anything but the Neoverse-N1 erratum.

This PR does not cover all cases in ZGC at the moment. They can be covered as soon as we agree on the proper way to fix this.

I'd like to hear your opinion on which way we should choose:

  • Abstraction of deferred icache invalidation supported in all backends.
  • Concrete implementation focused on Neoverse-N1.

eastig avatar Nov 14 '25 18:11 eastig

@eastig this pull request can not be integrated into master due to one or more merge conflicts. To resolve these merge conflicts and update this pull request you can run the following commands in the local repository for your personal fork:

git checkout JDK-8370947
git fetch https://git.openjdk.org/jdk.git master
git merge FETCH_HEAD
# resolve conflicts and follow the instructions given by git merge
git commit -m "Merge master"
git push

openjdk[bot] avatar Nov 20 '25 14:11 openjdk[bot]

/issue JDK-8370947

eastig avatar Nov 20 '25 14:11 eastig

@eastig This issue is referenced in the PR title - it will now be updated.

openjdk[bot] avatar Nov 20 '25 14:11 openjdk[bot]

@fisk, @xmas92 I added a JMH microbenchmark and its results from Graviton 2. I added deferred icache invalidation to ZGC where it's needed.

I found one place where I am not sure: https://github.com/openjdk/jdk/blob/b9ee9541cffb6c5a737b08a69ae04472b3bcab3e/src/hotspot/share/gc/z/zHeapIterator.cpp#L364

class ZHeapIteratorNMethodClosure : public NMethodClosure {
private:
  OopClosure* const        _cl;
  BarrierSetNMethod* const _bs_nm;

public:
  ZHeapIteratorNMethodClosure(OopClosure* cl)
    : _cl(cl),
      _bs_nm(BarrierSet::barrier_set()->barrier_set_nmethod()) {}

  virtual void do_nmethod(nmethod* nm) {
    // If ClassUnloading is turned off, all nmethods are considered strong,
    // not only those on the call stacks. The heap iteration might happen
    // before the concurrent processing of the code cache, make sure that
    // all nmethods have been processed before visiting the oops.
    _bs_nm->nmethod_entry_barrier(nm);

    ZNMethod::nmethod_oops_do(nm, _cl);
  }
};

For _bs_nm->nmethod_entry_barrier(nm) we use deferred icache invalidation. I'm not sure it is safe for ZNMethod::nmethod_oops_do(nm, _cl). It looks like it is safe because the code is called at a safepoint: ZHeap::object_iterate, ZHeap::object_and_field_iterate_for_verify and ZHeap::parallel_object_iterator.
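
For reference, the placement in question would look roughly like this (a sketch, not a diff from this PR):

  virtual void do_nmethod(nmethod* nm) {
    // Deferral scope covering both the entry barrier (which may patch the
    // nmethod) and the oop iteration; the single ICache invalidation then
    // happens when the scope ends.
    ICacheInvalidationContext icic(nm);

    _bs_nm->nmethod_entry_barrier(nm);
    ZNMethod::nmethod_oops_do(nm, _cl);
  }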

eastig avatar Nov 20 '25 15:11 eastig

@xmas92 The total number of required reviews for this PR (including the jcheck configuration and the last /reviewers command) is now set to 2 (with at least 2 Reviewers).

openjdk[bot] avatar Nov 20 '25 15:11 openjdk[bot]

@fisk, @xmas92 The added microbenchmark shows interesting regressions when an nmethod has no accesses to object's fields:

Benchmark                            Score     Error  Units
GCPatchingNmethodCost.fullGC:base    73.937 ± 17.764  ms/op
GCPatchingNmethodCost.systemGC:base  77.495 ± 11.963  ms/op
GCPatchingNmethodCost.youngGC:base    9.955 ±  1.649  ms/op
GCPatchingNmethodCost.fullGC:fix     88.865 ± 19.299  ms/op  +20.1%
GCPatchingNmethodCost.systemGC:fix   90.572 ± 14.750  ms/op  +16.9%
GCPatchingNmethodCost.youngGC:fix    10.219 ±  0.877  ms/op   +2.7%

eastig avatar Nov 20 '25 15:11 eastig

I think we'll also want a workaround for CodeBuffer::relocate_code_to().

theRealAph avatar Nov 20 '25 16:11 theRealAph

I think we'll also want a workaround for CodeBuffer::relocate_code_to().

Also we need to fix G1NMethodClosure::do_evacuation_and_fixup and ShenandoahNMethod::oops_do. They use nmethod::fix_oop_relocations.

Should we do it in this PR or in separate PRs?

eastig avatar Nov 20 '25 17:11 eastig

I think we'll also want a workaround for CodeBuffer::relocate_code_to().

Also we need to fix G1NMethodClosure::do_evacuation_and_fixup and ShenandoahNMethod::oops_do. They use nmethod::fix_oop_relocations.

Should we do it in this PR or in separate PRs?

Please, let's handle it all here.

theRealAph avatar Nov 20 '25 19:11 theRealAph

It seems a little disruptive to have to pass defer_icache_invalidation around so much. What about attaching this information to the Thread or using a THREAD_LOCAL?

dean-long avatar Nov 20 '25 21:11 dean-long

@xmas92

The added microbenchmark shows interesting regressions when an nmethod has no accesses to object's fields:

Benchmark                            Score     Error  Units
GCPatchingNmethodCost.fullGC:base    73.937 ± 17.764  ms/op
GCPatchingNmethodCost.systemGC:base  77.495 ± 11.963  ms/op
GCPatchingNmethodCost.youngGC:base    9.955 ±  1.649  ms/op
GCPatchingNmethodCost.fullGC:fix     88.865 ± 19.299  ms/op  +20.1%
GCPatchingNmethodCost.systemGC:fix   90.572 ± 14.750  ms/op  +16.9%
GCPatchingNmethodCost.youngGC:fix    10.219 ±  0.877  ms/op   +2.7%

I think I might have an idea what causes the regressions. I'll be debugging it.

eastig avatar Nov 21 '25 13:11 eastig

It seems a little disruptive to have to pass defer_icache_invalidation around so much. What about attaching this information to the Thread or using a THREAD_LOCAL?

I switched to a THREAD_LOCAL. Initially it regressed fullGC compared to the version with the parameter:

  • Parameter:
Benchmark                       (accessedFieldCount)  (methodCount)  Mode  Cnt    Score    Error  Units
GCPatchingNmethodCost.fullGC                       0           5000  avgt    3   88.865 ± 19.299  ms/op
GCPatchingNmethodCost.fullGC                       2           5000  avgt    3  146.184 ± 11.531  ms/op
GCPatchingNmethodCost.fullGC                       4           5000  avgt    3  186.429 ± 16.257  ms/op
GCPatchingNmethodCost.fullGC                       8           5000  avgt    3  262.933 ± 13.071  ms/op
  • THREAD_LOCAL
Benchmark                       (accessedFieldCount)  (methodCount)  Mode  Cnt    Score     Error  Units
GCPatchingNmethodCost.fullGC                       0           5000  avgt    3   93.899 ±  14.870  ms/op
GCPatchingNmethodCost.fullGC                       2           5000  avgt    3  152.872 ±  13.566  ms/op
GCPatchingNmethodCost.fullGC                       4           5000  avgt    3  194.425 ±  37.851  ms/op
GCPatchingNmethodCost.fullGC                       8           5000  avgt    3  271.826 ±  47.908  ms/op

I found that ZBarrierSetAssembler::patch_barrier_relocation is only used when icache invalidation is deferred. I replaced a check of the thread local value with a check of NeoverseN1Errata1542419. This restored the performance:

Benchmark                       (accessedFieldCount)  (methodCount)  Mode  Cnt    Score     Error  Units
GCPatchingNmethodCost.fullGC                       0           5000  avgt    3   84.919 ±  31.411  ms/op
GCPatchingNmethodCost.fullGC                       2           5000  avgt    3  141.862 ±   7.026  ms/op
GCPatchingNmethodCost.fullGC                       4           5000  avgt    3  184.921 ±  46.592  ms/op
GCPatchingNmethodCost.fullGC                       8           5000  avgt    3  263.897 ± 48.271  ms/op

It might be that accesses to THREAD_LOCAL on Neoverse N1 are expensive.
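
For context, the THREAD_LOCAL variant looks roughly like this (a sketch with assumed names, not the exact patch):

// Sketch only: a per-thread flag set by the context instead of threading a
// defer_icache_invalidation parameter through every call.
static THREAD_LOCAL bool _defer_icache_invalidation = false;

ICacheInvalidationContext::ICacheInvalidationContext(nmethod* nm)
  : _addr(nm->code_begin()) {
  _defer_icache_invalidation = NeoverseN1Errata1542419;
}

ICacheInvalidationContext::~ICacheInvalidationContext() {
  if (_defer_icache_invalidation) {
    _defer_icache_invalidation = false;
    ICache::invalidate_word(_addr);
  }
}

// Patching code then consults the per-thread flag, e.g.:
//   if (!_defer_icache_invalidation) {
//     ICache::invalidate_range(addr, size);
//   }
// (In ZBarrierSetAssembler::patch_barrier_relocation that check is replaced by
// a check of NeoverseN1Errata1542419, as described above.)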

Should I try attaching info to Thread?

eastig avatar Nov 21 '25 22:11 eastig

Should I try attaching info to Thread?

It probably won't help, because Thread::current() is going to access a thread-local.

dean-long avatar Nov 22 '25 00:11 dean-long

If I understand correctly, the whole icache is flushed, so the actual nmethod* is irrelevant. So instead of ICacheInvalidationContext icic(nm) for every different "nm", can't we just do ICacheInvalidationContext icic(true) one time, outside the nmethod loop?
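
Roughly something like this (hypothetical bool constructor; the loop and helper are placeholders):

// Hypothetical hoisted form: one deferral scope for the whole pass instead of
// one ICacheInvalidationContext icic(nm) per nmethod.
{
  ICacheInvalidationContext icic(true);  // assumed bool overload enabling deferral
  while (nmethod* nm = iter.next()) {    // placeholder for the real iteration
    process(nm);                         // patch barriers, fix oops, disarm, ...
  }
}  // a single invalidation here, after the loop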

dean-long avatar Nov 22 '25 00:11 dean-long

If I understand correctly, the whole icache is flushed, so the actual nmethod* is irrelevant. So instead of ICacheInvalidationContext icic(nm) for every different "nm", can't we just do ICacheInvalidationContext icic(true) one time, outside the nmethod loop?

We can't disarm an nmethod before flushing the instructions.

fisk avatar Nov 23 '25 10:11 fisk

If I understand correctly, the whole icache is flushed, so the actual nmethod* is irrelevant. So instead of ICacheInvalidationContext icic(nm) for every different "nm", can't we just do ICacheInvalidationContext icic(true) one time, outside the nmethod loop?

We can't disarm an nmethod before flushing the instructions.

Sure, but you can't patch an nmethod until every thread that might be executing it has stopped. So if the threads are all stopped, why not postpone the disarmament until the end, just before you flush?

theRealAph avatar Nov 23 '25 14:11 theRealAph

If I understand correctly, the whole icache is flushed, so the actual nmethod* is irrelevant. So instead of ICacheInvalidationContext icic(nm) for every different "nm", can't we just do ICacheInvalidationContext icic(true) one time, outside the nmethod loop?

We can't disarm an nmethod before flushing the instructions.

I don't think we flush the whole icache. We invalidate all translation entries (all VAs, all possible levels). I have not found any information that it would flush the icache. I think the TLBI is used as a heavyweight serialization barrier: it might force cores to synchronize their instruction fetch streams. We broadcast a TLB invalidation and wait for its completion. I think hardware data and instruction cache coherence still works.

I also found https://lore.kernel.org/linux-arm-kernel/[email protected]/ with more details on the erratum workaround. These details appear consistent with the hypothesis that this is a synchronization event to enforce ordering.

The problem is:

Neoverse-N1 cores with the 'COHERENT_ICACHE' feature may fetch stale instructions when software depends on prefetch-speculation-protection instead of explicit synchronization.

Prefetch-speculation-protection:

JIT can generate new instructions at some new location, then update a branch in the executable instructions to point at the new location.

Prefetch-speculation-protection guarantees that if another CPU sees the new branch, it also sees the new instructions that were written there.

I think that in the case of armed/disarmed nmethods we have explicit synchronization, not prefetch-speculation-protection. No thread executes armed nmethods. If I am correct, disarming is the act of releasing an nmethod to allow its execution.

Sure, but you can't patch an nmethod until every thread that might be executing it has stopped. So if the threads are all stopped, why not postpone the disarmament until the end, just before you flush?

If my understanding is correct, we cannot disarm before flushing because disarming is like releasing a critical section: we must guarantee that all the changes we've made are visible to all observers when we leave the critical section.

As I wrote in the JBS issue we can:

  • Get all nmethods armed
  • Patch all of them
  • Invalidate TLB
  • Get all nmethods disarmed

This would complicate the fix a lot, and the performance gain is not worth it. I measured the theoretical performance when we don't do any invalidation at all: it is 3%-4% better than the approach in this PR (invalidate the TLB per nmethod).

eastig avatar Nov 24 '25 15:11 eastig

Yeah, patching all nmethods as one unit is basically equivalent to making code cache processing a STW operation. The last time we processed the code cache STW was JDK 11, a dark place I don't want to go back to. The code cache can get pretty big and mess up latency. So I'm in favour of limiting the fix and not re-introducing STW code cache processing.

Otherwise, yes, you are correct; we perform synchronous cross-modifying code with no assumptions about instruction cache coherency, because we didn't trust it to actually work for all ARM implementations. Seems like that was a good bet. We still rely on it on x64, though.

It's a bit surprising to me if they invalidate all TLB entries, effectively ripping out the entire virtual address space, even when a range is passed in. If so, a horrible alternative might be to use mprotect to temporarily remove execution permission from the affected per-nmethod pages, detect overshooting in the signal handler, and resume execution once execution privileges are restored immediately after. That should limit the affected VA range to close to what is actually invalidated. But it would look horrible.

fisk avatar Nov 24 '25 20:11 fisk

It's a bit surprising to me if they invalidate all TLB entries, effectively ripping out the entire virtual address space, even when a range is passed in. If so,

"Because the cache-maintenance wasn't needed, we can do the TLBI instead. In fact, the I-Cache line-size isn't relevant anymore, we can reduce the number of traps by producing a fake value.

"For user-space, the kernel's work is now to trap CTR_EL0 to hide DIC, and produce a fake IminLine. EL3 traps the now-necessary I-Cache maintenance and performs the inner-shareable-TLBI that makes everything better."

My interpretation of this is that we only need to do the synchronization dance once, at the end of the patching. But I guess we don't know exactly if we have an affected core or if the kernel workaround is in action.

theRealAph avatar Nov 25 '25 13:11 theRealAph

@xmas92 I fixed the regressions I saw for Java methods without field accesses:

  • -XX:+NeoverseN1Errata1542419 before the fix
Benchmark                       (accessedFieldCount)  (methodCount)  Mode  Cnt    Score    Error  Units
GCPatchingNmethodCost.fullGC                       0           5000  avgt    3   88.865 ± 19.299  ms/op
GCPatchingNmethodCost.systemGC                     0           5000  avgt    3   90.572 ± 14.750  ms/op
GCPatchingNmethodCost.youngGC                      0           5000  avgt    3   10.219 ±  0.877  ms/op
  • -XX:+NeoverseN1Errata1542419 after the fix
Benchmark                       (accessedFieldCount)  (methodCount)  Mode  Cnt   Score    Error  Units
GCPatchingNmethodCost.fullGC                       0           5000  avgt    3  60.847 ± 23.735  ms/op
GCPatchingNmethodCost.systemGC                     0           5000  avgt    3  62.338 ±  5.663  ms/op
GCPatchingNmethodCost.youngGC                      0           5000  avgt    3   4.956 ±  1.440  ms/op
  • -XX:-NeoverseN1Errata1542419
Benchmark                       (accessedFieldCount)  (methodCount)  Mode  Cnt   Score    Error  Units
GCPatchingNmethodCost.fullGC                       0           5000  avgt    3  67.144 ± 15.187  ms/op
GCPatchingNmethodCost.systemGC                     0           5000  avgt    3  70.181 ± 30.271  ms/op
GCPatchingNmethodCost.youngGC                      0           5000  avgt    3   7.906 ±  2.118  ms/op

I'll check SpecJVM as well.

eastig avatar Nov 26 '25 17:11 eastig