
8370947: Mitigate Neoverse-N1 erratum 1542419 negative impact on GCs and JIT performance


Arm Neoverse N1 erratum 1542419: "The core might fetch a stale instruction from memory which violates the ordering of instruction fetches". It is fixed in Neoverse N1 r4p1.

Affected Neoverse-N1 implementations mitigate erratum 1542419 with the following workaround:

  • Disable coherent icache.
  • Trap IC IVAU instructions.
  • Execute, in the trap handler:
    • tlbi vae3is, xzr
    • dsb sy

tlbi vae3is, xzr invalidates translations for all address spaces (it is global for the address). It is not considered complete until all memory accesses using the in-scope old translation information have completed.

As this workaround has significant overhead, Arm Neoverse N1 (MP050) Software Developer Errata Notice version 29.0 suggests:

"Since one TLB inner-shareable invalidation is enough to avoid this erratum, the number of injected TLB invalidations should be minimized in the trap handler to mitigate the performance impact due to this workaround."

This PR introduces a mechanism to defer instruction cache (ICache) invalidation on AArch64 to address Arm Neoverse N1 erratum 1542419, whose workaround incurs significant performance overhead when ICache invalidation is performed too frequently. The implementation includes detection of affected Neoverse N1 CPUs and automatic enabling of the mitigation for the affected Neoverse N1 revisions.

Changes include:

  • Added a new diagnostic JVM flag NeoverseN1Errata1542419 to enable or disable the workaround for the erratum. The flag is automatically enabled for Neoverse N1 CPUs prior to r4p1, as detected during VM initialization.
  • Introduced the ICacheInvalidationContext class to manage deferred ICache invalidation, with platform-specific logic for AArch64. This context is used to batch ICache invalidations, reducing the performance impact. As the exact address passed to the ICache invalidation is not relevant, we use the nmethod's code start address. (A sketch of the intended usage pattern follows this list.)
  • Provided a default (no-op) implementation for ICacheInvalidationContext on platforms where the workaround is not needed, ensuring portability and minimal impact on other architectures.
  • Modified barrier patching and relocation logic (ZBarrierSetAssembler, ZNMethod, RelocIterator, and related code) to accept a defer_icache_invalidation parameter, allowing ICache invalidation to be deferred and later performed in bulk.
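
For illustration, a minimal sketch of the deferral pattern described above. This is not the code from this PR: the constructor signature, the use of the flag as a C++ global, and the call to ICache::invalidate_word are assumptions based on the description in this section.

// Sketch only: a scoped object that batches ICache maintenance.
// While the scope is active, per-patch invalidation is skipped; a single
// invalidation is issued when the scope ends.
class ICacheInvalidationContext : public StackObj {
private:
  address _addr;      // any address inside the patched nmethod; the exact
                      // address is irrelevant, as the erratum workaround turns
                      // the maintenance into one inner-shareable TLB invalidation
  bool    _deferred;  // only defer when the erratum workaround is enabled

public:
  explicit ICacheInvalidationContext(nmethod* nm)
    : _addr(nm->code_begin()),
      _deferred(NeoverseN1Errata1542419) {}  // the diagnostic flag added by this PR

  ~ICacheInvalidationContext() {
    if (_deferred) {
      // One batched maintenance operation per nmethod instead of one per
      // patched barrier or oop relocation.
      ICache::invalidate_word(_addr);
    }
  }
};

// Hypothetical usage:
//   ICacheInvalidationContext icic(nm);
//   ... patch barriers / fix oop relocations with invalidation deferred ...
//   // a single ICache invalidation happens when icic goes out of scope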

Testing results: Linux fastdebug build

  • Neoverse-N1 (Graviton 2)
    • [x] tier1: passed
    • [x] tier2: passed
    • [x] tier3: passed
    • [x] tier4: 3 failures
      • containers/docker/TestJcmdWithSideCar.java: JDK-8341518
      • com/sun/nio/sctp/SctpChannel/CloseDescriptors.java: JDK-8298466
      • java/awt/print/PrinterJob/PrintNullString.java: JDK-8333026
  • Neoverse-N1 (Graviton 2), -XX:-NeoverseN1Errata1542419
    • [x] tier1: passed
    • [x] tier2: passed
    • [x] tier3: passed
    • [x] tier4: 3 failures
      • containers/docker/TestJcmdWithSideCar.java: JDK-8341518
      • com/sun/nio/sctp/SctpChannel/CloseDescriptors.java: JDK-8298466
      • java/awt/print/PrinterJob/PrintNullString.java: JDK-8333026
  • x86_64
    • [x] tier1: passed
    • [x] tier2: passed
    • [x] tier3: passed
    • [x] tier4: 2 failures
      • com/sun/nio/sctp/SctpChannel/CloseDescriptors.java: JDK-8298466
      • java/awt/print/PrinterJob/PrintNullString.java: JDK-8333026

Benchmarking results: Neoverse-N1 r3p1 (Graviton 2), 5000 nmethods, 1 parallel thread, 1 concurrent thread, 3 forks, ms/op (lower is better)

  • ZGC
| Benchmark | accessedFieldCount | Baseline | Error (Baseline) | Fix     | Error (Fix) | Improvement |
|-----------|--------------------|----------|------------------|---------|-------------|-------------|
| fullGC    | 0                  | 68.02    | 35.221           | 59.448  | 28.674      | 12.6%       |
| fullGC    | 2                  | 648.395  | 26.417           | 142.566 | 90.82       | 78.0%       |
| fullGC    | 4                  | 1209.539 | 163.909          | 181.565 | 17.419      | 85.0%       |
| fullGC    | 8                  | 2313.042 | 40.243           | 258.854 | 14.707      | 88.8%       |
| systemGC  | 0                  | 69.862   | 16.74            | 63.433  | 12.976      | 9.2%        |
| systemGC  | 2                  | 653.807  | 36.922           | 142.675 | 13.086      | 78.2%       |
| systemGC  | 4                  | 1212.661 | 126.471          | 183.992 | 16.438      | 84.8%       |
| systemGC  | 8                  | 2305.342 | 377.801          | 262.416 | 61.388      | 88.6%       |
| youngGC   | 0                  | 7.89     | 4.68             | 4.881   | 0.793       | 38.1%       |
| youngGC   | 2                  | 158.655  | 3.468            | 17.675  | 2.377       | 88.9%       |
| youngGC   | 4                  | 314.051  | 89.393           | 25.555  | 7.368       | 91.9%       |
| youngGC   | 8                  | 614.067  | 205.207          | 38.715  | 3.085       | 93.7%       |
  • G1GC
| Benchmark | accessedFieldCount | Baseline | Error (Baseline) | Fix    | Error (Fix) | Improvement |
|-----------|--------------------|----------|------------------|--------|-------------|-------------|
| fullGC    | 0                  | 34.526   | 3.841            | 33.561 | 4.284       | 2.8%        |
| fullGC    | 2                  | 132.489  | 7.516            | 60.922 | 3.285       | 54.0%       |
| fullGC    | 4                  | 215.601  | 16.953           | 69.777 | 15.359      | 67.6%       |
| fullGC    | 8                  | 392.611  | 106.065          | 95.151 | 1.118       | 75.8%       |
| systemGC  | 0                  | 35.906   | 8.445            | 35.455 | 10.434      | 1.3%        |
| systemGC  | 2                  | 129.554  | 18.165           | 56.986 | 18.609      | 56.0%       |
| systemGC  | 4                  | 220.130  | 29.353           | 71.640 | 5.775       | 67.5%       |
| systemGC  | 8                  | 396.455  | 111.161          | 97.731 | 2.349       | 75.3%       |
| youngGC   | 0                  | 0.642    | 0.257            | 0.685  | 0.290       | -6.7%       |
| youngGC   | 2                  | 0.835    | 0.614            | 0.753  | 0.457       | 9.8%        |
| youngGC   | 4                  | 0.870    | 0.196            | 0.767  | 0.445       | 11.8%       |
| youngGC   | 8                  | 0.907    | 0.353            | 0.786  | 0.558       | 13.3%       |
  • ParallelGC
| Benchmark | accessedFieldCount | Baseline | Error (Baseline) | Fix    | Error (Fix) | Improvement |
|-----------|--------------------|----------|------------------|--------|-------------|-------------|
| fullGC    | 0                  | 24.158   | 5.731            | 23.700 | 2.172       | 1.9%        |
| fullGC    | 2                  | 116.949  | 18.407           | 45.582 | 2.400       | 61.0%       |
| fullGC    | 4                  | 205.175  | 24.970           | 59.216 | 3.524       | 71.1%       |
| fullGC    | 8                  | 380.122  | 51.809           | 83.765 | 7.697       | 78.0%       |
| systemGC  | 0                  | 24.518   | 3.683            | 24.100 | 4.200       | 1.7%        |
| systemGC  | 2                  | 118.792  | 31.924           | 46.404 | 1.894       | 60.9%       |
| systemGC  | 4                  | 205.749  | 52.843           | 59.857 | 8.125       | 70.9%       |
| systemGC  | 8                  | 382.396  | 97.584           | 84.494 | 3.228       | 77.9%       |
| youngGC   | 0                  | 0.325    | 0.153            | 0.329  | 0.039       | -1.2%       |
| youngGC   | 2                  | 0.407    | 0.071            | 0.405  | 0.027       | 0.5%        |
| youngGC   | 4                  | 0.405    | 0.156            | 0.398  | 0.030       | 1.7%        |
| youngGC   | 8                  | 0.451    | 0.095            | 0.405  | 0.046       | 10.2%       |
  • GenShenGC
| Benchmark | accessedFieldCount | Baseline | Error (Baseline) | Fix    | Error (Fix) | Improvement |
|-----------|--------------------|----------|------------------|--------|-------------|-------------|
| fullGC    | 0                  | 32.799   | 1.287            | 32.638 | 16.186      | 0.5%        |
| fullGC    | 2                  | 130.388  | 12.179           | 59.565 | 27.384      | 54.3%       |
| fullGC    | 4                  | 221.452  | 12.654           | 71.501 | 3.271       | 67.7%       |
| fullGC    | 8                  | 397.024  | 93.899           | 96.542 | 3.335       | 75.7%       |
| systemGC  | 0                  | 34.699   | 16.677           | 33.430 | 1.115       | 3.7%        |
| systemGC  | 2                  | 131.343  | 30.426           | 59.200 | 4.173       | 54.9%       |
| systemGC  | 4                  | 220.197  | 48.079           | 72.529 | 1.808       | 67.1%       |
| systemGC  | 8                  | 397.017  | 71.641           | 98.212 | 3.428       | 75.3%       |
| youngGC   | 0                  | 35.788   | 18.645           | 33.368 | 2.175       | 6.8%        |
| youngGC   | 2                  | 133.873  | 30.916           | 59.600 | 1.066       | 55.5%       |
| youngGC   | 4                  | 220.862  | 45.541           | 74.423 | 53.835      | 66.3%       |
| youngGC   | 8                  | 403.135  | 153.634          | 98.080 | 5.118       | 75.7%       |

Progress

  • [x] Change must not contain extraneous whitespace
  • [x] Commit message must refer to an issue
  • [ ] Change must be properly reviewed (2 reviews required, with at least 2 Reviewers)

Issue

  • JDK-8370947: Mitigate Neoverse-N1 erratum 1542419 negative impact on GCs and JIT performance (Enhancement - P2)

Reviewers

Contributors

Reviewing

Using git

Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk.git pull/28328/head:pull/28328
$ git checkout pull/28328

Update a local copy of the PR:
$ git checkout pull/28328
$ git pull https://git.openjdk.org/jdk.git pull/28328/head

Using Skara CLI tools

Checkout this PR locally:
$ git pr checkout 28328

View PR using the GUI difftool:
$ git pr show -t 28328

Using diff file

Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/28328.diff

Using Webrev

Link to Webrev Comment

eastig avatar Nov 14 '25 18:11 eastig

Based on https://github.com/openjdk/jdk/compare/master...xmas92:jdk:deferred_icache_invalidation

eastig avatar Nov 14 '25 18:11 eastig

:wave: Welcome back eastigeevich! A progress list of the required criteria for merging this PR into master will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.

bridgekeeper[bot] avatar Nov 14 '25 18:11 bridgekeeper[bot]

❗ This change is not yet ready to be integrated. See the Progress checklist in the description for automated requirements.

openjdk[bot] avatar Nov 14 '25 18:11 openjdk[bot]

/contributor add xmas92

eastig avatar Nov 14 '25 18:11 eastig

@eastig xmas92 was not found in the census.

Syntax: /contributor (add|remove) [@user | openjdk-user | Full Name <email@address>]. For example:

  • /contributor add @openjdk-bot
  • /contributor add duke
  • /contributor add J. Duke <[email protected]>

User names can only be used for users in the census associated with this repository. For other contributors you need to supply the full name and email address.

openjdk[bot] avatar Nov 14 '25 18:11 openjdk[bot]

@eastig The following label will be automatically applied to this pull request:

  • hotspot

When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing list. If you would like to change these labels, use the /label pull request command.

openjdk[bot] avatar Nov 14 '25 18:11 openjdk[bot]

/contributor add aboldtch

eastig avatar Nov 14 '25 18:11 eastig

@eastig Contributor Axel Boldt-Christmas <[email protected]> successfully added.

openjdk[bot] avatar Nov 14 '25 18:11 openjdk[bot]

Hi @fisk @theRealAph @xmas92 @shipilev

I created this draft PR based on @xmas92's work: https://github.com/openjdk/jdk/compare/master...xmas92:jdk:deferred_icache_invalidation

Alex wrote about his implementation in JDK-8370947:

The implementation I linked is very aarch64 centric. I would like to create a bit nicer abstraction for this to allow easier adaption for other platforms.

I see his changes touch other backends. I tried to minimize changes and avoid touching other backends. I don't think the concept of deferred icache invalidation will be used for anything but the Neoverse-N1 erratum.

This PR does not cover all cases in ZGC at the moment. They can be covered as soon as we agree on the proper way to fix this.

I'd like to hear your opinion on which way we should choose:

  • Abstraction of deferred icache invalidation supported in all backends.
  • Concrete implementation focused on Neoverse-N1.

eastig avatar Nov 14 '25 18:11 eastig

@eastig this pull request can not be integrated into master due to one or more merge conflicts. To resolve these merge conflicts and update this pull request you can run the following commands in the local repository for your personal fork:

git checkout JDK-8370947
git fetch https://git.openjdk.org/jdk.git master
git merge FETCH_HEAD
# resolve conflicts and follow the instructions given by git merge
git commit -m "Merge master"
git push

openjdk[bot] avatar Nov 20 '25 14:11 openjdk[bot]

/issue JDK-8370947

eastig avatar Nov 20 '25 14:11 eastig

@eastig This issue is referenced in the PR title - it will now be updated.

openjdk[bot] avatar Nov 20 '25 14:11 openjdk[bot]

@fisk, @xmas92 I added a JMH microbenchmark and its results from Graviton 2. I added deferred icache invalidation to ZGC where it's needed.

I found one place where I am not sure: https://github.com/openjdk/jdk/blob/b9ee9541cffb6c5a737b08a69ae04472b3bcab3e/src/hotspot/share/gc/z/zHeapIterator.cpp#L364

class ZHeapIteratorNMethodClosure : public NMethodClosure {
private:
  OopClosure* const        _cl;
  BarrierSetNMethod* const _bs_nm;

public:
  ZHeapIteratorNMethodClosure(OopClosure* cl)
    : _cl(cl),
      _bs_nm(BarrierSet::barrier_set()->barrier_set_nmethod()) {}

  virtual void do_nmethod(nmethod* nm) {
    // If ClassUnloading is turned off, all nmethods are considered strong,
    // not only those on the call stacks. The heap iteration might happen
    // before the concurrent processing of the code cache, make sure that
    // all nmethods have been processed before visiting the oops.
    _bs_nm->nmethod_entry_barrier(nm);

    ZNMethod::nmethod_oops_do(nm, _cl);
  }
};

For _bs_nm->nmethod_entry_barrier(nm) we use deferred icache invalidation. I'm not sure it is safe for ZNMethod::nmethod_oops_do(nm, _cl). It looks like it is safe because the code is called at a safepoint: ZHeap::object_iterate, ZHeap::object_and_field_iterate_for_verify and ZHeap::parallel_object_iterator.
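
For reference, the placement in question would look roughly like this (a sketch, not a diff from this PR):

  virtual void do_nmethod(nmethod* nm) {
    // Deferral scope covering both the entry barrier (which may patch the
    // nmethod) and the oop iteration; the single ICache invalidation then
    // happens when the scope ends.
    ICacheInvalidationContext icic(nm);

    _bs_nm->nmethod_entry_barrier(nm);
    ZNMethod::nmethod_oops_do(nm, _cl);
  }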

eastig avatar Nov 20 '25 15:11 eastig

@xmas92 The total number of required reviews for this PR (including the jcheck configuration and the last /reviewers command) is now set to 2 (with at least 2 Reviewers).

openjdk[bot] avatar Nov 20 '25 15:11 openjdk[bot]

@fisk, @xmas92 The added microbenchmark shows interesting regressions when an nmethod has no accesses to object's fields:

Benchmark                            Score     Error  Units
GCPatchingNmethodCost.fullGC:base    73.937 ± 17.764  ms/op
GCPatchingNmethodCost.systemGC:base  77.495 ± 11.963  ms/op
GCPatchingNmethodCost.youngGC:base    9.955 ±  1.649  ms/op
GCPatchingNmethodCost.fullGC:fix     88.865 ± 19.299  ms/op  +20.1%
GCPatchingNmethodCost.systemGC:fix   90.572 ± 14.750  ms/op  +16.9%
GCPatchingNmethodCost.youngGC:fix    10.219 ±  0.877  ms/op   +2.7%

eastig avatar Nov 20 '25 15:11 eastig

I think we'll also want a workaround for CodeBuffer::relocate_code_to().

theRealAph avatar Nov 20 '25 16:11 theRealAph

I think we'll also want a workaround for CodeBuffer::relocate_code_to().

Also we need to fix G1NMethodClosure::do_evacuation_and_fixup and ShenandoahNMethod::oops_do. They use nmethod::fix_oop_relocations.

Should we do it in this PR or in separate PRs?

eastig avatar Nov 20 '25 17:11 eastig

I think we'll also want a workaround for CodeBuffer::relocate_code_to().

Also we need to fix G1NMethodClosure::do_evacuation_and_fixup and ShenandoahNMethod::oops_do. They use nmethod::fix_oop_relocations.

Should we do it in this PR or in separate PRs?

Please, let's handle it all here.

theRealAph avatar Nov 20 '25 19:11 theRealAph

It seems a little disruptive to have to pass defer_icache_invalidation around so much. What about attaching this information to the Thread or using a THREAD_LOCAL?

dean-long avatar Nov 20 '25 21:11 dean-long

@xmas92

The added microbenchmark shows interesting regressions when an nmethod has no accesses to object's fields:

Benchmark                            Score     Error  Units
GCPatchingNmethodCost.fullGC:base    73.937 ± 17.764  ms/op
GCPatchingNmethodCost.systemGC:base  77.495 ± 11.963  ms/op
GCPatchingNmethodCost.youngGC:base    9.955 ±  1.649  ms/op
GCPatchingNmethodCost.fullGC:fix     88.865 ± 19.299  ms/op  +20.1%
GCPatchingNmethodCost.systemGC:fix   90.572 ± 14.750  ms/op  +16.9%
GCPatchingNmethodCost.youngGC:fix    10.219 ±  0.877  ms/op   +2.7%

I think I might have an idea what causes the regressions. I'll be debugging it.

eastig avatar Nov 21 '25 13:11 eastig

It seems a little disruptive to have to pass defer_icache_invalidation around so much. What about attaching this information to the Thread or using a THREAD_LOCAL?

I switched to a THREAD_LOCAL. Initially it regressed fullGC compared to the version with the parameter:

  • Parameter:
Benchmark                       (accessedFieldCount)  (methodCount)  Mode  Cnt    Score    Error  Units
GCPatchingNmethodCost.fullGC                       0           5000  avgt    3   88.865 ± 19.299  ms/op
GCPatchingNmethodCost.fullGC                       2           5000  avgt    3  146.184 ± 11.531  ms/op
GCPatchingNmethodCost.fullGC                       4           5000  avgt    3  186.429 ± 16.257  ms/op
GCPatchingNmethodCost.fullGC                       8           5000  avgt    3  262.933 ± 13.071  ms/op
  • THREAD_LOCAL
Benchmark                       (accessedFieldCount)  (methodCount)  Mode  Cnt    Score     Error  Units
GCPatchingNmethodCost.fullGC                       0           5000  avgt    3   93.899 ±  14.870  ms/op
GCPatchingNmethodCost.fullGC                       2           5000  avgt    3  152.872 ±  13.566  ms/op
GCPatchingNmethodCost.fullGC                       4           5000  avgt    3  194.425 ±  37.851  ms/op
GCPatchingNmethodCost.fullGC                       8           5000  avgt    3  271.826 ±  47.908  ms/op

I found that ZBarrierSetAssembler::patch_barrier_relocation is only used when icache invalidation is deferred. I replaced a check of the thread local value with a check of NeoverseN1Errata1542419. This restored the performance:

Benchmark                       (accessedFieldCount)  (methodCount)  Mode  Cnt    Score     Error  Units
GCPatchingNmethodCost.fullGC                       0           5000  avgt    3   84.919 ±  31.411  ms/op
GCPatchingNmethodCost.fullGC                       2           5000  avgt    3  141.862 ±   7.026  ms/op
GCPatchingNmethodCost.fullGC                       4           5000  avgt    3  184.921 ±  46.592  ms/op
GCPatchingNmethodCost.fullGC                       8           5000  avgt    3  263.897 ± 48.271  ms/op

It might be that accesses to THREAD_LOCAL on Neoverse N1 are expensive.
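
For context, the THREAD_LOCAL variant looks roughly like this (a sketch with assumed names, not the exact patch):

// Sketch only: a per-thread flag set by the context instead of threading a
// defer_icache_invalidation parameter through every call.
static THREAD_LOCAL bool _defer_icache_invalidation = false;

ICacheInvalidationContext::ICacheInvalidationContext(nmethod* nm)
  : _addr(nm->code_begin()) {
  _defer_icache_invalidation = NeoverseN1Errata1542419;
}

ICacheInvalidationContext::~ICacheInvalidationContext() {
  if (_defer_icache_invalidation) {
    _defer_icache_invalidation = false;
    ICache::invalidate_word(_addr);
  }
}

// Patching code then consults the per-thread flag, e.g.:
//   if (!_defer_icache_invalidation) {
//     ICache::invalidate_range(addr, size);
//   }
// (In ZBarrierSetAssembler::patch_barrier_relocation that check is replaced by
// a check of NeoverseN1Errata1542419, as described above.)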

Should I try attaching info to Thread?

eastig avatar Nov 21 '25 22:11 eastig

Should I try attaching info to Thread?

It probably won't help, because Thread::current() is going to access a thread-local.

dean-long avatar Nov 22 '25 00:11 dean-long

If I understand correctly, the whole icache is flushed, so the actual nmethod* is irrelevant. So instead of ICacheInvalidationContext icic(nm) for every different "nm", can't we just do ICacheInvalidationContext icic(true) one time, outside the nmethod loop?
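
Roughly something like this (hypothetical bool constructor; the loop and helper are placeholders):

// Hypothetical hoisted form: one deferral scope for the whole pass instead of
// one ICacheInvalidationContext icic(nm) per nmethod.
{
  ICacheInvalidationContext icic(true);  // assumed bool overload enabling deferral
  while (nmethod* nm = iter.next()) {    // placeholder for the real iteration
    process(nm);                         // patch barriers, fix oops, disarm, ...
  }
}  // a single invalidation here, after the loop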

dean-long avatar Nov 22 '25 00:11 dean-long

If I understand correctly, the whole icache is flushed, so the actual nmethod* is irrelevant. So instead of ICacheInvalidationContext icic(nm) for every different "nm", can't we just do ICacheInvalidationContext icic(true) one time, outside the nmethod loop?

We can't disarm an nmethod before flushing the instructions.

fisk avatar Nov 23 '25 10:11 fisk

If I understand correctly, the whole icache is flushed, so the actual nmethod* is irrelevant. So instead of ICacheInvalidationContext icic(nm) for every different "nm", can't we just do ICacheInvalidationContext icic(true) one time, outside the nmethod loop?

We can't disarm an nmethod before flushing the instructions.

Sure, but you can't patch an nmethod until every thread that might be executing it has stopped. So if the threads are all stopped, why not postpone the disarmament until the end, just before you flush?

theRealAph avatar Nov 23 '25 14:11 theRealAph

If I understand correctly, the whole icache is flushed, so the actual nmethod* is irrelevant. So instead of ICacheInvalidationContext icic(nm) for every different "nm", can't we just do ICacheInvalidationContext icic(true) one time, outside the nmethod loop?

We can't disarm an nmethod before flushing the instructions.

I don't think we flush the whole icache. We invalidate all translation entries (all VAs, all possible levels). I have not found any information that it would flush the icache. I think the TLBI is used as a heavyweight serialization barrier: it might force cores to synchronize their instruction fetch streams. We broadcast a TLB invalidation and wait for its completion. I think hardware data and instruction cache coherence still works.

I also found https://lore.kernel.org/linux-arm-kernel/[email protected]/ with more details on the erratum workaround. These details appear consistent with the hypothesis that this is a synchronization event to enforce ordering.

The problem is:

Neoverse-N1 cores with the 'COHERENT_ICACHE' feature may fetch stale instructions when software depends on prefetch-speculation-protection instead of explicit synchronization.

Prefetch-speculation-protection:

JIT can generate new instructions at some new location, then update a branch in the executable instructions to point at the new location.

Prefetch-speculation-protection guarantees that if another CPU sees the new branch, it also sees the new instructions that were written there.

I think that in the case of armed/disarmed nmethods we have explicit synchronization, not prefetch-speculation-protection. No thread executes armed nmethods. If I am correct, disarming is the act of releasing an nmethod to allow its execution.

Sure, but you can't patch an nmethod until every thread that might be executing it has stopped. So if the threads are all stopped, why not postpone the disarmament until the end, just before you flush?

If my understanding is correct, we cannot disarm before flushing because disarming is like releasing a critical section: we must guarantee that all the changes we've made are visible to all observers when we leave the critical section.

As I wrote in the JBS issue we can:

  • Get all nmethods armed
  • Patch all of them
  • Invalidate TLB
  • Get all nmethods disarmed

This would complicate the fix a lot, and the performance gain is not worth it. I measured the theoretical performance when we don't do any invalidation at all: it is 3%-4% better than the approach in this PR (invalidate the TLB per nmethod).

eastig avatar Nov 24 '25 15:11 eastig

Yeah, patching all nmethods as one unit is basically equivalent to making code cache processing a STW operation. The last time we processed the code cache STW was JDK 11, a dark place I don't want to go back to. The code cache can get pretty big and mess up latency. So I'm in favour of limiting the fix and not re-introducing STW code cache processing.

Otherwise, yes, you are correct; we perform synchronous cross-modifying code with no assumptions about instruction cache coherency, because we didn't trust it to actually work for all ARM implementations. Seems like that was a good bet. We still rely on it on x64, though.

It's a bit surprising to me if they invalidate all TLB entries, effectively ripping out the entire virtual address space, even when a range is passed in. If so, a horrible alternative might be to use mprotect to temporarily remove execution permission from the affected per-nmethod pages, detect overshooting in the signal handler, and resume execution once execution privileges are restored immediately after. That should limit the affected VA range to close to what is actually invalidated. But it would look horrible.

fisk avatar Nov 24 '25 20:11 fisk

It's a bit surprising to me if they invalidate all TLB entries, effectively ripping out the entire virtual address space, even when a range is passed in. If so,

"Because the cache-maintenance wasn't needed, we can do the TLBI instead. In fact, the I-Cache line-size isn't relevant anymore, we can reduce the number of traps by producing a fake value.

"For user-space, the kernel's work is now to trap CTR_EL0 to hide DIC, and produce a fake IminLine. EL3 traps the now-necessary I-Cache maintenance and performs the inner-shareable-TLBI that makes everything better."

My interpretation of this is that we only need to do the synchronization dance once, at the end of the patching. But I guess we don't know exactly if we have an affected core or if the kernel workaround is in action.

theRealAph avatar Nov 25 '25 13:11 theRealAph

@xmas92 I fixed the regressions I saw for Java methods without field accesses:

  • -XX:+NeoverseN1Errata1542419 before the fix
Benchmark                       (accessedFieldCount)  (methodCount)  Mode  Cnt    Score    Error  Units
GCPatchingNmethodCost.fullGC                       0           5000  avgt    3   88.865 ± 19.299  ms/op
GCPatchingNmethodCost.systemGC                     0           5000  avgt    3   90.572 ± 14.750  ms/op
GCPatchingNmethodCost.youngGC                      0           5000  avgt    3   10.219 ±  0.877  ms/op
  • -XX:+NeoverseN1Errata1542419 after the fix
Benchmark                       (accessedFieldCount)  (methodCount)  Mode  Cnt   Score    Error  Units
GCPatchingNmethodCost.fullGC                       0           5000  avgt    3  60.847 ± 23.735  ms/op
GCPatchingNmethodCost.systemGC                     0           5000  avgt    3  62.338 ±  5.663  ms/op
GCPatchingNmethodCost.youngGC                      0           5000  avgt    3   4.956 ±  1.440  ms/op
  • -XX:-NeoverseN1Errata1542419
Benchmark                       (accessedFieldCount)  (methodCount)  Mode  Cnt   Score    Error  Units
GCPatchingNmethodCost.fullGC                       0           5000  avgt    3  67.144 ± 15.187  ms/op
GCPatchingNmethodCost.systemGC                     0           5000  avgt    3  70.181 ± 30.271  ms/op
GCPatchingNmethodCost.youngGC                      0           5000  avgt    3   7.906 ±  2.118  ms/op

I'll check SpecJVM as well.

eastig avatar Nov 26 '25 17:11 eastig