8370947: Mitigate Neoverse-N1 erratum 1542419 negative impact on GCs and JIT performance
Arm Neoverse N1 erratum 1542419: "The core might fetch a stale instruction from memory which violates the ordering of instruction fetches". It is fixed in Neoverse N1 r4p1.
Neoverse-N1 implementations mitigate erratum 1542419 with a workaround:
- Disable coherent icache.
- Trap IC IVAU instructions.
- Execute:

  ```
  tlbi vae3is, xzr
  dsb sy
  ```
`tlbi vae3is, xzr` invalidates translations for all address spaces (it is global for the given address). It waits for all memory accesses using in-scope old translation information to complete before it is considered complete.
As this workaround has significant overhead, Arm Neoverse N1 (MP050) Software Developer Errata Notice version 29.0 suggests:
"Since one TLB inner-shareable invalidation is enough to avoid this erratum, the number of injected TLB invalidations should be minimized in the trap handler to mitigate the performance impact due to this workaround."
This PR introduces a mechanism to defer instruction cache (ICache) invalidation for AArch64 to address the Arm Neoverse N1 erratum 1542419, which causes significant performance overhead if ICache invalidation is performed too frequently. The implementation includes detection of affected Neoverse N1 CPUs and automatic enabling of the workaround for relevant Neoverse N1 revisions.
Changes include:
- Added a new diagnostic JVM flag `NeoverseN1Errata1542419` to enable or disable the workaround for the erratum. The flag is automatically enabled for Neoverse N1 CPUs prior to r4p1, as detected during VM initialization.
- Introduced the `ICacheInvalidationContext` class to manage deferred ICache invalidation, with platform-specific logic for AArch64. This context is used to batch ICache invalidations, reducing performance impact. As the address passed to ICache invalidation is not relevant, we use the nmethod's code start address.
- Provided a default (no-op) implementation of `ICacheInvalidationContext` on platforms where the workaround is not needed, ensuring portability and minimal impact on other architectures.
- Modified barrier patching and relocation logic (`ZBarrierSetAssembler`, `ZNMethod`, `RelocIterator`, and related code) to accept a `defer_icache_invalidation` parameter, allowing ICache invalidation to be deferred and later performed in bulk.
Testing results: linux fastdebug build
- Neoverse-N1 (Graviton 2)
- [x] tier1: passed
- [x] tier2: passed
- [x] tier3: passed
- [x] tier4: 3 failures
  - `containers/docker/TestJcmdWithSideCar.java`: JDK-8341518
  - `com/sun/nio/sctp/SctpChannel/CloseDescriptors.java`: JDK-8298466
  - `java/awt/print/PrinterJob/PrintNullString.java`: JDK-8333026
- Neoverse-N1 (Graviton 2), `-XX:-NeoverseN1Errata1542419`
  - [x] tier1: passed
- [x] tier2: passed
- [x] tier3: passed
- [x] tier4: 3 failures
  - `containers/docker/TestJcmdWithSideCar.java`: JDK-8341518
  - `com/sun/nio/sctp/SctpChannel/CloseDescriptors.java`: JDK-8298466
  - `java/awt/print/PrinterJob/PrintNullString.java`: JDK-8333026
- x86_64
- [x] tier1: passed
- [x] tier2: passed
- [x] tier3: passed
- [x] tier4: 2 failures
  - `com/sun/nio/sctp/SctpChannel/CloseDescriptors.java`: JDK-8298466
  - `java/awt/print/PrinterJob/PrintNullString.java`: JDK-8333026
Benchmarking results: Neoverse-N1 r3p1 (Graviton 2), 5000 nmethods, 1 parallel thread, 1 concurrent thread, 3 forks, ms/op (lower is better)
- ZGC
| Benchmark | accessedFieldCount | Baseline | Error (Baseline) | Fix | Error (Fix) | Improvement |
|---|---|---|---|---|---|---|
| fullGC | 0 | 68.02 | 35.221 | 59.448 | 28.674 | 12.6% |
| fullGC | 2 | 648.395 | 26.417 | 142.566 | 90.82 | 78.0% |
| fullGC | 4 | 1209.539 | 163.909 | 181.565 | 17.419 | 85.0% |
| fullGC | 8 | 2313.042 | 40.243 | 258.854 | 14.707 | 88.8% |
| systemGC | 0 | 69.862 | 16.74 | 63.433 | 12.976 | 9.2% |
| systemGC | 2 | 653.807 | 36.922 | 142.675 | 13.086 | 78.2% |
| systemGC | 4 | 1212.661 | 126.471 | 183.992 | 16.438 | 84.8% |
| systemGC | 8 | 2305.342 | 377.801 | 262.416 | 61.388 | 88.6% |
| youngGC | 0 | 7.89 | 4.68 | 4.881 | 0.793 | 38.1% |
| youngGC | 2 | 158.655 | 3.468 | 17.675 | 2.377 | 88.9% |
| youngGC | 4 | 314.051 | 89.393 | 25.555 | 7.368 | 91.9% |
| youngGC | 8 | 614.067 | 205.207 | 38.715 | 3.085 | 93.7% |
- G1GC
| Benchmark | accessedFieldCount | Baseline | Error (Baseline) | Fix | Error (Fix) | Improvement |
|---|---|---|---|---|---|---|
| fullGC | 0 | 34.526 | 3.841 | 33.561 | 4.284 | 2.8% |
| fullGC | 2 | 132.489 | 7.516 | 60.922 | 3.285 | 54.0% |
| fullGC | 4 | 215.601 | 16.953 | 69.777 | 15.359 | 67.6% |
| fullGC | 8 | 392.611 | 106.065 | 95.151 | 1.118 | 75.8% |
| systemGC | 0 | 35.906 | 8.445 | 35.455 | 10.434 | 1.3% |
| systemGC | 2 | 129.554 | 18.165 | 56.986 | 18.609 | 56.0% |
| systemGC | 4 | 220.130 | 29.353 | 71.640 | 5.775 | 67.5% |
| systemGC | 8 | 396.455 | 111.161 | 97.731 | 2.349 | 75.3% |
| youngGC | 0 | 0.642 | 0.257 | 0.685 | 0.290 | -6.7% |
| youngGC | 2 | 0.835 | 0.614 | 0.753 | 0.457 | 9.8% |
| youngGC | 4 | 0.870 | 0.196 | 0.767 | 0.445 | 11.8% |
| youngGC | 8 | 0.907 | 0.353 | 0.786 | 0.558 | 13.3% |
- ParallelGC
| Benchmark | accessedFieldCount | Baseline | Error (Baseline) | Fix | Error (Fix) | Improvement |
|---|---|---|---|---|---|---|
| fullGC | 0 | 24.158 | 5.731 | 23.700 | 2.172 | 1.9% |
| fullGC | 2 | 116.949 | 18.407 | 45.582 | 2.400 | 61.0% |
| fullGC | 4 | 205.175 | 24.970 | 59.216 | 3.524 | 71.1% |
| fullGC | 8 | 380.122 | 51.809 | 83.765 | 7.697 | 78.0% |
| systemGC | 0 | 24.518 | 3.683 | 24.100 | 4.200 | 1.7% |
| systemGC | 2 | 118.792 | 31.924 | 46.404 | 1.894 | 60.9% |
| systemGC | 4 | 205.749 | 52.843 | 59.857 | 8.125 | 70.9% |
| systemGC | 8 | 382.396 | 97.584 | 84.494 | 3.228 | 77.9% |
| youngGC | 0 | 0.325 | 0.153 | 0.329 | 0.039 | -1.2% |
| youngGC | 2 | 0.407 | 0.071 | 0.405 | 0.027 | 0.5% |
| youngGC | 4 | 0.405 | 0.156 | 0.398 | 0.030 | 1.7% |
| youngGC | 8 | 0.451 | 0.095 | 0.405 | 0.046 | 10.2% |
- GenShenGC
| Benchmark | accessedFieldCount | Baseline | Error (Baseline) | Fix | Error (Fix) | Improvement |
|---|---|---|---|---|---|---|
| fullGC | 0 | 32.799 | 1.287 | 32.638 | 16.186 | 0.5% |
| fullGC | 2 | 130.388 | 12.179 | 59.565 | 27.384 | 54.3% |
| fullGC | 4 | 221.452 | 12.654 | 71.501 | 3.271 | 67.7% |
| fullGC | 8 | 397.024 | 93.899 | 96.542 | 3.335 | 75.7% |
| systemGC | 0 | 34.699 | 16.677 | 33.430 | 1.115 | 3.7% |
| systemGC | 2 | 131.343 | 30.426 | 59.200 | 4.173 | 54.9% |
| systemGC | 4 | 220.197 | 48.079 | 72.529 | 1.808 | 67.1% |
| systemGC | 8 | 397.017 | 71.641 | 98.212 | 3.428 | 75.3% |
| youngGC | 0 | 35.788 | 18.645 | 33.368 | 2.175 | 6.8% |
| youngGC | 2 | 133.873 | 30.916 | 59.600 | 1.066 | 55.5% |
| youngGC | 4 | 220.862 | 45.541 | 74.423 | 53.835 | 66.3% |
| youngGC | 8 | 403.135 | 153.634 | 98.080 | 5.118 | 75.7% |
Progress
- [x] Change must not contain extraneous whitespace
- [x] Commit message must refer to an issue
- [ ] Change must be properly reviewed (2 reviews required, with at least 2 Reviewers)
Issue
- JDK-8370947: Mitigate Neoverse-N1 erratum 1542419 negative impact on GCs and JIT performance (Enhancement - P2)
Reviewers
- Axel Boldt-Christmas (@xmas92 - Reviewer) 🔄 Re-review required (review applies to 18476044)
Contributors
- Axel Boldt-Christmas
<[email protected]>
Reviewing
Using git
Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk.git pull/28328/head:pull/28328
$ git checkout pull/28328
Update a local copy of the PR:
$ git checkout pull/28328
$ git pull https://git.openjdk.org/jdk.git pull/28328/head
Using Skara CLI tools
Checkout this PR locally:
$ git pr checkout 28328
View PR using the GUI difftool:
$ git pr show -t 28328
Using diff file
Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/28328.diff
Using Webrev
Based on https://github.com/openjdk/jdk/compare/master...xmas92:jdk:deferred_icache_invalidation
:wave: Welcome back eastigeevich! A progress list of the required criteria for merging this PR into master will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.
❗ This change is not yet ready to be integrated. See the Progress checklist in the description for automated requirements.
/contributor add xmas92
@eastig xmas92 was not found in the census.
Syntax: /contributor (add|remove) [@user | openjdk-user | Full Name <email@address>]. For example:
- `/contributor add @openjdk-bot`
- `/contributor add duke`
- `/contributor add J. Duke <[email protected]>`
User names can only be used for users in the census associated with this repository. For other contributors you need to supply the full name and email address.
@eastig The following label will be automatically applied to this pull request:
- `hotspot`
When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing list. If you would like to change these labels, use the /label pull request command.
/contributor add aboldtch
@eastig
Contributor Axel Boldt-Christmas <[email protected]> successfully added.
Hi @fisk @theRealAph @xmas92 @shipilev
I created this draft PR based on @xmas92 work https://github.com/openjdk/jdk/compare/master...xmas92:jdk:deferred_icache_invalidation
Axel wrote about his implementation in JDK-8370947:
> The implementation I linked is very aarch64 centric. I would like to create a bit nicer abstraction for this to allow easier adaptation for other platforms.
I see his changes touch other backends. I tried to minimize changes and to avoid touching other backends. I don't think the concept of deferred icache invalidation will be used anywhere but for the Neoverse-N1 erratum.
This PR does not cover all cases in ZGC at the moment. They can be covered as soon as we agree on the proper way to fix this.
I'd like to hear your opinion which way we should choose:
- Abstraction of deferred icache invalidation supported in all backends.
- Concrete implementation focused on Neoverse-N1.
@eastig this pull request can not be integrated into master due to one or more merge conflicts. To resolve these merge conflicts and update this pull request you can run the following commands in the local repository for your personal fork:
git checkout JDK-8370947
git fetch https://git.openjdk.org/jdk.git master
git merge FETCH_HEAD
# resolve conflicts and follow the instructions given by git merge
git commit -m "Merge master"
git push
/issue JDK-8370947
@eastig This issue is referenced in the PR title - it will now be updated.
Webrevs
- 17: Full (fc4bbe9d)
- 16: Full - Incremental (2448b2d7)
- 15: Full - Incremental (44f43e4e)
- 14: Full - Incremental (85691beb)
- 13: Full - Incremental (b9380fd8)
- 12: Full (4b04496f)
- 11: Full - Incremental (8c5ef0e8)
- 10: Full - Incremental (79f9a2a0)
- 09: Full - Incremental (dbeeecf1)
- 08: Full - Incremental (ae3b97e8)
- 07: Full - Incremental (d36be373)
- 06: Full - Incremental (17456558)
- 05: Full - Incremental (42745e56)
- 04: Full - Incremental (e774cda1)
- 03: Full - Incremental (79297bd4)
- 02: Full - Incremental (20480771)
- 01: Full - Incremental (b60317e9)
- 00: Full (18476044)
@fisk, @xmas92 I added a JMH microbenchmark and its results from Graviton 2. I added deferred icache invalidation to ZGC where it's needed.
I found one place where I am not sure: https://github.com/openjdk/jdk/blob/b9ee9541cffb6c5a737b08a69ae04472b3bcab3e/src/hotspot/share/gc/z/zHeapIterator.cpp#L364
```cpp
class ZHeapIteratorNMethodClosure : public NMethodClosure {
private:
  OopClosure* const _cl;
  BarrierSetNMethod* const _bs_nm;

public:
  ZHeapIteratorNMethodClosure(OopClosure* cl)
    : _cl(cl),
      _bs_nm(BarrierSet::barrier_set()->barrier_set_nmethod()) {}

  virtual void do_nmethod(nmethod* nm) {
    // If ClassUnloading is turned off, all nmethods are considered strong,
    // not only those on the call stacks. The heap iteration might happen
    // before the concurrent processing of the code cache, make sure that
    // all nmethods have been processed before visiting the oops.
    _bs_nm->nmethod_entry_barrier(nm);
    ZNMethod::nmethod_oops_do(nm, _cl);
  }
};
```
For `_bs_nm->nmethod_entry_barrier(nm)` we use deferred icache invalidation. I'm not sure it is safe for `ZNMethod::nmethod_oops_do(nm, _cl)`. It looks like it is safe because the code is called at a safepoint: `ZHeap::object_iterate`, `ZHeap::object_and_field_iterate_for_verify` and `ZHeap::parallel_object_iterator`.
@xmas92 The total number of required reviews for this PR (including the jcheck configuration and the last /reviewers command) is now set to 2 (with at least 2 Reviewers).
@fisk, @xmas92 The added microbenchmark shows interesting regressions when an nmethod has no accesses to object's fields:
```
Benchmark Score Error Units
GCPatchingNmethodCost.fullGC:base 73.937 ± 17.764 ms/op
GCPatchingNmethodCost.systemGC:base 77.495 ± 11.963 ms/op
GCPatchingNmethodCost.youngGC:base 9.955 ± 1.649 ms/op
GCPatchingNmethodCost.fullGC:fix 88.865 ± 19.299 ms/op +20.1%
GCPatchingNmethodCost.systemGC:fix 90.572 ± 14.750 ms/op +16.9%
GCPatchingNmethodCost.youngGC:fix 10.219 ± 0.877 ms/op +2.7%
```
I think we'll also want a workaround for `CodeBuffer::relocate_code_to()`.
> I think we'll also want a workaround for `CodeBuffer::relocate_code_to()`.
Also we need to fix `G1NMethodClosure::do_evacuation_and_fixup` and `ShenandoahNMethod::oops_do`. They use `nmethod::fix_oop_relocations`.
Should we do it in this PR or in separate PRs?
> I think we'll also want a workaround for `CodeBuffer::relocate_code_to()`. Also we need to fix `G1NMethodClosure::do_evacuation_and_fixup` and `ShenandoahNMethod::oops_do`. They use `nmethod::fix_oop_relocations`. Should we do it in this PR or in separate PRs?
Please, let's handle it all here.
It seems a little disruptive to have to pass defer_icache_invalidation around so much. What about attaching this information to the Thread or using a THREAD_LOCAL?
@xmas92
> The added microbenchmark shows interesting regressions when an nmethod has no accesses to object's fields:
>
> Benchmark Score Error Units
> GCPatchingNmethodCost.fullGC:base 73.937 ± 17.764 ms/op
> GCPatchingNmethodCost.systemGC:base 77.495 ± 11.963 ms/op
> GCPatchingNmethodCost.youngGC:base 9.955 ± 1.649 ms/op
> GCPatchingNmethodCost.fullGC:fix 88.865 ± 19.299 ms/op +20.1%
> GCPatchingNmethodCost.systemGC:fix 90.572 ± 14.750 ms/op +16.9%
> GCPatchingNmethodCost.youngGC:fix 10.219 ± 0.877 ms/op +2.7%
I think I might have an idea what causes the regressions. I'll be debugging it.
> It seems a little disruptive to have to pass `defer_icache_invalidation` around so much. What about attaching this information to the Thread or using a THREAD_LOCAL?
I switched to a THREAD_LOCAL. Initially it regressed fullGC compared with the version using the parameter:
- Parameter:
  ```
  Benchmark (accessedFieldCount) (methodCount) Mode Cnt Score Error Units
  GCPatchingNmethodCost.fullGC 0 5000 avgt 3 88.865 ± 19.299 ms/op
  GCPatchingNmethodCost.fullGC 2 5000 avgt 3 146.184 ± 11.531 ms/op
  GCPatchingNmethodCost.fullGC 4 5000 avgt 3 186.429 ± 16.257 ms/op
  GCPatchingNmethodCost.fullGC 8 5000 avgt 3 262.933 ± 13.071 ms/op
  ```
- THREAD_LOCAL
  ```
  Benchmark (accessedFieldCount) (methodCount) Mode Cnt Score Error Units
  GCPatchingNmethodCost.fullGC 0 5000 avgt 3 93.899 ± 14.870 ms/op
  GCPatchingNmethodCost.fullGC 2 5000 avgt 3 152.872 ± 13.566 ms/op
  GCPatchingNmethodCost.fullGC 4 5000 avgt 3 194.425 ± 37.851 ms/op
  GCPatchingNmethodCost.fullGC 8 5000 avgt 3 271.826 ± 47.908 ms/op
  ```
I found that `ZBarrierSetAssembler::patch_barrier_relocation` is only used when icache invalidation is deferred. I replaced a check of the thread-local value with a check of `NeoverseN1Errata1542419`. This restored the performance:
```
Benchmark (accessedFieldCount) (methodCount) Mode Cnt Score Error Units
GCPatchingNmethodCost.fullGC 0 5000 avgt 3 84.919 ± 31.411 ms/op
GCPatchingNmethodCost.fullGC 2 5000 avgt 3 141.862 ± 7.026 ms/op
GCPatchingNmethodCost.fullGC 4 5000 avgt 3 184.921 ± 46.592 ms/op
GCPatchingNmethodCost.fullGC 8 5000 avgt 3 263.897 ± 48.271 ms/op
```
It might be that accesses to THREAD_LOCAL on Neoverse N1 are expensive.
Should I try attaching info to Thread?
> Should I try attaching info to Thread?
It probably won't help, because Thread::current() is going to access a thread-local.
If I understand correctly, the whole icache is flushed, so the actual nmethod* is irrelevant. So instead of ICacheInvalidationContext icic(nm) for every different "nm", can't we just do ICacheInvalidationContext icic(true) one time, outside the nmethod loop?
> If I understand correctly, the whole icache is flushed, so the actual nmethod* is irrelevant. So instead of `ICacheInvalidationContext icic(nm)` for every different "nm", can't we just do `ICacheInvalidationContext icic(true)` one time, outside the nmethod loop?
We can't disarm an nmethod before flushing the instructions.
> If I understand correctly, the whole icache is flushed, so the actual nmethod* is irrelevant. So instead of `ICacheInvalidationContext icic(nm)` for every different "nm", can't we just do `ICacheInvalidationContext icic(true)` one time, outside the nmethod loop?
>
> We can't disarm an nmethod before flushing the instructions.
Sure, but you can't patch an nmethod until every thread that might be executing it has stopped. So if the threads are all stopped, why not postpone the disarmament until the end, just before you flush?
> If I understand correctly, the whole icache is flushed, so the actual nmethod* is irrelevant. So instead of `ICacheInvalidationContext icic(nm)` for every different "nm", can't we just do `ICacheInvalidationContext icic(true)` one time, outside the nmethod loop?
>
> We can't disarm an nmethod before flushing the instructions.
I don't think we flush the whole icache. We invalidate all translation entries (all VAs, all possible levels). I have not found any information that it would flush icache. I think TLBI is used as a heavyweight serialization barrier. It might force cores to synchronize their instruction fetch streams. We broadcast a TLB invalidation and wait for its completion. I think hardware data and instruction cache coherence still work.
I also found https://lore.kernel.org/linux-arm-kernel/[email protected]/ with more details on the errata workaround. These details look aligned with the hypothesis of a synchronization event to enforce ordering.
The problem is:
Neoverse-N1 cores with the 'COHERENT_ICACHE' feature may fetch stale instructions when software depends on prefetch-speculation-protection instead of explicit synchronization.
Prefetch-speculation-protection:
JIT can generate new instructions at some new location, then update a branch in the executable instructions to point at the new location.
Prefetch-speculation-protection guarantees that if another CPU sees the new branch, it also sees the new instructions that were written there.
I think that in the case of armed/disarmed nmethods we have explicit synchronization, not prefetch-speculation-protection. No thread executes armed nmethods. If I am correct, disarming is the process of releasing an nmethod to allow its execution.
> Sure, but you can't patch an nmethod until every thread that might be executing it has stopped. So if the threads are all stopped, why not postpone the disarmament until the end, just before you flush?
If my understanding is correct, we cannot disarm before flushing because disarming is like a release of a critical section. We must guarantee all changes we've made are visible to all observers when we leave the critical section.
As I wrote in the JBS issue we can:
- Get all nmethods armed
- Patch all of them
- Invalidate TLB
- Get all nmethods disarmed
This would complicate the fix a lot, and the performance gain from it is not worth the complexity. I measured the theoretical performance when we don't do any invalidation: it's 3%-4% better than the approach in this PR, which invalidates the TLB per nmethod.
Yeah, patching all nmethods as one unit is basically equivalent to making code cache processing a STW operation. The last time we processed the code cache STW was JDK 11, a dark place I don't want to go back to. The code cache can get pretty big and mess up latency. So I'm in favour of limiting the fix and not re-introducing STW code cache processing.
Otherwise, yes, you are correct: we perform synchronous cross-modifying code with no assumptions about instruction cache coherency, because we didn't trust it would actually work for all Arm implementations. Seems like that was a good bet. We still rely on it on x64, though.
It's a bit surprising to me if they invalidate all TLB entries, effectively ripping out the entire virtual address space, even when a range is passed in. If so, a horrible alternative might be to use mprotect to temporarily remove execution permission on the affected per-nmethod pages, detect overshooting in the signal handler, and resume execution once execution privileges are restored immediately after. That should limit the affected VA to close to what is actually invalidated. But it would look horrible.
> It's a bit surprising to me if they invalidate all TLB entries, effectively ripping out the entire virtual address space, even when a range is passed in. If so,

From the kernel patch thread linked earlier:

> "Because the cache-maintenance wasn't needed, we can do the TLBI instead. In fact, the I-Cache line-size isn't relevant anymore, we can reduce the number of traps by producing a fake value."
>
> "For user-space, the kernel's work is now to trap CTR_EL0 to hide DIC, and produce a fake IminLine. EL3 traps the now-necessary I-Cache maintenance and performs the inner-shareable-TLBI that makes everything better."
My interpretation of this is that we only need to do the synchronization dance once, at the end of the patching. But I guess we don't know exactly if we have an affected core or if the kernel workaround is in action.
@xmas92 I fixed the regressions I saw for Java methods without field accesses:
- `-XX:+NeoverseN1Errata1542419`, before the fix:
  ```
  Benchmark (accessedFieldCount) (methodCount) Mode Cnt Score Error Units
  GCPatchingNmethodCost.fullGC 0 5000 avgt 3 88.865 ± 19.299 ms/op
  GCPatchingNmethodCost.systemGC 0 5000 avgt 3 90.572 ± 14.750 ms/op
  GCPatchingNmethodCost.youngGC 0 5000 avgt 3 10.219 ± 0.877 ms/op
  ```
- `-XX:+NeoverseN1Errata1542419`, after the fix:
  ```
  Benchmark (accessedFieldCount) (methodCount) Mode Cnt Score Error Units
  GCPatchingNmethodCost.fullGC 0 5000 avgt 3 60.847 ± 23.735 ms/op
  GCPatchingNmethodCost.systemGC 0 5000 avgt 3 62.338 ± 5.663 ms/op
  GCPatchingNmethodCost.youngGC 0 5000 avgt 3 4.956 ± 1.440 ms/op
  ```
- `-XX:-NeoverseN1Errata1542419`:
  ```
  Benchmark (accessedFieldCount) (methodCount) Mode Cnt Score Error Units
  GCPatchingNmethodCost.fullGC 0 5000 avgt 3 67.144 ± 15.187 ms/op
  GCPatchingNmethodCost.systemGC 0 5000 avgt 3 70.181 ± 30.271 ms/op
  GCPatchingNmethodCost.youngGC 0 5000 avgt 3 7.906 ± 2.118 ms/op
  ```
I'll check SpecJVM as well.