8363620: AArch64: reimplement emit_static_call_stub()

Open fg1417 opened this issue 8 months ago • 45 comments

In the existing implementation, the static call stub typically emits a sequence like: isb; movz; movk; movk; movz; movk; movk; br.

This patch reimplements it using a more compact and patch-friendly sequence:

ldr x12, Label_data
ldr x8, Label_entry
br x8
Label_data:
  0x00000000
  0x00000000
Label_entry:
  0x00000000
  0x00000000

The new approach places the target addresses adjacent to the code and loads them dynamically. This allows us to update the call target by modifying only the data in memory, without changing any instructions. This avoids the need for I-cache flushes or issuing an isb[1], which are both relatively expensive operations.

While emitting direct branches in static stubs for small code caches can save 2 instructions compared to the new implementation, modifying those branches still requires I-cache flushes or an isb. This patch unifies the code generation by emitting the same static stubs for both small and large code caches.
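
To make the data-patching idea concrete, here is a minimal, self-contained C++ sketch (illustrative only; the struct and function names are not HotSpot code) of how retargeting becomes two plain data stores once the addresses live in a data island after the br:

#include <atomic>
#include <cstdint>

// Illustrative layout of the stub's data island: the two 64-bit slots
// (Label_data and Label_entry) that follow the ldr; ldr; br sequence.
struct StaticStubData {
  std::atomic<uint64_t> metadata;  // loaded into x12 (the method)
  std::atomic<uint64_t> entry;     // loaded into x8, then 'br x8'
};

// Retargeting the stub is now two data stores: no instruction bytes
// change, so no I-cache flush or isb is needed on the writer side.
void retarget(StaticStubData* island, uint64_t callee, uint64_t c2i_entry) {
  island->entry.store(c2i_entry, std::memory_order_relaxed);
  // Publish the metadata last with release ordering, so a reader that
  // sees the new metadata also sees the matching entry point. This
  // mirrors the store ordering in set_metadata_and_destination() below.
  island->metadata.store(callee, std::memory_order_release);
}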

A microbenchmark (StaticCallStub.java) demonstrates a performance uplift of approximately 43%; for example, at length 1000 the time per call drops from 39.346 us/op to 22.474 us/op, a 42.9% reduction.

Benchmark                       (length)   Mode   Cnt Master     Patch      Units
StaticCallStubFar.callCompiled    1000     avgt   5   39.346     22.474     us/op
StaticCallStubFar.callCompiled    10000    avgt   5   390.05     218.478    us/op
StaticCallStubFar.callCompiled    100000   avgt   5   3869.264   2174.001   us/op
StaticCallStubNear.callCompiled   1000     avgt   5   39.093     22.582     us/op
StaticCallStubNear.callCompiled   10000    avgt   5   387.319    217.398    us/op
StaticCallStubNear.callCompiled   100000   avgt   5   3855.825   2206.923   us/op

All tests in Tier1 to Tier3, under both release and debug builds, have passed.

[1] https://community.arm.com/arm-community-blogs/b/architectures-and-processors-blog/posts/caches-self-modifying-code-working-with-threads


Progress

  • [x] Change must be properly reviewed (1 review required, with at least 1 Reviewer)
  • [x] Change must not contain extraneous whitespace
  • [x] Commit message must refer to an issue

Issue

  • JDK-8363620: AArch64: reimplement emit_static_call_stub() (Enhancement - P4)


Reviewing

Using git

Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk.git pull/26638/head:pull/26638
$ git checkout pull/26638

Update a local copy of the PR:
$ git checkout pull/26638
$ git pull https://git.openjdk.org/jdk.git pull/26638/head

Using Skara CLI tools

Checkout this PR locally:
$ git pr checkout 26638

View PR using the GUI difftool:
$ git pr show -t 26638

Using diff file

Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/26638.diff

Using Webrev

Link to Webrev Comment

fg1417 avatar Aug 05 '25 10:08 fg1417

👋 Welcome back fgao! A progress list of the required criteria for merging this PR into master will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.

bridgekeeper[bot] avatar Aug 05 '25 10:08 bridgekeeper[bot]

@fg1417 This change now passes all automated pre-integration checks.

ℹ️ This project also has non-automated pre-integration requirements. Please see the file CONTRIBUTING.md for details.

After integration, the commit message for the final commit will be:

8363620: AArch64: reimplement emit_static_call_stub()

Co-authored-by: Andrew Haley <[email protected]>
Reviewed-by: aph

You can use pull request commands such as /summary, /contributor and /issue to adjust it as needed.

At the time when this comment was updated there had been 24 new commits pushed to the master branch:

  • 1f49edd9783ed4579d989d6939ee75e926f0716a: 4459231: Focus of JTabbedPane(with Scrollable tablayout) changes on change in LookAndFeel
  • 786833cd1bf8eda1cef25da392a055f4eb371abf: 8373022: serviceability/sa/ClhsdbScanOops.java assumes no GC should occur
  • 9c91c68d1d5938d7e2b9a90c82b0a36ef1a063cd: 8373111: Test java/lang/management/MemoryMXBean/MemoryManagement.java timed out
  • ... and 21 more: https://git.openjdk.org/jdk/compare/7da91533aaf2033cedee6e2a56fb693f26909df5...master

As there are no conflicts, your changes will automatically be rebased on top of these commits when integrating. If you prefer to avoid this automatic rebasing, please check the documentation for the /integrate command for further details.

➡️ To integrate this PR with the above commit message to the master branch, type /integrate in a new comment.

openjdk[bot] avatar Aug 05 '25 10:08 openjdk[bot]

@fg1417 The following label will be automatically applied to this pull request:

  • hotspot

When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing list. If you would like to change these labels, use the /label pull request command.

openjdk[bot] avatar Aug 05 '25 10:08 openjdk[bot]

In the past we would have scheduled the load for x8 earlier to avoid a latency hit for using the value in the next instruction:

ldr x8, Label_entry
ldr x12, Label_data
br x8

Is this still helpful on modern aarch64 hardware?

dean-long avatar Aug 05 '25 21:08 dean-long

In the past we would have scheduled the load for x8 earlier to avoid a latency hit for using the value in the next instruction:

ldr x8, Label_entry
ldr x12, Label_data
br x8

Is this still helpful on modern aarch64 hardware?

Maybe on something very small, such as the Cortex-A34. It doesn't hurt larger out-of-order cores, so I agree with you that we should reorder this.

theRealAph avatar Aug 06 '25 08:08 theRealAph

Try this. It might be enough to rescue this PR.

diff --git a/src/hotspot/cpu/aarch64/c1_LIRAssembler_aarch64.hpp b/src/hotspot/cpu/aarch64/c1_LIRAssembler_aarch64.hpp
index 12b941fc4f7..29853ed4a10 100644
--- a/src/hotspot/cpu/aarch64/c1_LIRAssembler_aarch64.hpp
+++ b/src/hotspot/cpu/aarch64/c1_LIRAssembler_aarch64.hpp
@@ -68,8 +68,8 @@ friend class ArrayCopyStub;
 
   enum {
     // call stub: CompiledDirectCall::to_interp_stub_size() +
-    //            CompiledDirectCall::to_trampoline_stub_size()
-    _call_stub_size = 13 * NativeInstruction::instruction_size,
+    //            CompiledDirectCall::to_trampoline_stub_size() + alignment nop
+    _call_stub_size = 15 * NativeInstruction::instruction_size,
     _exception_handler_size = DEBUG_ONLY(1*K) NOT_DEBUG(175),
     _deopt_handler_size = 7 * NativeInstruction::instruction_size
   };
diff --git a/src/hotspot/cpu/aarch64/macroAssembler_aarch64.cpp b/src/hotspot/cpu/aarch64/macroAssembler_aarch64.cpp
index 4285524514b..63577a55809 100644
--- a/src/hotspot/cpu/aarch64/macroAssembler_aarch64.cpp
+++ b/src/hotspot/cpu/aarch64/macroAssembler_aarch64.cpp
@@ -983,22 +983,29 @@ void MacroAssembler::emit_static_call_stub() {
   // Jump to the entry point of the c2i stub.
   const int stub_start_offset = offset();
   Label far_jump_metadata, far_jump_entry;
-  ldr(rmethod, far_jump_metadata);
+  {
+    Label retry; bind(retry);
+    adr(rmethod, far_jump_metadata);
+    ldar(rmethod, (rmethod));
+    cbz(rmethod, retry);
+  }
   ldr(rscratch1, far_jump_entry);
   br(rscratch1);
+  nop();
   bind(far_jump_metadata);
-  assert(offset() - stub_start_offset == NativeStaticCallStub::far_jump_metadata_offset,
+  assert(offset() - stub_start_offset == NativeStaticCallStub::metadata_offset,
          "should be");
+  assert(is_aligned(offset(), BytesPerWord), "offset is misaligned");
   emit_int64(0);
   bind(far_jump_entry);
-  assert(offset() - stub_start_offset == NativeStaticCallStub::far_jump_entrypoint_offset,
+  assert(offset() - stub_start_offset == NativeStaticCallStub::entrypoint_offset,
          "should be");
   emit_int64(0);
 }
 
 int MacroAssembler::static_call_stub_size() {
-  // ldr; ldr; br; zero; zero; zero; zero;
-  return 7 * NativeInstruction::instruction_size;
+  // adr; ldar; cbz; ldr; br; nop; zero; zero; zero; zero;
+  return 10 * NativeInstruction::instruction_size;
 }
 
 void MacroAssembler::c2bool(Register x) {
diff --git a/src/hotspot/cpu/aarch64/nativeInst_aarch64.cpp b/src/hotspot/cpu/aarch64/nativeInst_aarch64.cpp
index 12a6eb6f7f0..5d33d94f1ae 100644
--- a/src/hotspot/cpu/aarch64/nativeInst_aarch64.cpp
+++ b/src/hotspot/cpu/aarch64/nativeInst_aarch64.cpp
@@ -393,22 +393,20 @@ void NativeCall::trampoline_jump(CodeBuffer &cbuf, address dest, JVMCI_TRAPS) {
 #ifndef PRODUCT
 // Mirror the logic in CompiledDirectCall::verify_mt_safe().
 void NativeStaticCallStub::verify_static_stub(const methodHandle& callee, address entry) {
-  intptr_t metadata = intptr_at(far_jump_metadata_offset);
-  address entrypoint = ptr_at(far_jump_entrypoint_offset);
+  intptr_t metadata = intptr_at(metadata_offset);
+  address entrypoint = ptr_at(entrypoint_offset);
   CompiledDirectCall::verify_mt_safe_helper(callee, entry, metadata, entrypoint);
 }
 #endif
 
 void NativeStaticCallStub::set_metadata_and_destination(intptr_t callee, address entry) {
-  set_intptr_at(far_jump_metadata_offset, callee);
-  set_ptr_at(far_jump_entrypoint_offset, entry);
-  OrderAccess::release();
+  set_ptr_at(entrypoint_offset, entry);
+  Atomic::release_store((intptr_t*)addr_at(metadata_offset), callee);
 }
 
 void NativeStaticCallStub::verify_instruction_sequence() {
-  if (! (nativeInstruction_at(addr_at(0))->is_ldr_literal() &&
-         nativeInstruction_at(addr_at(NativeInstruction::instruction_size))->is_ldr_literal() &&
-         nativeInstruction_at(addr_at(NativeInstruction::instruction_size * 2))->is_blr())) {
+  if (! (nativeInstruction_at(addr_at(NativeInstruction::instruction_size * 3))->is_ldr_literal() &&
+         nativeInstruction_at(addr_at(NativeInstruction::instruction_size * 4))->is_blr())) {
     fatal("Not expected instructions in static call stub");
   }
 }
diff --git a/src/hotspot/cpu/aarch64/nativeInst_aarch64.hpp b/src/hotspot/cpu/aarch64/nativeInst_aarch64.hpp
index 17c87bf7c2b..897152f26f1 100644
--- a/src/hotspot/cpu/aarch64/nativeInst_aarch64.hpp
+++ b/src/hotspot/cpu/aarch64/nativeInst_aarch64.hpp
@@ -461,8 +461,8 @@ class NativeStaticCallStub : public NativeInstruction {
 public:
 
   enum AArch64_specific_constants {
-    far_jump_metadata_offset      =   3 * 4,
-    far_jump_entrypoint_offset    =   5 * 4
+    metadata_offset      =   6 * NativeInstruction::instruction_size,
+    entrypoint_offset    =   6 * NativeInstruction::instruction_size + wordSize ,
   };
 
   void set_metadata_and_destination(intptr_t callee, address entry);

theRealAph avatar Aug 09 '25 19:08 theRealAph

There may still be a race in set_to_clean(), which doesn't zero out the [method, entry] fields in the stub. I'd fix that, to be safe. It'd be tricky to test.

theRealAph avatar Aug 09 '25 19:08 theRealAph

There may still be a race in set_to_clean(), which doesn't zero out the [method, entry] fields in the stub. I'd fix that, to be safe. It'd be tricky to test.

Ah no, that would create another race when unloading. This stuff is hard...

theRealAph avatar Aug 10 '25 07:08 theRealAph

There may still be a race in set_to_clean(), which doesn't zero out the [method, entry] fields in the stub. I'd fix that, to be safe. It'd be tricky to test.

It appears that CompiledDirectCall::set_stub_to_clean is currently unused. I would suggest either leaving it as-is or, even better, removing it entirely, to avoid the risk of someone using it in the future, since it's not safe under all conditions.

For example, consider a situation in which a thread is preempted by the OS while executing a static stub. If another thread clears that static stub before the preempted thread resumes, the first thread might execute a partially cleared sequence of movk/movz instructions.

Here's a simplified scenario:

Thread A          Thread B
isb               -
movk              -
movz              -
PREEMPTED         Clears the static stub
movz (cleared)    -
...               ...

mikabl-arm avatar Aug 12 '25 13:08 mikabl-arm

There may still be a race in set_to_clean(), which doesn't zero out the [method, entry] fields in the stub. I'd fix that, to be safe. It'd be tricky to test.

It appears that CompiledDirectCall::set_stub_to_clean is currently unused.

My mistake. I should have written CompiledIC::set_to_clean(), which is used.

I would suggest either leaving it as-is or even better removing it entirely to avoid the risk of someone using it in the future since it's not safe under all conditions.

For example, consider a situation when a system thread is preempted by the OS while executing a static stub. If another thread clears that static stub before the preempted thread resumes, the first thread might execute a partially cleared sequence of movk/movz instructions.

Indeed.

In fact, it's not safe to clear the stub in any case because stubs are shared between calls to the same method.

theRealAph avatar Aug 12 '25 13:08 theRealAph

My mistake. I should have written CompiledIC::set_to_clean(), which is used.

Doesn't CompiledIC::set_to_clean() handle virtual call sites? If you referred to CompiledDirectCall::set_to_clean() instead, I believe no miscommunication happened here. What I meant is that CompiledDirectCall::set_to_clean() must not zero out the [method, entry] fields in the stub by calling CompiledDirectCall::set_stub_to_clean(). The latter is currently unused; should it perhaps be removed entirely?

mikabl-arm avatar Aug 12 '25 14:08 mikabl-arm

@fg1417 This pull request has been inactive for more than 4 weeks and will be automatically closed if another 4 weeks passes without any activity. To avoid this, simply issue a /touch or /keepalive command to the pull request. Feel free to ask for assistance if you need help with progressing this pull request towards integration!

bridgekeeper[bot] avatar Sep 09 '25 19:09 bridgekeeper[bot]

@fg1417 This pull request has been inactive for more than 8 weeks and will now be automatically closed. If you would like to continue working on this pull request in the future, feel free to reopen it! This can be done using the /open pull request command.

bridgekeeper[bot] avatar Oct 08 '25 01:10 bridgekeeper[bot]

/open

fg1417 avatar Oct 27 '25 10:10 fg1417

@fg1417 This pull request is now open

openjdk[bot] avatar Oct 27 '25 10:10 openjdk[bot]

/open

Why? Do you have any reasonable expectation of a better way to do it?

theRealAph avatar Oct 27 '25 10:10 theRealAph

Hi @theRealAph, I’ve been thinking more about this topic and want to share a few updated observations.

In the current implementation of the static call stub, the executing thread runs the following instructions:

[main code]
L0:
  bl trampoline_stub
  ... {post call}

trampoline_stub:
  ldr  x8, callee_address
  br   x8
callee_address:
  0x12345678
  0x12345678

static_stub:
  isb
  mov    x12, #0x0
  movk   x12, #0x0, lsl #16
  movk   x12, #0x0, lsl #32
  mov    x8, #0x0
  movk   x8, #0x0, lsl #16
  movk   x8, #0x0, lsl #32
  br     x8

The writing thread performs the following steps:

  1. Updates the MOV instructions in static_stub with new values (a standalone sketch of this immediate patching follows the list).
  2. Calls ICache::invalidate_range().
  3. Writes a bl static_stub instruction at L0. (Note: both static_stub and trampoline_stub reside in the stub section, which is directly reachable by bl.)
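
As a concrete illustration of step 1, here is a standalone C++ sketch (illustrative only, not HotSpot code): on AArch64 the 16-bit immediate of movz/movk sits in bits 20:5 of the instruction word, so rewriting the triple is three masked word updates.

#include <cstdint>
#include <cstdio>

// Rewrite the imm16 fields (bits 20:5) of a movz/movk/movk triple that
// materializes a 48-bit value 16 bits at a time.
void patch_mov_triple(uint32_t insn[3], uint64_t value) {
  for (int i = 0; i < 3; i++) {
    uint32_t imm16 = (uint32_t)((value >> (16 * i)) & 0xffff);
    insn[i] = (insn[i] & ~(0xffffu << 5)) | (imm16 << 5);
  }
}

int main() {
  // movz x12, #0; movk x12, #0, lsl #16; movk x12, #0, lsl #32
  uint32_t code[3] = {0xd280000c, 0xf2a0000c, 0xf2c0000c};
  patch_mov_triple(code, 0x0000123456789abcULL);
  for (uint32_t w : code) {
    std::printf("%08x\n", w);  // patched instruction words
  }
  return 0;
}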

In the existing implementation, when the writing thread writes the bl static_stub instruction at L0, it also updates the callee_address in trampoline_stub to the address of static_stub. See https://github.com/openjdk/jdk/blob/cc9483b4da1a0f65f8773d0c7f35f2e6a7e1bd4f/src/hotspot/cpu/aarch64/nativeInst_aarch64.cpp#L92

If we modify the code slightly as follows:

  // Patch the call.
  if (reachable) {
    set_destination(dest);
  } else {
    // Patch the constant in the call's trampoline stub.
    address trampoline_stub_addr = get_trampoline();
    if (trampoline_stub_addr != nullptr) {
      assert (! is_NativeCallTrampolineStub_at(dest), "chained trampolines");
      nativeCallTrampolineStub_at(trampoline_stub_addr)->set_destination(dest);
    }
    assert (trampoline_stub_addr != nullptr, "we need a trampoline");
    set_destination(trampoline_stub_addr);
  }

Then the callee_address in trampoline_stub would no longer be expected to point to static_stub.

According to an older version of the Arm ARM (https://developer.arm.com/documentation/ddi0487/ia), section A2.2.2, there is an architectural guarantee called Prefetch Speculation Protection (PSP).

From the Arm community blog: https://developer.arm.com/community/arm-community-blogs/b/architectures-and-processors-blog/posts/caches-self-modifying-code-working-with-threads:

Prefetch speculation protection is an architectural guarantee that makes some minimally-synchronized code updates possible.

“Prefetch speculation protection” is a name from old editions of the Arm ARM (such as DDI0487 I.a). In recent versions, the required behaviours are covered by the formal concurrency model (such as section B2.3 of the latest DDI0487), but this specific set of properties no longer has its own name.

Essentially, when a writing thread rewrites a direct branch with an updated direct branch, and another thread is concurrently executing that modified code, PSP ensures that the executing thread does not fetch stale instructions.

According to section B2.3.9, Ordering of instruction fetches, of the specification: if we update the MOVs in static_stub, and ensure coherence between the data writes and instruction fetches within the same shareability domain before writing bl static_stub at L0, then for any observer an instruction read from L0 appears in program order before an instruction fetched from static_stub. So if the instruction at L0 contains the updated direct branch, the subsequent fetch from static_stub will see the updated MOV values.

These two properties imply that even without the isb in static_stub, if the executing thread fetches the updated direct branch at L0 that jumps to static_stub, PSP guarantees it won’t execute stale MOVs from static_stub.

A straightforward way to confirm this behavior is to run a litmus test in herd7.

The author of the blog post, @jacobbramley, also shared an example litmus test:

AArch64 PrefetchSpeculationProtection

(* Copyright 2025 Arm Limited *)

(*
   A canonical example of Prefetch Speculation Protection. P0 writes (and
   flushes) some code at P1's `new`, then rewrites P1's BL to point to it.
  
   If P1 observes the new BL before it executes, then PSP guarantees that it
   also executes the replaced `new` code.
  
   If P1 doesn't observe the new BL, it calls `old`, which hasn't changed.
*)

{
0:X0=instr:"MOV w0, #2";
0:X1=instr:"BL .+16";
0:X10=P1:new;
0:X11=P1:L0;
}

P0              |  P1            ;
STR W0, [X10]   |L0:             ;
DC CVAU, X10    |  BL old        ;
DSB ISH         |  B end         ;
IC IVAU, X10    |old:            ;
DSB ISH         |  MOV w0, #0    ;
                |  RET           ;
STR W1, [X11]   |new:            ;
                |  MOV w0, #1    ;
                |  RET           ;
                |end:            ;
exists(1:X0=0 \/ 1:X0=2)

Running it with herd7 produces the following output:

Test PrefetchSpeculationProtection Allowed
States 2
1:X0=0;
1:X0=2;
Ok
Witnesses
Positive: 2 Negative: 0
Flag Assuming-common-inner-shareable-domain
Flag Assuming-no-two-modified-instructions-are-on-the-same-cache-line
Condition exists (1:X0=0 \/ 1:X0=2)
Observation PrefetchSpeculationProtection Always 2 0

This result confirms that Prefetch Speculation Protection still applies under the formal model, and also shows that even without the isb at Label new, if the executing thread fetches the updated branch at L0, PSP guarantees it won't execute stale MOVs from Label new.

Therefore, it suggests that even without an isb in static_stub, if the executing thread observes the updated BL instruction first, PSP ensures that it will also fetch and execute the updated MOV instructions in static_stub.

What do you think? Really appreciate any feedback! Thanks!

fg1417 avatar Oct 27 '25 10:10 fg1417

Then the callee_address in trampoline_stub would no longer be expected to point to static_stub.

This is to ensure that the patched code is only reachable via a direct jump, I guess?

theRealAph avatar Oct 27 '25 11:10 theRealAph

According to an older version of the Arm ARM (https://developer.arm.com/documentation/ddi0487/ia), section A2.2.2, there is an architectural guarantee called Prefetch Speculation Protection (PSP).

From the Arm community blog: https://developer.arm.com/community/arm-community-blogs/b/architectures-and-processors-blog/posts/caches-self-modifying-code-working-with-threads:

Yes, of course we're aware of that guarantee: without it, it would not be possible to JIT-compile a new method, then branch to it from another method. JIT compilation wouldn't be possible.

[If I recall correctly, this change to the Arm ARM in part resulted from a conversation between us and Will Deacon.]

theRealAph avatar Oct 27 '25 11:10 theRealAph

OK, this one looks way more practical than the previous version of this PR. As long as we still use only instruction fetches in the stub, it ought to work.

The correctness of this is rather subtle, however, and the change to set_destination_mt_safe is downright obscure.

It perhaps would be clearer to leave the first instruction in the call stub as an ISB. When we patched the call stub, we'd change it to branch+1. That might work as well, but be simpler and much less obscure.

theRealAph avatar Oct 27 '25 12:10 theRealAph

@theRealAph Thanks so much for your quick feedback!

Then the callee_address in trampoline_stub would no longer be expected to point to static_stub.

This is to ensure that the patched code is only reachable via a direct jump, I guess?

Yes, that’s exactly what I mean.

OK, this one looks way more practical than the previous version of this PR. As long as we still use only instruction fetches in the stub, it ought to work.

The correctness of this is rather subtle, however, and the change to set_destination_mt_safe is downright obscure.

It perhaps would be clearer to leave the first instruction in the call stub as an ISB. When we patched the call stub, we'd change it to branch+1. That might work as well, but be simpler and much less obscure.

Let me confirm if I’ve understood you correctly. Initially, the static call stub looks like this:

[main code]
L0:
  bl trampoline_stub
  ... {post call}

static_stub:
  isb
  mov    x12, #0x0
  movk   x12, #0x0, lsl #16
  movk   x12, #0x0, lsl #32
  mov    x8, #0x0
  movk   x8, #0x0, lsl #16
  movk   x8, #0x0, lsl #32
  br     x8

After the stub is patched, it becomes:

[main code]
L0:
  bl static_stub
  ... {post call}

static_stub:
  b entry_point
entry_point:
  mov    x12, #0x0
  movk   x12, #0x0, lsl #16
  movk   x12, #0x0, lsl #32
  mov    x8, #0x0
  movk   x8, #0x0, lsl #16
  movk   x8, #0x0, lsl #32
  br     x8

fg1417 avatar Oct 27 '25 14:10 fg1417

@fg1417 This pull request has been inactive for more than 4 weeks and will be automatically closed if another 4 weeks passes without any activity. To avoid this, simply issue a /touch or /keepalive command to the pull request. Feel free to ask for assistance if you need help with progressing this pull request towards integration!

bridgekeeper[bot] avatar Nov 24 '25 22:11 bridgekeeper[bot]

This one has gone very quiet.

theRealAph avatar Nov 25 '25 12:11 theRealAph

@theRealAph sorry for the delay!

I’ve updated the patch with a new commit. In the meantime, we also found that patching isb to nop can work the same as patching isb to b .+4. I’ve summarized my understanding in the code comments. Could you please take another look? Thank you for your time!

fg1417 avatar Nov 28 '25 10:11 fg1417

I’ve updated the patch with a new commit. In the meantime, we also found that patching isb to nop can work the same as patching isb to b .+4.

Please don't. The slow path via an indirect jump to the static call stub is not important. It makes no sense to use the Arm memory model in such a subtle way merely to avoid a perfectly-predicted rare branch. The time taken to understand such trickery by reviewers and future maintainers is not justified by the minuscule performance gain.

"Debugging is twice as hard as writing the code in the first place. Therefore, if you write the code as cleverly as possible, you are, by definition, not smart enough to debug it." - Brian Kernighan

theRealAph avatar Nov 28 '25 15:11 theRealAph

I’ve updated the patch with a new commit. In the meantime, we also found that patching isb to nop can work the same as patching isb to b .+4.

Please don't. The slow path via an indirect jump to the static call stub is not important. It makes no sense to use the Arm memory model in such a subtle way merely to avoid a perfectly-predicted rare branch. The time taken to understand such trickery by reviewers and future maintainers is not justified by the minuscule performance gain.

"Debugging is twice as hard as writing the code in the first place. Therefore, if you write the code as cleverly as possible, you are, by definition, not smart enough to debug it." - Brian Kernighan

@theRealAph Thanks for your valuable input!

I’m trying to clarify how I understand your idea of patching isb to b .+4. I agree that the correctness of removing the isb in the stub is quite subtle, and the change to set_destination_mt_safe is rather obscure. Because of this, I initially assumed that the motivation for patching isb to b .+4 was to avoid introducing a tricky change to set_destination_mt_safe, thereby keeping the PR self-contained. Based on that understanding, I thought: if the main reason for patching isb to b .+4 is to avoid additional complexity, and patching isb to nop can provide the same effect, then why not patch it to nop instead? That’s what led me to propose the new commit.

However, now I’m a bit confused, and I think I may not have fully understood your original intent behind patching isb to b .+4. In your original idea, were you suggesting a change like this:

diff --git a/src/hotspot/cpu/aarch64/compiledIC_aarch64.cpp b/src/hotspot/cpu/aarch64/compiledIC_aarch64.cpp
index 6fe3315014b..9bf29ff2b55 100644
--- a/src/hotspot/cpu/aarch64/compiledIC_aarch64.cpp
+++ b/src/hotspot/cpu/aarch64/compiledIC_aarch64.cpp
@@ -100,6 +100,11 @@ void CompiledDirectCall::set_to_interpreted(const methodHandle& callee, address
   method_holder->set_data((intptr_t)callee());
   MacroAssembler::pd_patch_instruction(method_holder->next_instruction_address(), entry);
   ICache::invalidate_range(stub, to_interp_stub_size());
+  // Patch 'isb' to 'b .+4'.
+  CodeBuffer stub_first_instruction(stub, Assembler::instruction_size);
+  Assembler assembler(&stub_first_instruction);
+  assembler.b(assembler.pc() + 4);
+
   // Update jump to call.
   set_destination_mt_safe(stub);
 }
@@ -109,6 +114,10 @@ void CompiledDirectCall::set_stub_to_clean(static_stub_Relocation* static_stub)
   address stub = static_stub->addr();
   assert(stub != nullptr, "stub not found");
   assert(CompiledICLocker::is_safe(stub), "mt unsafe call");
+  // Patch 'b .+4' to 'isb'.
+  CodeBuffer stub_first_instruction(stub, Assembler::instruction_size);
+  Assembler assembler(&stub_first_instruction);
+  assembler.isb();
   // Creation also verifies the object.
   NativeMovConstReg* method_holder
     = nativeMovConstReg_at(stub + NativeInstruction::instruction_size);

Thank you for your time!

fg1417 avatar Nov 28 '25 17:11 fg1417

I think I'd do something like this. It does mean that we're executing an unnecessary jump+1 when we jump directly to the stub, but it maintains the invariant that the trampoline destination and the call destination are the same, so it does not matter how a call reaches the static call stub. I think this invariant is worth keeping.

Remember that we're jumping from compiled code to the interpreter, which does thousands of jumps! A single extra well-predicted branch won't hurt.

We can't use Assembler to do this patching because it's not atomic.

diff --git a/src/hotspot/cpu/aarch64/compiledIC_aarch64.cpp b/src/hotspot/cpu/aarch64/compiledIC_aarch64.cpp
index 3f3b8d28408..87887bb0a25 100644
--- a/src/hotspot/cpu/aarch64/compiledIC_aarch64.cpp
+++ b/src/hotspot/cpu/aarch64/compiledIC_aarch64.cpp
@@ -168,12 +168,16 @@ void CompiledDirectCall::set_to_interpreted(const methodHandle& callee, address
   //                             |  B end                    ;
   //                             |end:                       ;
   // forall (1:X0=1 \/ 1:X0=3)

-  CodeBuffer stub_first_instruction(stub, Assembler::instruction_size);
-  Assembler assembler(&stub_first_instruction);
-  assembler.nop();
+
+  NativeJump::insert(stub, stub + NativeJump::instruction_size);
+
+  address trampoline_stub_addr = _call->get_trampoline();
+  if (trampoline_stub_addr != nullptr) {
+    nativeCallTrampolineStub_at(trampoline_stub_addr)->set_destination(stub);
+  }
 
   // Update jump to call.
-  set_destination_mt_safe(stub);
+  _call->set_destination(stub);
 }
 
 void CompiledDirectCall::set_stub_to_clean(static_stub_Relocation* static_stub) {
diff --git a/src/hotspot/cpu/aarch64/nativeInst_aarch64.cpp b/src/hotspot/cpu/aarch64/nativeInst_aarch64.cpp
index f2003dd9b55..22e7dcc2552 100644
--- a/src/hotspot/cpu/aarch64/nativeInst_aarch64.cpp
+++ b/src/hotspot/cpu/aarch64/nativeInst_aarch64.cpp
@@ -238,6 +238,14 @@ void NativeJump::set_jump_destination(address dest) {
   ICache::invalidate_range(instruction_address(), instruction_size);
 };
 
+// Atomic insertion of jump to target.
+void NativeJump::insert(address code_pos, address target) {
+  intptr_t offset = target - code_pos;
+  uint32_t insn = 0b000101 << 26;
+  Instruction_aarch64::spatch((address)&insn, 25, 0, offset >> 2);
+  AtomicAccess::store((volatile uint32_t*)code_pos, insn);
+}
+  
 //-------------------------------------------------------------------
 
 address NativeGeneralJump::jump_destination() const {
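
As a standalone sanity check of the branch encoding used in NativeJump::insert above (illustrative code, not part of the patch): AArch64 B <label> has opcode 0b000101 in bits 31:26 and a signed 26-bit word offset in bits 25:0, which is what the opcode shift and the spatch of offset >> 2 produce.

#include <cassert>
#include <cstdint>
#include <cstdio>

// Encode 'B <label>' for a byte offset relative to the branch itself.
uint32_t encode_b(int64_t byte_offset) {
  assert(byte_offset % 4 == 0);  // instructions are word-aligned
  uint32_t imm26 = (uint32_t)((byte_offset >> 2) & 0x03ffffff);
  return (0x5u << 26) | imm26;   // 0b000101 << 26
}

int main() {
  std::printf("%#x\n", encode_b(4));   // 'b .+4' -> 0x14000001
  std::printf("%#x\n", encode_b(-4));  // 'b .-4' -> 0x17ffffff
  return 0;
}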

theRealAph avatar Nov 30 '25 11:11 theRealAph

@theRealAph thanks a lot for your explanation! I'll update it soon.

fg1417 avatar Nov 30 '25 13:11 fg1417

Your benchmark doesn't work. Please fix it.

theRealAph avatar Nov 30 '25 17:11 theRealAph