ndk icon indicating copy to clipboard operation
ndk copied to clipboard

[Bug]: Android NDK r27 rc2 miscompiles code with indirect gotos

Open SanjaLV opened this issue 1 year ago • 18 comments

Description

Android NDK r27 rc 2 produces invalid code for any target architecture when compiling with any nonzero optimization level.

Bellow is attached minimized/striped sample that shows the problem (when targeting x86_64 with -O1)

extern int printf(const char *fmt, ...);

int main() {
  void* bytecode[2];
  bytecode[0] = &&VM__OP_1;
  bytecode[1] = &&VM__TERMINATE;

  int state = 0;
  int index = 0;

  while (1) {
    switch (state) {
    case 0:
      goto *bytecode[index];
    case 1:
      // NOTE: THIS IS ONLY REACHABLE VIA INDIRECT GOTOS
      VM__OP_1:
      state = 2;
      break;
    case 2:
      printf("OP_1:(instruction=%d)\n", index);
      index++;
      goto *bytecode[index];
    }
  }

VM__TERMINATE:
  printf("TERMINATE:(instruction=%d)\n", index);
  return 0;
}

Link to github project: https://github.com/SanjaLV/ndk-bug-reports/tree/main/r27_rc2

Prerequisites:

  1. Linux/macOS machine
  2. ANDROID_HOME env variable that will point to Android SDK root.
  3. ndk;26.3.11579264 / ndk;27.0.11902837 installed with SDK manager.

How to reproduce (invalid code):

  1. Run make local and observe correct behavior with system compiler
  2. Run make r26 and observe correct behavior when compiling with NDK r26d
  3. Run make r27 and observe incorrect program behavior.
  4. Run optnone and observe correct behavior with O0 optimization level.

Correct execution should yield the following output:

OP_1:(instruction=0)
TERMINATE:(instruction=1)

Incorrect NDK r27 execution results in the following output:

TERMINATE:(instruction=0)

Context:

Originally discovered that upgrading NDK from r26d to r27 r1/rc2 broke state-machine like bytecode interpreter. After some investigation, we found out that bug appears if and only if we enable INDIRECT GOTO optimizations.

Feel free to ask for more information.

Many thanks, Aleksandrs

Upstream bug

No response

Commit to cherry-pick

No response

Affected versions

r27

Canary version

No response

Host OS

Linux

Host OS version

Ubuntu 22.04

Affected ABIs

armeabi-v7a, arm64-v8a, x86, x86_64

SanjaLV avatar Jul 11 '24 11:07 SanjaLV

miscompilation is obvious if you inspect generated llvm bitcode:

27.0.11902837/toolchains/llvm/prebuilt/linux-x86_64/bin/clang -O1 test.c -S -emit-llvm & cat test.ll

...

; Function Attrs: nofree nounwind uwtable
define dso_local i32 @main() local_unnamed_addr #0 {
  %1 = tail call i32 (ptr, ...) @printf(ptr noundef nonnull dereferenceable(1) @.str.1, i32 noundef 0)
  ret i32 0
}

...

SanjaLV avatar Jul 11 '24 11:07 SanjaLV

Can't repro on clang upstream https://godbolt.org/z/d9bGnhhv6 so the fix is likely to be in upstream.

appujee avatar Jul 11 '24 19:07 appujee

The bug is in SimplifyCFGPass

Before

define dso_local noundef i32 @main() #0 {
  %1 = alloca [2 x ptr], align 16
  store ptr blockaddress(@main, %7), ptr %1, align 16, !tbaa !5
  %2 = getelementptr inbounds [2 x ptr], ptr %1, i64 0, i64 1
  store ptr blockaddress(@main, %15), ptr %2, align 8, !tbaa !5
  br label %3

3:                                                ; preds = %0, %12
  %4 = phi i32 [ 0, %0 ], [ %13, %12 ]
  %5 = phi i32 [ 0, %0 ], [ %14, %12 ]
  switch i32 %4, label %12 [
    i32 0, label %6
    i32 1, label %7
    i32 2, label %9
  ]

6:                                                ; preds = %3
  br label %17

7:                                                ; preds = %3, %17
  %8 = phi i32 [ %18, %17 ], [ %5, %3 ]
  %8 = phi i32 [ %18, %17 ], [ %5, %3 ]
  br label %12

9:                                                ; preds = %3
  %10 = call i32 (ptr, ...) @printf(ptr noundef nonnull dereferenceable(1) @.str, i32 noundef %5)
  %11 = add nsw i32 %5, 1
  br label %17

12:                                               ; preds = %3, %7
  %13 = phi i32 [ %4, %3 ], [ 2, %7 ]
  %14 = phi i32 [ %5, %3 ], [ %8, %7 ]
  br label %3, !llvm.loop !9

15:                                               ; preds = %17
  %16 = call i32 (ptr, ...) @printf(ptr noundef nonnull dereferenceable(1) @.str.1, i32 noundef %
  ret i32 0

17:                                               ; preds = %9, %6
  %18 = phi i32 [ %11, %9 ], [ %5, %6 ]
  %19 = sext i32 %18 to i64
  %20 = getelementptr inbounds [2 x ptr], ptr %1, i64 0, i64 %19
  %21 = load ptr, ptr %20, align 8, !tbaa !5
  indirectbr ptr %21, [label %7, label %15]
}

After

; *** IR Dump After SimplifyCFGPass on main ***
; Function Attrs: mustprogress norecurse uwtable
define dso_local noundef i32 @main() #0 {
  %1 = alloca [2 x ptr], align 16
  store ptr blockaddress(@main, %6), ptr %1, align 16, !tbaa !5
  %2 = getelementptr inbounds [2 x ptr], ptr %1, i64 0, i64 1
  store ptr blockaddress(@main, %12), ptr %2, align 8, !tbaa !5
  br label %3

3:                                                ; preds = %0, %10
  %4 = phi i32 [ 0, %0 ], [ %11, %10 ]
  %5 = phi i32 [ 0, %0 ], [ %5, %10 ]
  switch i32 %4, label %10 [
    i32 0, label %12
    i32 1, label %6
    i32 2, label %7
  ]

6:                                                ; preds = %3
  br label %10

7:                                                ; preds = %3
  %8 = call i32 (ptr, ...) @printf(ptr noundef nonnull dereferenceable(1) @.str, i32 noundef %5)
  %9 = add nsw i32 %5, 1
  br label %12

10:                                               ; preds = %3, %6
  %11 = phi i32 [ %4, %3 ], [ 2, %6 ]
  br label %3, !llvm.loop !9
12:                                               ; preds = %7, %3
  %13 = phi i32 [ %9, %7 ], [ %5, %3 ]
  %14 = call i32 (ptr, ...) @printf(ptr noundef nonnull dereferenceable(1) @.str.1, i32 noundef %
  ret i32 0
}

appujee avatar Aug 01 '24 21:08 appujee

https://r.android.com/3211575

appujee avatar Aug 07 '24 16:08 appujee

Do we need a separate fix for r28? That's currently on clang-r530567, and I think the plan was to keep on that one.

DanAlbert avatar Aug 08 '24 20:08 DanAlbert

The fix is already included in r28 (revision number for the fix is r523414).

pirama-arumuga-nainar avatar Aug 08 '24 22:08 pirama-arumuga-nainar

Great, thanks for confirming.

DanAlbert avatar Aug 08 '24 22:08 DanAlbert

by the way, the patch i have uploaded 'hides' the issue. the real issue is still in llvm trunk. It could take a while to land a fix but I'm preparing a patch for review.

appujee avatar Aug 12 '24 23:08 appujee

I see. Hidden well enough that it's not something a user would encounter any more, or just way less common?

DanAlbert avatar Aug 12 '24 23:08 DanAlbert

So the change in https://r.android.com/3211575 'updates the CFG' in a way that the later (buggy) pass can't see the problematic pattern anymore. Most likely the issue can't be reproduced in the new compiler but the current situation isn't ideal.

appujee avatar Aug 12 '24 23:08 appujee

Right. So, fixed as far as any NDK developer would be able to tell, but there's room for improvement upstream. I'll mark it fixed here once we merge the fix into the NDK then.

DanAlbert avatar Aug 13 '24 22:08 DanAlbert

I believe https://github.com/llvm/llvm-project/commit/fc6bdb8549842613da51b9d570b29e27cc709f69 is the patch causing the bug. I'm preparing a fix but reverting this should also work.

appujee avatar Aug 14 '24 06:08 appujee

https://r.android.com/3218272 reverts the above mentioned patch.

appujee avatar Aug 14 '24 18:08 appujee

Hey, just wanted to update you that r27b indeed fixes minimal repro that was reported here, but sadly it still miscompiles the non-reduced code.

This is not a blocker for us, since we have disabled indirect goto optimizations for any r27 minor ndks builds, and any future NDK major release will have the proper fix.

I don't think it is worth re-opening the issue, because nobody else reported the same problem.

SanjaLV avatar Sep 10 '24 12:09 SanjaLV

Aditya has https://github.com/llvm/llvm-project/pull/103688 to fix this more generally in upstream.

pirama-arumuga-nainar avatar Sep 10 '24 18:09 pirama-arumuga-nainar

I don't think it is worth re-opening the issue, because nobody else reported the same problem.

We'll reopen in case we're able to opportunistically pick up the fix, but ack, we won't make a big deal out of it :)

DanAlbert avatar Sep 10 '24 19:09 DanAlbert

Landed in upstream.

appujee avatar Sep 11 '24 14:09 appujee

@appujee Can you backport to llvm-toolchain, llvm-r530567 and llvm-r522817 branches?

pirama-arumuga-nainar avatar Sep 11 '24 17:09 pirama-arumuga-nainar

Hey, The original (non-reduced) problem still occurs with the r28 beta 1 (28.0.12433566).

Just wanted to make sure this is a known thing. And is it not the case that r28 beta 1 was suppose to ship with cherry-picked fix and there are other problems in clang compiler.

SanjaLV avatar Oct 09 '24 11:10 SanjaLV

No, that's not expected. Thanks for the heads up. @pirama-arumuga-nainar FYI

DanAlbert avatar Oct 09 '24 18:10 DanAlbert

I think the fix hadn't landed when we picked the toolchain for r28 beta 1, so it doesn't inlcude the cherry-pick r.android.com/3267834. The final release will include the fix.

pirama-arumuga-nainar avatar Oct 09 '24 19:10 pirama-arumuga-nainar