
Intermittent counter interrupts missed on AWS Aarch64 Graviton instances

Open rocallahan opened this issue 2 years ago • 28 comments

(rr) break abort_ip
Breakpoint 1 at 0xaaaacde00ef4
(rr) c
Continuing.
[FATAL src/PerfCounters.cc:824:read_ticks()] 
 (task 2569834 (rec:2567804) at time 189)
 -> Assertion `!counting_period || interrupt_val <= adjusted_counting_period' failed to hold. Detected 48306063 ticks, expected no more than 24154018
Tail of trace dump:
...
{
  real_time:1808.840407 global_time:188, event:`SYSCALL: sched_yield' (state:ENTERING_SYSCALL) tid:2567752, ticks:435
x[0]:0x4 x[1]:0xe x[2]:0x0 x[3]:0x0 x[4]:0xffff846f72f0 x[5]:0x64 x[6]:0xfffffff9 x[7]:0x0 x[8]:0x7c x[9]:0xffff8467f861 x[10]:0x0 x[11]:0xa x[12]:0x272e7c x[13]:0x0 x[14]:0x1 x[15]:0xffffffff x[16]:0xaaaacde11f90 x[17]:0xffff84857344 x[18]:0x2 x[19]:0xffffe2e4a288 x[20]:0x1 x[21]:0xaaaacde11d00 x[22]:0xffff848ba040 x[23]:0xaaaacde014f0 x[24]:0xffff8481b000 x[25]:0x0 x[26]:0xffffe2e4a298 x[27]:0xaaaacde11d00 x[28]:0x0 x[29]:0xffff8467ff70 x[30]:0xaaaacde013e4 sp:0xffff8467ff70 pc:0xffff8485735c pstate:0x80001000
}
{
  real_time:1808.900190 global_time:189, event:`SCHED' tid:2567804, ticks:93949855
x[0]:0x0 x[1]:0x4d2 x[2]:0x4d2 x[3]:0xffff848bb038 x[4]:0xffff848bb190 x[5]:0x1 x[6]:0xffff848a9020 x[7]:0xffff84695c00 x[8]:0x63 x[9]:0x30 x[10]:0xaaaacde00fb8 x[11]:0x10f00 x[12]:0x0 x[13]:0x0 x[14]:0x1 x[15]:0xffffffff x[16]:0xaaaacde11f48 x[17]:0xffff84736b10 x[18]:0x2 x[19]:0xffffe2e4a288 x[20]:0x1 x[21]:0xaaaacde11d00 x[22]:0xffff848ba040 x[23]:0xaaaacde014f0 x[24]:0xffff8481b000 x[25]:0x0 x[26]:0xffffe2e4a298 x[27]:0xaaaacde11d00 x[28]:0x0 x[29]:0xffff8467ff80 x[30]:0xaaaacde0133c sp:0xffff8467ff80 pc:0xaaaacde01338 pstate:0x60001000
}
{
  real_time:1808.900300 global_time:190, event:`SYSCALL: sched_yield' (state:EXITING_SYSCALL) tid:2567752, ticks:435
x[0]:0x0 x[1]:0xe x[2]:0x0 x[3]:0x0 x[4]:0xffff846f72f0 x[5]:0x64 x[6]:0xfffffff9 x[7]:0x0 x[8]:0x7c x[9]:0xffff8467f861 x[10]:0x0 x[11]:0xa x[12]:0x272e7c x[13]:0x0 x[14]:0x1 x[15]:0xffffffff x[16]:0xaaaacde11f90 x[17]:0xffff84857344 x[18]:0x2 x[19]:0xffffe2e4a288 x[20]:0x1 x[21]:0xaaaacde11d00 x[22]:0xffff848ba040 x[23]:0xaaaacde014f0 x[24]:0xffff8481b000 x[25]:0x0 x[26]:0xffffe2e4a298 x[27]:0xaaaacde11d00 x[28]:0x0 x[29]:0xffff8467ff70 x[30]:0xaaaacde013e4 sp:0xffff8467ff70 pc:0xffff8485735c pstate:0x80001000
}

rocallahan avatar Sep 17 '23 21:09 rocallahan

I can reproduce this in about 1 run out of 100 parallel rseq runs.

rocallahan avatar Sep 17 '23 21:09 rocallahan

The instruction that we're aiming for and missing is at 0xaaaacde01338:

   0x0000aaaacde01338 <+684>:	bl	0xaaaacde00ea8 <do_section>

rocallahan avatar Sep 17 '23 21:09 rocallahan

This failure is nondeterministic during replay: when I repeat the debug script with a recording that failed during my parallel test runs, it works.

rocallahan avatar Sep 17 '23 21:09 rocallahan

This is exactly the same code that passes 1000 parallel cont_race runs, and it also passed 100% of the tests other than rseq, so I think this is a problem specific to rseq, but I'm not 100% sure of that.

rocallahan avatar Sep 17 '23 21:09 rocallahan

OK, here is a particularly simple failure:

[ReplaySession] [event 173] 2604702: replaying SYSCALLBUF_FLUSH; state  (none)
[ReplaySession]     (syscllbufsz:0, abrtcmt:0, locked:0)
[ReplaySession] Prepared 16 bytes of syscall records
[ReplaySession] Adding breakpoint
[ReplaySession] Removing breakpoint 0xb7f (STOP-SIGSEGV)
[ReplaySession] [event 174] 2604702: replaying SCHED; state  (none)
[ReplaySession]     (syscllbufsz:16, abrtcmt:0, locked:0)
[ReplaySession] advancing 1820964 ticks to reach 1821058/0xaaaacc910f78
[ReplaySession]   programming interrupt for 1819964 ticks
[FATAL src/PerfCounters.cc:824:read_ticks()] 
 (task 2610778 (rec:2604702) at time 174)
 -> Assertion `!counting_period || interrupt_val <= adjusted_counting_period' failed to hold. Detected 3639959 ticks, expected no more than 1820964
Tail of trace dump:
...
{
  real_time:3174.150590 global_time:173, event:`SYSCALLBUF_FLUSH' tid:2604702, ticks:1821058
  { syscall:'set_robust_list', ret:0x0, size:0x10 }
  { tid:2604702, addr:0x70010480, length:0x4 }
}
{
  real_time:3174.150597 global_time:174, event:`SCHED' tid:2604702, ticks:1821058
x[0]:0xaaaacc922060 x[1]:0x4d2 x[2]:0x4d2 x[3]:0xffffa6020038 x[4]:0xffffa6020190 x[5]:0x1 x[6]:0xffffa600e020 x[7]:0xffffa5df5c00 x[8]:0x63 x[9]:0x30 x[10]:0xaaaacc910fb8 x[11]:0x10f00 x[12]:0x0 x[13]:0x0 x[14]:0x1 x[15]:0xffffffff x[16]:0xaaaacc921f48 x[17]:0xffffa5e96b10 x[18]:0x2 x[19]:0xffffcf7c4f18 x[20]:0x1 x[21]:0xaaaacc921d00 x[22]:0xffffa601f040 x[23]:0xaaaacc9114f0 x[24]:0xffffa5f7b000 x[25]:0x0 x[26]:0xffffcf7c4f28 x[27]:0xaaaacc921d00 x[28]:0x0 x[29]:0xffffa5ddff60 x[30]:0xaaaacc91133c sp:0xffffa5ddff60 pc:0xaaaacc910f78 pstate:0x60001000
}
{
  real_time:3174.150602 global_time:175, event:`SYSCALLBUF_RESET' tid:2604702, ticks:1821058
}

rocallahan avatar Sep 17 '23 21:09 rocallahan

It's pretty clear that we just missed an interrupt and kept going until the counter rolled over again. Note that 1819964*2 = 3639928, which is just a little less than 3639959. So I suspect a kernel bug where an interrupt occurring at the wrong time during rseq processing is dropped.

Things to try next:

  • This is kernel 6.2. Is it fixed in newer kernels?
  • Try to write a reduced testcase not involving rr.

rocallahan avatar Sep 17 '23 21:09 rocallahan

So I suspect a kernel bug where an interrupt occurring at the wrong time during rseq processing is dropped.

This is obviously wrong. The bug occurs during replay, not recording, and rseq is purely emulated during replay, so it can't be involved.

I suspect this is a generic PMU interrupt bug that shows up in the rseq test because that test happens to require a lot of PMU interrupts during replay: the child process is interrupted by many SCHED events.

rocallahan avatar Sep 18 '23 07:09 rocallahan

https://github.com/rr-debugger/rr/commit/301901f4e01cdf0eb290db72f3e64ec97801ebca extends the standalone counters test to support Aarch64 and to allow more intensive testing of interrupts. Running

for i in `seq 1 100`; do ~/obj/bin/counters 1000 10000 > /tmp/output$i & done

on Ubuntu 22 LTS (6.2.0-1012-aws kernel), I get 20 failures out of 100 tests (each of which tries 1000 interrupts and fails at the first one that misses the skid). 18 of those failures interrupted at around 20K ticks instead of 10K; one failure interrupted at 30K ticks and one at 40K ticks.

rocallahan avatar Sep 21 '23 12:09 rocallahan

It's interesting that our tests pass despite this level of unreliability. I would say this makes c6g.8xlarge unusable.

rocallahan avatar Sep 21 '23 12:09 rocallahan

Is metal OK? Do you have a connection to the AWS Graviton team, or would you like me to make an intro? I haven't spoken to them in a few years, but they were very helpful when I did the original Aarch64 port.

Keno avatar Sep 21 '23 12:09 Keno

c7g.8xlarge is much better, but I still see one failure every few thousand test runs.

rocallahan avatar Sep 21 '23 13:09 rocallahan

I still get failures on c6g.metal, but at a lower rate: one or two per 500 runs. Maybe fewer just because it has more cores.

rocallahan avatar Sep 21 '23 13:09 rocallahan

Do you have a connection to the AWS Graviton team, or would you like me to make an intro?

I don't. An intro would be great.

rocallahan avatar Sep 21 '23 13:09 rocallahan

Please CC me on that too.

khuey avatar Sep 21 '23 13:09 khuey

I stripped down that counters.cc file as much as possible to the simplest reproducer. It is aarch64-only (GitHub won't allow attaching a .cc file, so it's renamed to .txt): minimal-aarch64-counters.txt

If you run it as-is, it will show the same issue as before:

-bash-4.2$ ./a.out
Pid 42151 running 4294967295 interrupts with period 10000 skid size 1000
Interrupted after 20001 ticks on interrupt 2650640, outside expected interrupt period 10000 +/- skid size 1000
Interrupted after 19998 ticks on interrupt 2650660, outside expected interrupt period 10000 +/- skid size 1000

The timings are tighter because it isn't resetting the counter on each interrupt as the original did.
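
For reference, the core of such a reproducer can be sketched as below. This is a minimal sketch, not the attached file: the choice of PERF_COUNT_HW_BRANCH_INSTRUCTIONS as a stand-in tick event, the period/skid values, and the output format are all assumptions. It follows the signal-per-overflow pattern from the perf_event_open(2) man page (O_ASYNC + F_SETSIG, re-armed with PERF_EVENT_IOC_REFRESH):

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <linux/perf_event.h>
    #include <signal.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <sys/ioctl.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    #define PERIOD 10000
    #define SKID 1000

    static int fd;
    static long long last;

    static long long read_count(void) {
        long long v;
        if (read(fd, &v, sizeof(v)) != sizeof(v))
            abort();
        return v;
    }

    // Overflow signal handler. (printf isn't async-signal-safe, but it
    // is good enough for a test program.)
    static void on_overflow(int sig, siginfo_t* si, void* ctx) {
        (void)sig; (void)si; (void)ctx;
        long long now = read_count();
        // A dropped overflow interrupt shows up as a delta of ~2x PERIOD.
        if (now - last < PERIOD - SKID || now - last > PERIOD + SKID)
            printf("Interrupted after %lld ticks, outside period %d +/- %d\n",
                   now - last, PERIOD, SKID);
        last = now;
        ioctl(fd, PERF_EVENT_IOC_REFRESH, 1); // re-arm for one more overflow
    }

    int main(void) {
        struct sigaction sa;
        memset(&sa, 0, sizeof(sa));
        sa.sa_sigaction = on_overflow;
        sa.sa_flags = SA_SIGINFO | SA_RESTART;
        sigaction(SIGIO, &sa, NULL);

        struct perf_event_attr attr;
        memset(&attr, 0, sizeof(attr));
        attr.size = sizeof(attr);
        attr.type = PERF_TYPE_HARDWARE;
        attr.config = PERF_COUNT_HW_BRANCH_INSTRUCTIONS; // stand-in tick event
        attr.sample_period = PERIOD;
        attr.exclude_kernel = 1;
        attr.disabled = 1;

        fd = syscall(SYS_perf_event_open, &attr, 0, -1, -1, 0);
        if (fd < 0) { perror("perf_event_open"); return 1; }

        // Deliver SIGIO to this process on each counter overflow.
        fcntl(fd, F_SETFL, fcntl(fd, F_GETFL) | O_ASYNC);
        fcntl(fd, F_SETSIG, SIGIO);
        fcntl(fd, F_SETOWN, getpid());

        ioctl(fd, PERF_EVENT_IOC_REFRESH, 1); // enable until first overflow
        for (;;) {
            // Busy work that retires branches so the counter advances.
            for (volatile int i = 0; i < 1000000; i++) {}
        }
    }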

What I've found is that the key to the issue is core migrations. You can ping-pong the process between cores and dramatically increase the rate at which the problem is observed:

while : ; do for i in 0x1 0x2 ; do taskset -p $i 42151 ; done ; done

From my experiments using the above, I've found that

  • this happens on kernels 5.4, 5.10 and 6.1 (AL2 for the former two and AL2023 for the latter)
  • this happens very frequently on c6g.12xlarge
  • this happens very infrequently on c7g.12xlarge
  • this happens a lot less frequently on c6g.metal than c6g.12xlarge
  • this happens a lot more frequently on c7g.metal than c7g.12xlarge
  • this happens about the same frequency on c6g.metal and c7g.metal

I only ran it for about an hour on x86 (a c6i instance), but I saw no sign of the problem there.

I haven't done anything toward identifying where it's happening, but I suspect it has to do with the interrupt firing around the time the process migrates to another core. These are not synchronous exceptions, so one could arrive after the process has been moved to another core, when the kernel no longer has the state necessary to handle it appropriately. That's just a theory at the moment.

wash-amzn avatar Oct 23 '23 18:10 wash-amzn

I made the following post to the linux-perf-users list describing what I've found:

armv8pmu: Pending overflow interrupt is discarded when perf event is disabled

An issue [1] was opened in the rr-debugger project reporting occasional missed
perf event overflow signals on arm64. I've been digging into this and think I
understand what's happening, but wanted to confirm my understanding.

The attached example application, derived from an rr-debugger test case, reports
when the value of a counter doesn't increase by the expected period +/- some
tolerance. When it is ping-ponged between cores (e.g. with taskset) at a high
frequency, it frequently reports increases of ~2x the expected period. I've
confirmed this same behavior on kernels 5.4, 5.10, 6.1 and 6.5.

I found armv8pmu_disable_intens [2], which is called as part of event
de-scheduling and contains
    /* Clear the overflow flag in case an interrupt is pending. */
    write_pmovsclr(mask);
which results in any pending overflow interrupt being dropped. I added some
debug output here, and indeed there is a correlation between this bit being set
at that point and the reproducer identifying a missed signal.

This behavior does not occur with pseudo-NMIs (irqchip.gicv3_pseudo_nmi=1)
enabled.

When an event is not being explicitly torn down (e.g. being closed), this seems
like undesirable behavior. I haven't attempted to demonstrate it yet, but I
suspect an application disabling an event temporarily could occasionally see the
same missed overflow signals. Is my understanding here correct? Does anyone have
thoughts on how this could be addressed without creating other issues?

[1] https://github.com/rr-debugger/rr/issues/3607
[2] https://github.com/torvalds/linux/blob/c42d9eeef8e5ba9292eda36fd8e3c11f35ee065c/drivers/perf/arm_pmuv3.c#L652C20-L652C43

I have no idea whether a fix will be proposed, written, much less accepted. At this point the best workaround is to use pseudo-NMIs (add the kernel boot argument irqchip.gicv3_pseudo_nmi=1).

wash-amzn avatar Nov 20 '23 18:11 wash-amzn

Thanks for tracking that down. It's unfortunate that you didn't get any feedback. It does look like the overflow state needs to be saved and restored if/when the counter is re-enabled. Is there anyone whose job covers fixing bugs in this code?
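
For concreteness, the shape of that idea might be something like the sketch below. This is purely illustrative, not a real patch: read_pmovsclr() and the per-CPU latch are assumptions about how the state could be kept, not actual arm_pmuv3.c code.

    static DEFINE_PER_CPU(u32, saved_pmovs);  /* hypothetical latch */

    static void armv8pmu_disable_intens(u32 mask)
    {
        write_pmintenclr(mask);
        isb();
        /*
         * Sketch: latch any pending overflow instead of discarding it,
         * so the enable path can re-raise it when the event comes back.
         * read_pmovsclr() is assumed here.
         */
        this_cpu_or(saved_pmovs, read_pmovsclr() & mask);
        /* Clear the overflow flag in case an interrupt is pending. */
        write_pmovsclr(mask);
        isb();
    }

The corresponding enable path would then consult saved_pmovs and synthesize the missed overflow; making that interact safely with the real event lifecycle is presumably the hard part.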

rocallahan avatar Nov 24 '23 20:11 rocallahan

How hard a dependency does rr have on never missing an overflow signal here? Does that assertion failure need to be fatal?

wash-amzn avatar Nov 27 '23 15:11 wash-amzn

Is there anyone whose job covers fixing bugs in this code?

I would suggest just bumping the thread this week, perhaps with a direct CC to Will (now at Google, formerly at ARM) and Catalin (still at ARM), who wrote this code. I've usually found them quite responsive, but this came in over the holiday week, of course, so they probably just missed it.

Keno avatar Nov 27 '23 21:11 Keno

Mark Rutland has made some comments on that email thread, and is apparently looking into a fix:

I'll get back to looking at how we can better capture the overflow when removing an event.

rocallahan avatar Dec 01 '23 21:12 rocallahan

Does that assertion failure need to be fatal?

Yes. During replay, we request a PMU interrupt to stop program execution before it reaches the point where some asynchronous event (e.g. a signal delivery) needs to be replayed. If that interrupt doesn't happen, the replay continues past the point where the async event should have occurred, so we inject the async event late, when the program is in the wrong state. The replay then no longer matches what happened during recording.

Theoretically we could work around this by cloning the program state before requesting the PMU interrupt, and if the PMU interrupt is missed, restoring state from the clone and retrying. But that would be complex and slow and we certainly wouldn't do it just to work around a bug in one specific PMU driver.

rocallahan avatar Dec 01 '23 21:12 rocallahan

How do you accurately inject these events? Even without missed signals here, these signals' timing isn't exact.

wash-amzn avatar Dec 01 '23 21:12 wash-amzn

We assume that there is a maximum number of events that can occur between the counter overflowing and the program being interrupted; we call this the "skid" and configure it per microarchitecture based on experiments. Then we program the interrupt to happen "early enough", and after it fires, we use breakpoints to advance program execution to the correct injection point.

We don't want to use breakpoints all the time, because that approach can be really slow.
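
In pseudocode, the strategy looks roughly like this. Every name here is a hypothetical helper for illustration, not rr's actual internals:

    #include <stdint.h>

    typedef struct Task Task;         // opaque tracee handle (hypothetical)
    #define SKID 1000                 // per-microarch value found by experiment

    uint64_t current_ticks(Task* t);  // read the tracee's tick counter
    uintptr_t ip(Task* t);            // tracee's current program counter
    void program_pmu_interrupt(Task* t, uint64_t ticks_from_now);
    void resume_until_stop(Task* t);  // PTRACE_CONT until the next stop
    void set_breakpoint(Task* t, uintptr_t addr);
    void remove_breakpoint(Task* t, uintptr_t addr);

    void advance_to_async_event(Task* t, uint64_t target_ticks,
                                uintptr_t target_ip) {
        // Request the overflow interrupt SKID ticks early, so even the
        // maximum observed skid cannot carry execution past the target.
        program_pmu_interrupt(t, target_ticks - current_ticks(t) - SKID);
        resume_until_stop(t);         // run freely until the overflow signal

        // Close the remaining gap with a breakpoint at the recorded IP:
        // continue to it repeatedly until the tick count matches exactly.
        set_breakpoint(t, target_ip);
        while (current_ticks(t) < target_ticks || ip(t) != target_ip)
            resume_until_stop(t);
        remove_breakpoint(t, target_ip);
    }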

rocallahan avatar Dec 01 '23 22:12 rocallahan

In https://github.com/rr-debugger/rr/commit/b1a5ffbfa6a07899a54c56288731f5472521259a I just had to modify one of our tests because it was replaying incredibly slowly on Zen, which requires a very high skid value (10K events). The test had a tight loop that retires one conditional branch per iteration. During replay, we need to deliver an async signal, and the PMU interrupt stops the program about 8K events early. Then, to advance to the actual moment when the signal needs to be delivered, we set a breakpoint at the address the program was at when the signal was delivered during recording and PTRACE_CONT to that breakpoint 8K times. This was excruciatingly slow.

rocallahan avatar Dec 01 '23 22:12 rocallahan

Theoretically we could work around this by cloning the program state before requesting the PMU interrupt, and if the PMU interrupt is missed, restoring state from the clone and retrying. But that would be complex and slow and we certainly wouldn't do it just to work around a bug in one specific PMU driver.

Would it be possible to not bother with cloning directly beforehand and instead retry from whatever other clone there is (i.e. including from the very start of execution)? That would of course sometimes be quite slow, but better than a crash, and it doesn't cost anything when things stay within the skid (and it could perhaps even allow lowering the default skid).

dzaima avatar Dec 01 '23 22:12 dzaima

That might work OK for rr replay with a debugger, where we take fairly frequent checkpoints to make reverse-stepping fast. It wouldn't really work for rr replay -a and other non-debugger uses of rr (e.g. in Pernosco).

Surfacing these errors instead of silently giving glitchy performance helps us get these bugs fixed, so there's a tradeoff there.

rocallahan avatar Dec 01 '23 22:12 rocallahan

Mark's latest post https://lore.kernel.org/linux-perf-users/ZW9ooVL32WB8U4YB@FVFF77S0Q05N/ has prototype changes to address this, but he also warns it might be too big to backport to older kernels.

wash-amzn avatar Dec 05 '23 18:12 wash-amzn

I don't think we really care about backports as long as it gets fixed on mainline.

rocallahan avatar Dec 05 '23 19:12 rocallahan