rr icon indicating copy to clipboard operation
rr copied to clipboard

Reproducible assertion failure in check_ticks_consistency

Open bddicken opened this issue 9 years ago • 15 comments

I am running into an internal rr assertion failure. The program I originally ran into this assertion failure on is a bit more complicated, but I have boiled it down to a smaller example:

  1 #include <stdio.h>
  2 
  3 #define SIZE 97
  4 
  5 int main () {
  6   int result;
  7   int ia[SIZE];
  8   int i = 0;
  9   int some_number = 1234;
 10 
 11   printf("begin\n");
 12 
 13   for (i = 0; i < SIZE; ++i) {
 14     ia[i] = some_number;
 15   }
 16 
 17   result = ia[SIZE-10];
 18   printf("result = %d\n", result);
 19 
 20   return 0;
 21 }

I am using gcc -g -O0 source01.c -o a.out (gcc version 4.8.4) to compile the program, and I am running rr 4.4.0. I compile the program, run rr record ./a.out, and run rr replay. Then, run the following sequence of commands:

break source01.c:18
continue
watch -l result
reverse-cont

I get the following error from the reverse-cont:

[FATAL /home/roc/rr/rr/src/ReplaySession.cc:847:check_ticks_consistency() errno: SUCCESS] 
 (task 32134 (rec:32128) at time 124)
 -> Assertion `ticks_now == trace_ticks' failed to hold. ticks mismatch for 'SYSCALL: mprotect'; expected 19400, got 19401
Launch gdb with 
  gdb '-l' '-1' '-ex' 'target extended-remote :32134' /home/bddicken/.local/share/rr/latest-trace/mmap_hardlink_3_a.out

Getting this error is reproducible as well. Interestingly, I tried several values for SIZE of 97 and above (97, 98, 99, 100) and I get this error for all of those values. I also tried several values less than 97 (96, 88, 75, 50) and I do not get this assertion error for any of those values. Perhaps this has something to do with it.

I see issues #1734 and #1771 report similar-looking errors. However, both of those issues remain open, and this error is slightly different.

Can someone else reproduce this? And does anyone know of a solution? Thanks.

bddicken avatar Oct 20 '16 16:10 bddicken

Sounds to me like there's a bug when ia extends across a stack page boundary, which would mean this is hard to reproduce—for starters, where the stack page boundaries are depends on your environment variables, or used to. Stack layout randomization should also mean this bug isn't reproducible unless that has been disabled.

If I'm right, printf("%p %p\n", ia, ia+SIZE);, inserted anywhere in the program, should produce two hex numbers that differ in the fourth-to-last digit.

pipcet avatar Oct 20 '16 16:10 pipcet

Sounds to me like there's a bug when ia extends across a stack page boundary, which would mean this is hard to reproduce

Interesting. Looks like ASLR is enabled on my system:

$ cat /proc/sys/kernel/randomize_va_space
2

what makes you think that is the problem? Is rr known to have trouble with this? Considering that I can reproduce the problem across multiple runs, It would seem odd for this to be the issue.

If I'm right, printf("%p %p\n", ia, ia+SIZE);, inserted anywhere in the program, should produce two hex numbers that differ in the fourth-to-last digit.

I tried doing this with SIZE being 96 and got 0x7fffe91a6000 0x7fffe91a6180. With SIZE being 97 I got 0x7fff46655560 0x7fff466556e4. No differences in the fourth-to-last digit in either.

bddicken avatar Oct 20 '16 17:10 bddicken

What assembly does gcc generate for that loop? The REP prefixes are tricky to get right, it's possible you found another bug in our handling of them.

khuey avatar Oct 20 '16 17:10 khuey

Here is the relevant snipped from the versions compiled with 96 and 97:

96:

  printf("begin\n");
  40059c:       bf b4 06 40 00          mov    $0x4006b4,%edi
  4005a1:       e8 aa fe ff ff          callq  400450 <puts@plt>

  for (i = 0; i < SIZE; ++i) {
  4005a6:       c7 85 74 fe ff ff 00    movl   $0x0,-0x18c(%rbp)
  4005ad:       00 00 00  
  4005b0:       eb 1c                   jmp    4005ce <main+0x51>
    ia[i] = some_number;
  4005b2:       8b 85 74 fe ff ff       mov    -0x18c(%rbp),%eax
  4005b8:       48 98                   cltq   
  4005ba:       8b 95 78 fe ff ff       mov    -0x188(%rbp),%edx
  4005c0:       89 94 85 80 fe ff ff    mov    %edx,-0x180(%rbp,%rax,4)
  int i = 0;
  int some_number = 1234;

  printf("begin\n");

  for (i = 0; i < SIZE; ++i) {
  4005c7:       83 85 74 fe ff ff 01    addl   $0x1,-0x18c(%rbp)
  4005ce:       83 bd 74 fe ff ff 5f    cmpl   $0x5f,-0x18c(%rbp)
  4005d5:       7e db                   jle    4005b2 <main+0x35>
    ia[i] = some_number;
  }

  result = ia[SIZE-10];
  4005d7:       8b 45 d8                mov    -0x28(%rbp),%eax
  4005da:       89 85 7c fe ff ff       mov    %eax,-0x184(%rbp)
  printf("result = %d\n", result);
  4005e0:       8b 85 7c fe ff ff       mov    -0x184(%rbp),%eax
  4005e6:       89 c6                   mov    %eax,%esi
  4005e8:       bf ba 06 40 00          mov    $0x4006ba,%edi
  4005ed:       b8 00 00 00 00          mov    $0x0,%eax
  4005f2:       e8 69 fe ff ff          callq  400460 <printf@plt>

97:

  printf("begin\n");
  40059c:       bf b4 06 40 00          mov    $0x4006b4,%edi
  4005a1:       e8 aa fe ff ff          callq  400450 <puts@plt>

  for (i = 0; i < SIZE; ++i) {
  4005a6:       c7 85 64 fe ff ff 00    movl   $0x0,-0x19c(%rbp)
  4005ad:       00 00 00 
  4005b0:       eb 1c                   jmp    4005ce <main+0x51>
    ia[i] = some_number;
  4005b2:       8b 85 64 fe ff ff       mov    -0x19c(%rbp),%eax
  4005b8:       48 98                   cltq
  4005ba:       8b 95 68 fe ff ff       mov    -0x198(%rbp),%edx
  4005c0:       89 94 85 70 fe ff ff    mov    %edx,-0x190(%rbp,%rax,4)
  int i = 0;
  int some_number = 1234;

  printf("begin\n");

  for (i = 0; i < SIZE; ++i) {
  4005c7:       83 85 64 fe ff ff 01    addl   $0x1,-0x19c(%rbp)
  4005ce:       83 bd 64 fe ff ff 60    cmpl   $0x60,-0x19c(%rbp)
  4005d5:       7e db                   jle    4005b2 <main+0x35>
    ia[i] = some_number; 
  }

  result = ia[SIZE-10];
  4005d7:       8b 45 cc                mov    -0x34(%rbp),%eax
  4005da:       89 85 6c fe ff ff       mov    %eax,-0x194(%rbp)
  printf("result = %d\n", result);
  4005e0:       8b 85 6c fe ff ff       mov    -0x194(%rbp),%eax
  4005e6:       89 c6                   mov    %eax,%esi
  4005e8:       bf ba 06 40 00          mov    $0x4006ba,%edi
  4005ed:       b8 00 00 00 00          mov    $0x0,%eax
  4005f2:       e8 69 fe ff ff          callq  400460 <printf@plt>

And a visual diff, for convenience:

96_97_asm_diff

I have also attached the full results of objdump -S -D for each: source01_SIZE_96_objdump.txt source01_SIZE_97_objdump.txt

bddicken avatar Oct 20 '16 17:10 bddicken

Ok, given that, I think it's more likely it's related to the size of the stack frame.

khuey avatar Oct 20 '16 17:10 khuey

Is this bare metal or in a VM?

Can you run with RR_LOG=ReplayTimeline and dump that output? Also, can you dump the results of rr dump around event 124?

rocallahan avatar Oct 20 '16 19:10 rocallahan

Is this bare metal or in a VM?

Bare metal. EDIT: More specifically: Bare-metal, Ubuntu 14.04.4, Kernel 4.2.0-42-generic

you run with RR_LOG=ReplayTimeline and dump that output?

(rr) break source01.c:18
Breakpoint 1 at 0x4005e0: file source01.c, line 18.
(rr) continue
Continuing.
begin

Breakpoint 1, main () at source01.c:18
18    printf("result = %d\n", result);
(rr) watch -l result
Hardware watchpoint 2: -location result
(rr) reverse-cont
Continuing.
[ReplayTimeline] ReplayTimeline::reverse_continue from {time:181,ticks:42649,st:1,regs_ip:0x4005e0}
[ReplayTimeline] Seeked backward from {time:181,ticks:42649,st:1,regs_ip:0x4005e0} to {time:15,ticks:0,st:0,regs_ip:0x7f9d55dc82d0}
[ReplayTimeline] Creating short-checkpoint at {time:29,ticks:455,st:1,regs_ip:0x7f9d55dd2ab9}
[ReplayTimeline] Found watch break at {time:29,ticks:455,st:1,regs_ip:0x7f9d55dd2ab9}
[ReplayTimeline] Found watch break at {time:33,ticks:845,st:3,regs_ip:0x7f9d55dcf800}
[ReplayTimeline] Found breakpoint break at {time:33,ticks:869,st:3,regs_ip:0x7f9d55dd7970}
[ReplayTimeline] Found watch break at {time:50,ticks:1502,st:3,regs_ip:0x7f9d55dc84c2}
[FATAL /home/roc/rr/rr/src/ReplaySession.cc:847:check_ticks_consistency() errno: SUCCESS] 
 (task 18937 (rec:18931) at time 124)
 -> Assertion `ticks_now == trace_ticks' failed to hold. ticks mismatch for 'SYSCALL: mprotect'; expected 19439, got 19440
Launch gdb with 
  gdb '-l' '-1' '-ex' 'target extended-remote :18937' /home/bddicken/.local/share/rr/latest-trace/mmap_hardlink_3_a.out

Also, can you dump the results of rr dump around event 124?

{
  real_time:33704.703660 global_time:119, event:`SYSCALL: mmap' (state:ENTERING_SYSCALL) tid:18931, ticks:7256
 rax:0xffffffffffffffda rbx:0x1040 rcx:0xffffffffffffffff rdx:0x3 rsi:0x2000 rdi:0x0 rbp:0x2000 rsp:0x7ffc579bdfb8 r8:0xffffffff r9:0x0 r10:0x22 r11:0x246 r12:0x1 r13:0x7ffc579be120 r14:0x2ec20bbf6d50 r15:0x7f9d55feb1c8 rip:0x7f9d55de065a eflags:0x246 cs:0x33 ss:0x2b ds:0x0 es:0x0 fs:0x0 gs:0x0 orig_rax:0x9 fs_base:0x0 gs_base:0x0
}
{
  real_time:33704.703671 global_time:120, event:`SYSCALL: mmap' (state:EXITING_SYSCALL) tid:18931, ticks:7256
 rax:0x7f9d55fce000 rbx:0x1040 rcx:0xffffffffffffffff rdx:0x3 rsi:0x2000 rdi:0x0 rbp:0x2000 rsp:0x7ffc579bdfb8 r8:0xffffffff r9:0x0 r10:0x22 r11:0x246 r12:0x1 r13:0x7ffc579be120 r14:0x2ec20bbf6d50 r15:0x7f9d55feb1c8 rip:0x7f9d55de065a eflags:0x246 cs:0x33 ss:0x2b ds:0x0 es:0x0 fs:0x0 gs:0x0 orig_rax:0x9 fs_base:0x0 gs_base:0x0
}
{
  real_time:33704.703691 global_time:121, event:`SYSCALL: arch_prctl' (state:ENTERING_SYSCALL) tid:18931, ticks:7271
 rax:0xffffffffffffffda rbx:0x1 rcx:0xffffffffffffffff rdx:0x7f9d55fcf050 rsi:0x7f9d55fce740 rdi:0x1002 rbp:0x7ffc579be170 rsp:0x7ffc579be000 r8:0xffffffff r9:0x0 r10:0x22 r11:0x246 r12:0x1 r13:0x7ffc579be120 r14:0x2ec20bbf6d50 r15:0x7f9d55feb1c8 rip:0x7f9d55dc7fde eflags:0x246 cs:0x33 ss:0x2b ds:0x0 es:0x0 fs:0x0 gs:0x0 orig_rax:0x9e fs_base:0x0 gs_base:0x0
}
{
  real_time:33704.703699 global_time:122, event:`SYSCALL: arch_prctl' (state:EXITING_SYSCALL) tid:18931, ticks:7271
 rax:0x0 rbx:0x1 rcx:0xffffffffffffffff rdx:0x7f9d55fcf050 rsi:0x7f9d55fce740 rdi:0x1002 rbp:0x7ffc579be170 rsp:0x7ffc579be000 r8:0xffffffff r9:0x0 r10:0x22 r11:0x246 r12:0x1 r13:0x7ffc579be120 r14:0x2ec20bbf6d50 r15:0x7f9d55feb1c8 rip:0x7f9d55dc7fde eflags:0x246 cs:0x33 ss:0x2b ds:0x0 es:0x0 fs:0x0 gs:0x0 orig_rax:0x9e fs_base:0x7f9d55fce740 gs_base:0x0
}
{
  real_time:33704.703711 global_time:123, event:`SEGV_RDTSC' tid:18931, ticks:7280
 rax:0xc14d78e rbx:0x0 rcx:0x0 rdx:0x2ec2 rsi:0x7f9d55fce740 rdi:0x1002 rbp:0x7ffc579be170 rsp:0x7ffc579be010 r8:0xffffffff r9:0x0 r10:0x22 r11:0x246 r12:0x1 r13:0x7ffc579be120 r14:0x0 r15:0x7f9d55feb1c8 rip:0x7f9d55dcaa7d eflags:0x10202 cs:0x33 ss:0x2b ds:0x0 es:0x0 fs:0x0 gs:0x0 orig_rax:0xffffffffffffffff fs_base:0x7f9d55fce740 gs_base:0x0
}
{
  real_time:33704.703767 global_time:124, event:`SYSCALL: mprotect' (state:ENTERING_SYSCALL) tid:18931, ticks:19439
 rax:0xffffffffffffffda rbx:0x7f9d557d5000 rcx:0xffffffffffffffff rdx:0x1 rsi:0x4000 rdi:0x7f9d55b8f000 rbp:0x7ffc579be000 rsp:0x7ffc579bdec8 r8:0x2c06f28 r9:0x7f9d55feb760 r10:0x7ffc579bdca0 r11:0x246 r12:0x7f9d557f43d0 r13:0x7f9d557f43b8 r14:0x3be018 r15:0x7f9d55fe89b0 rip:0x7f9d55de06a7 eflags:0x246 cs:0x33 ss:0x2b ds:0x0 es:0x0 fs:0x0 gs:0x0 orig_rax:0xa fs_base:0x7f9d55fce740 gs_base:0x0
}
{
  real_time:33704.703781 global_time:125, event:`SYSCALL: mprotect' (state:EXITING_SYSCALL) tid:18931, ticks:19439
 rax:0x0 rbx:0x7f9d557d5000 rcx:0xffffffffffffffff rdx:0x1 rsi:0x4000 rdi:0x7f9d55b8f000 rbp:0x7ffc579be000 rsp:0x7ffc579bdec8 r8:0x2c06f28 r9:0x7f9d55feb760 r10:0x7ffc579bdca0 r11:0x246 r12:0x7f9d557f43d0 r13:0x7f9d557f43b8 r14:0x3be018 r15:0x7f9d55fe89b0 rip:0x7f9d55de06a7 eflags:0x246 cs:0x33 ss:0x2b ds:0x0 es:0x0 fs:0x0 gs:0x0 orig_rax:0xa fs_base:0x7f9d55fce740 gs_base:0x0
}
{
  real_time:33704.703801 global_time:126, event:`SYSCALL: mprotect' (state:ENTERING_SYSCALL) tid:18931, ticks:21858
 rax:0xffffffffffffffda rbx:0x7f9d553b3000 rcx:0xffffffffffffffff rdx:0x1 rsi:0x1000 rdi:0x7f9d555cb000 rbp:0x7ffc579be000 rsp:0x7ffc579bdec8 r8:0x7f9d55fe89b0 r9:0x7f9d55fe89b0 r10:0x0 r11:0x246 r12:0x0 r13:0x0 r14:0x7f9d553b46f0 r15:0x7f9d55fd14f0 rip:0x7f9d55de06a7 eflags:0x246 cs:0x33 ss:0x2b ds:0x0 es:0x0 fs:0x0 gs:0x0 orig_rax:0xa fs_base:0x7f9d55fce740 gs_base:0x0
}
{
  real_time:33704.703814 global_time:127, event:`SYSCALL: mprotect' (state:EXITING_SYSCALL) tid:18931, ticks:21858
 rax:0x0 rbx:0x7f9d553b3000 rcx:0xffffffffffffffff rdx:0x1 rsi:0x1000 rdi:0x7f9d555cb000 rbp:0x7ffc579be000 rsp:0x7ffc579bdec8 r8:0x7f9d55fe89b0 r9:0x7f9d55fe89b0 r10:0x0 r11:0x246 r12:0x0 r13:0x0 r14:0x7f9d553b46f0 r15:0x7f9d55fd14f0 rip:0x7f9d55de06a7 eflags:0x246 cs:0x33 ss:0x2b ds:0x0 es:0x0 fs:0x0 gs:0x0 orig_rax:0xa fs_base:0x7f9d55fce740 gs_base:0x0
}
{
  real_time:33704.703832 global_time:128, event:`SYSCALL: mprotect' (state:ENTERING_SYSCALL) tid:18931, ticks:23861
 rax:0xffffffffffffffda rbx:0x7f9d555d1000 rcx:0xffffffffffffffff rdx:0x1 rsi:0x1000 rdi:0x7f9d557d3000 rbp:0x7ffc579be000 rsp:0x7ffc579bdec8 r8:0x7f9d55fe89b0 r9:0x7f9d55fe89b0 r10:0x0 r11:0x246 r12:0x0 r13:0x0 r14:0x7f9d555d15b8 r15:0x7f9d55fd1000 rip:0x7f9d55de06a7 eflags:0x246 cs:0x33 ss:0x2b ds:0x0 es:0x0 fs:0x0 gs:0x0 orig_rax:0xa fs_base:0x7f9d55fce740 gs_base:0x0
}
{
  real_time:33704.703845 global_time:129, event:`SYSCALL: mprotect' (state:EXITING_SYSCALL) tid:18931, ticks:23861
 rax:0x0 rbx:0x7f9d555d1000 rcx:0xffffffffffffffff rdx:0x1 rsi:0x1000 rdi:0x7f9d557d3000 rbp:0x7ffc579be000 rsp:0x7ffc579bdec8 r8:0x7f9d55fe89b0 r9:0x7f9d55fe89b0 r10:0x0 r11:0x246 r12:0x0 r13:0x0 r14:0x7f9d555d15b8 r15:0x7f9d55fd1000 rip:0x7f9d55de06a7 eflags:0x246 cs:0x33 ss:0x2b ds:0x0 es:0x0 fs:0x0 gs:0x0 orig_rax:0xa fs_base:0x7f9d55fce740 gs_base:0x0
}

bddicken avatar Oct 20 '16 19:10 bddicken

Thanks! Mysterious ... we'll have to take some shots in the dark.

If you do run instead of reverse-continue, you should see that watchpoint being hit several times corresponding to the hits in the ReplayTimeline log. It would be worth getting a stack trace for each of those, with a disassembly around each instruction where the watchpoint was hit.

Another thing to try would be doing rr record --checksum=on-all-events and then rr replay --checksum=on-all-events and see if that produces anything interesting.

rocallahan avatar Oct 20 '16 20:10 rocallahan

If you do run instead of reverse-continue, you should see that watchpoint being hit several times corresponding to the hits in the ReplayTimeline log.

I don't really understand. Doesn't executing run just re-start the execution from the beginning? How will that cause the watchpoint to be hit? I tried doing so and didn't see it get hit at all.

Another thing to try would be doing rr record --checksum=on-all-events and then rr replay --checksum=on-all-events and see if that produces anything interesting.

I tried this, and the run looks the the same as it did without checksums :(

bddicken avatar Oct 21 '16 22:10 bddicken

I can reproduce this.

khuey avatar Oct 21 '16 22:10 khuey

Unfortunately I haven't made much progress here. The only interesting thing I've noticed is that one of the writes to the watchpoint comes from a syscall result (specifically a read that fills a buffer which includes the address that ultimately becomes result). That write is not reported by gdb. It happens some time before things go off the rails and we hit the ticks mismatch though.

khuey avatar Oct 27 '16 01:10 khuey

Although it may not be being reported because the value did not change.

khuey avatar Oct 27 '16 01:10 khuey

I am not very familiar with rr/gdb internals, so not sure how much of a help I can be here. If there is anything I can do to help, let me know.

bddicken avatar Oct 27 '16 16:10 bddicken

Doesn't executing run just re-start the execution from the beginning?

Yes

How will that cause the watchpoint to be hit?

That memory location will be written a few times during the course of execution.

I tried doing so and didn't see it get hit at all.

Odd. Maybe if you try taking the address of result manually and then watch -l *(int*)<addr>?

@khuey can you verify using checksums that the memory state before the final, fatal replay step is exactly right? Another thing you can do is patch ReplayTimeline so that when we do the final, fatal replay step, we single-step through some number of instructions and then do a continue. Knowing the minimum execution segment that must be executed continuously to trigger the divergence would be helpful.

rocallahan avatar Oct 27 '16 23:10 rocallahan

Is this still an issue with a recent rr?

GitMensch avatar Jul 26 '21 08:07 GitMensch