Reproducible assertion failure in check_ticks_consistency
I am running into an internal rr assertion failure. The program I originally ran into this assertion failure on is a bit more complicated, but I have boiled it down to a smaller example:
1 #include <stdio.h>
2
3 #define SIZE 97
4
5 int main () {
6 int result;
7 int ia[SIZE];
8 int i = 0;
9 int some_number = 1234;
10
11 printf("begin\n");
12
13 for (i = 0; i < SIZE; ++i) {
14 ia[i] = some_number;
15 }
16
17 result = ia[SIZE-10];
18 printf("result = %d\n", result);
19
20 return 0;
21 }
I am using gcc -g -O0 source01.c -o a.out (gcc version 4.8.4) to compile the program, and I am running rr 4.4.0. I compile the program, run rr record ./a.out, and run rr replay. Then, run the following sequence of commands:
break source01.c:18
continue
watch -l result
reverse-cont
I get the following error from the reverse-cont:
[FATAL /home/roc/rr/rr/src/ReplaySession.cc:847:check_ticks_consistency() errno: SUCCESS]
(task 32134 (rec:32128) at time 124)
-> Assertion `ticks_now == trace_ticks' failed to hold. ticks mismatch for 'SYSCALL: mprotect'; expected 19400, got 19401
Launch gdb with
gdb '-l' '-1' '-ex' 'target extended-remote :32134' /home/bddicken/.local/share/rr/latest-trace/mmap_hardlink_3_a.out
Getting this error is reproducible as well. Interestingly, I tried several values for SIZE of 97 and above (97, 98, 99, 100) and I get this error for all of those values. I also tried several values less than 97 (96, 88, 75, 50) and I do not get this assertion error for any of those values. Perhaps this has something to do with it.
I see issues #1734 and #1771 report similar-looking errors. However, both of those issues remain open, and this error is slightly different.
Can someone else reproduce this? And does anyone know of a solution? Thanks.
Sounds to me like there's a bug when ia extends across a stack page boundary, which would mean this is hard to reproduce—for starters, where the stack page boundaries are depends on your environment variables, or used to. Stack layout randomization should also mean this bug isn't reproducible unless that has been disabled.
If I'm right, printf("%p %p\n", ia, ia+SIZE);, inserted anywhere in the program, should produce two hex numbers that differ in the fourth-to-last digit.
Sounds to me like there's a bug when ia extends across a stack page boundary, which would mean this is hard to reproduce
Interesting. Looks like ASLR is enabled on my system:
$ cat /proc/sys/kernel/randomize_va_space
2
what makes you think that is the problem? Is rr known to have trouble with this? Considering that I can reproduce the problem across multiple runs, It would seem odd for this to be the issue.
If I'm right,
printf("%p %p\n", ia, ia+SIZE);, inserted anywhere in the program, should produce two hex numbers that differ in the fourth-to-last digit.
I tried doing this with SIZE being 96 and got 0x7fffe91a6000 0x7fffe91a6180. With SIZE being 97 I got 0x7fff46655560 0x7fff466556e4. No differences in the fourth-to-last digit in either.
What assembly does gcc generate for that loop? The REP prefixes are tricky to get right, it's possible you found another bug in our handling of them.
Here is the relevant snipped from the versions compiled with 96 and 97:
96:
printf("begin\n");
40059c: bf b4 06 40 00 mov $0x4006b4,%edi
4005a1: e8 aa fe ff ff callq 400450 <puts@plt>
for (i = 0; i < SIZE; ++i) {
4005a6: c7 85 74 fe ff ff 00 movl $0x0,-0x18c(%rbp)
4005ad: 00 00 00
4005b0: eb 1c jmp 4005ce <main+0x51>
ia[i] = some_number;
4005b2: 8b 85 74 fe ff ff mov -0x18c(%rbp),%eax
4005b8: 48 98 cltq
4005ba: 8b 95 78 fe ff ff mov -0x188(%rbp),%edx
4005c0: 89 94 85 80 fe ff ff mov %edx,-0x180(%rbp,%rax,4)
int i = 0;
int some_number = 1234;
printf("begin\n");
for (i = 0; i < SIZE; ++i) {
4005c7: 83 85 74 fe ff ff 01 addl $0x1,-0x18c(%rbp)
4005ce: 83 bd 74 fe ff ff 5f cmpl $0x5f,-0x18c(%rbp)
4005d5: 7e db jle 4005b2 <main+0x35>
ia[i] = some_number;
}
result = ia[SIZE-10];
4005d7: 8b 45 d8 mov -0x28(%rbp),%eax
4005da: 89 85 7c fe ff ff mov %eax,-0x184(%rbp)
printf("result = %d\n", result);
4005e0: 8b 85 7c fe ff ff mov -0x184(%rbp),%eax
4005e6: 89 c6 mov %eax,%esi
4005e8: bf ba 06 40 00 mov $0x4006ba,%edi
4005ed: b8 00 00 00 00 mov $0x0,%eax
4005f2: e8 69 fe ff ff callq 400460 <printf@plt>
97:
printf("begin\n");
40059c: bf b4 06 40 00 mov $0x4006b4,%edi
4005a1: e8 aa fe ff ff callq 400450 <puts@plt>
for (i = 0; i < SIZE; ++i) {
4005a6: c7 85 64 fe ff ff 00 movl $0x0,-0x19c(%rbp)
4005ad: 00 00 00
4005b0: eb 1c jmp 4005ce <main+0x51>
ia[i] = some_number;
4005b2: 8b 85 64 fe ff ff mov -0x19c(%rbp),%eax
4005b8: 48 98 cltq
4005ba: 8b 95 68 fe ff ff mov -0x198(%rbp),%edx
4005c0: 89 94 85 70 fe ff ff mov %edx,-0x190(%rbp,%rax,4)
int i = 0;
int some_number = 1234;
printf("begin\n");
for (i = 0; i < SIZE; ++i) {
4005c7: 83 85 64 fe ff ff 01 addl $0x1,-0x19c(%rbp)
4005ce: 83 bd 64 fe ff ff 60 cmpl $0x60,-0x19c(%rbp)
4005d5: 7e db jle 4005b2 <main+0x35>
ia[i] = some_number;
}
result = ia[SIZE-10];
4005d7: 8b 45 cc mov -0x34(%rbp),%eax
4005da: 89 85 6c fe ff ff mov %eax,-0x194(%rbp)
printf("result = %d\n", result);
4005e0: 8b 85 6c fe ff ff mov -0x194(%rbp),%eax
4005e6: 89 c6 mov %eax,%esi
4005e8: bf ba 06 40 00 mov $0x4006ba,%edi
4005ed: b8 00 00 00 00 mov $0x0,%eax
4005f2: e8 69 fe ff ff callq 400460 <printf@plt>
And a visual diff, for convenience:

I have also attached the full results of objdump -S -D for each:
source01_SIZE_96_objdump.txt
source01_SIZE_97_objdump.txt
Ok, given that, I think it's more likely it's related to the size of the stack frame.
Is this bare metal or in a VM?
Can you run with RR_LOG=ReplayTimeline and dump that output? Also, can you dump the results of rr dump around event 124?
Is this bare metal or in a VM?
Bare metal. EDIT: More specifically: Bare-metal, Ubuntu 14.04.4, Kernel 4.2.0-42-generic
you run with RR_LOG=ReplayTimeline and dump that output?
(rr) break source01.c:18
Breakpoint 1 at 0x4005e0: file source01.c, line 18.
(rr) continue
Continuing.
begin
Breakpoint 1, main () at source01.c:18
18 printf("result = %d\n", result);
(rr) watch -l result
Hardware watchpoint 2: -location result
(rr) reverse-cont
Continuing.
[ReplayTimeline] ReplayTimeline::reverse_continue from {time:181,ticks:42649,st:1,regs_ip:0x4005e0}
[ReplayTimeline] Seeked backward from {time:181,ticks:42649,st:1,regs_ip:0x4005e0} to {time:15,ticks:0,st:0,regs_ip:0x7f9d55dc82d0}
[ReplayTimeline] Creating short-checkpoint at {time:29,ticks:455,st:1,regs_ip:0x7f9d55dd2ab9}
[ReplayTimeline] Found watch break at {time:29,ticks:455,st:1,regs_ip:0x7f9d55dd2ab9}
[ReplayTimeline] Found watch break at {time:33,ticks:845,st:3,regs_ip:0x7f9d55dcf800}
[ReplayTimeline] Found breakpoint break at {time:33,ticks:869,st:3,regs_ip:0x7f9d55dd7970}
[ReplayTimeline] Found watch break at {time:50,ticks:1502,st:3,regs_ip:0x7f9d55dc84c2}
[FATAL /home/roc/rr/rr/src/ReplaySession.cc:847:check_ticks_consistency() errno: SUCCESS]
(task 18937 (rec:18931) at time 124)
-> Assertion `ticks_now == trace_ticks' failed to hold. ticks mismatch for 'SYSCALL: mprotect'; expected 19439, got 19440
Launch gdb with
gdb '-l' '-1' '-ex' 'target extended-remote :18937' /home/bddicken/.local/share/rr/latest-trace/mmap_hardlink_3_a.out
Also, can you dump the results of rr dump around event 124?
{
real_time:33704.703660 global_time:119, event:`SYSCALL: mmap' (state:ENTERING_SYSCALL) tid:18931, ticks:7256
rax:0xffffffffffffffda rbx:0x1040 rcx:0xffffffffffffffff rdx:0x3 rsi:0x2000 rdi:0x0 rbp:0x2000 rsp:0x7ffc579bdfb8 r8:0xffffffff r9:0x0 r10:0x22 r11:0x246 r12:0x1 r13:0x7ffc579be120 r14:0x2ec20bbf6d50 r15:0x7f9d55feb1c8 rip:0x7f9d55de065a eflags:0x246 cs:0x33 ss:0x2b ds:0x0 es:0x0 fs:0x0 gs:0x0 orig_rax:0x9 fs_base:0x0 gs_base:0x0
}
{
real_time:33704.703671 global_time:120, event:`SYSCALL: mmap' (state:EXITING_SYSCALL) tid:18931, ticks:7256
rax:0x7f9d55fce000 rbx:0x1040 rcx:0xffffffffffffffff rdx:0x3 rsi:0x2000 rdi:0x0 rbp:0x2000 rsp:0x7ffc579bdfb8 r8:0xffffffff r9:0x0 r10:0x22 r11:0x246 r12:0x1 r13:0x7ffc579be120 r14:0x2ec20bbf6d50 r15:0x7f9d55feb1c8 rip:0x7f9d55de065a eflags:0x246 cs:0x33 ss:0x2b ds:0x0 es:0x0 fs:0x0 gs:0x0 orig_rax:0x9 fs_base:0x0 gs_base:0x0
}
{
real_time:33704.703691 global_time:121, event:`SYSCALL: arch_prctl' (state:ENTERING_SYSCALL) tid:18931, ticks:7271
rax:0xffffffffffffffda rbx:0x1 rcx:0xffffffffffffffff rdx:0x7f9d55fcf050 rsi:0x7f9d55fce740 rdi:0x1002 rbp:0x7ffc579be170 rsp:0x7ffc579be000 r8:0xffffffff r9:0x0 r10:0x22 r11:0x246 r12:0x1 r13:0x7ffc579be120 r14:0x2ec20bbf6d50 r15:0x7f9d55feb1c8 rip:0x7f9d55dc7fde eflags:0x246 cs:0x33 ss:0x2b ds:0x0 es:0x0 fs:0x0 gs:0x0 orig_rax:0x9e fs_base:0x0 gs_base:0x0
}
{
real_time:33704.703699 global_time:122, event:`SYSCALL: arch_prctl' (state:EXITING_SYSCALL) tid:18931, ticks:7271
rax:0x0 rbx:0x1 rcx:0xffffffffffffffff rdx:0x7f9d55fcf050 rsi:0x7f9d55fce740 rdi:0x1002 rbp:0x7ffc579be170 rsp:0x7ffc579be000 r8:0xffffffff r9:0x0 r10:0x22 r11:0x246 r12:0x1 r13:0x7ffc579be120 r14:0x2ec20bbf6d50 r15:0x7f9d55feb1c8 rip:0x7f9d55dc7fde eflags:0x246 cs:0x33 ss:0x2b ds:0x0 es:0x0 fs:0x0 gs:0x0 orig_rax:0x9e fs_base:0x7f9d55fce740 gs_base:0x0
}
{
real_time:33704.703711 global_time:123, event:`SEGV_RDTSC' tid:18931, ticks:7280
rax:0xc14d78e rbx:0x0 rcx:0x0 rdx:0x2ec2 rsi:0x7f9d55fce740 rdi:0x1002 rbp:0x7ffc579be170 rsp:0x7ffc579be010 r8:0xffffffff r9:0x0 r10:0x22 r11:0x246 r12:0x1 r13:0x7ffc579be120 r14:0x0 r15:0x7f9d55feb1c8 rip:0x7f9d55dcaa7d eflags:0x10202 cs:0x33 ss:0x2b ds:0x0 es:0x0 fs:0x0 gs:0x0 orig_rax:0xffffffffffffffff fs_base:0x7f9d55fce740 gs_base:0x0
}
{
real_time:33704.703767 global_time:124, event:`SYSCALL: mprotect' (state:ENTERING_SYSCALL) tid:18931, ticks:19439
rax:0xffffffffffffffda rbx:0x7f9d557d5000 rcx:0xffffffffffffffff rdx:0x1 rsi:0x4000 rdi:0x7f9d55b8f000 rbp:0x7ffc579be000 rsp:0x7ffc579bdec8 r8:0x2c06f28 r9:0x7f9d55feb760 r10:0x7ffc579bdca0 r11:0x246 r12:0x7f9d557f43d0 r13:0x7f9d557f43b8 r14:0x3be018 r15:0x7f9d55fe89b0 rip:0x7f9d55de06a7 eflags:0x246 cs:0x33 ss:0x2b ds:0x0 es:0x0 fs:0x0 gs:0x0 orig_rax:0xa fs_base:0x7f9d55fce740 gs_base:0x0
}
{
real_time:33704.703781 global_time:125, event:`SYSCALL: mprotect' (state:EXITING_SYSCALL) tid:18931, ticks:19439
rax:0x0 rbx:0x7f9d557d5000 rcx:0xffffffffffffffff rdx:0x1 rsi:0x4000 rdi:0x7f9d55b8f000 rbp:0x7ffc579be000 rsp:0x7ffc579bdec8 r8:0x2c06f28 r9:0x7f9d55feb760 r10:0x7ffc579bdca0 r11:0x246 r12:0x7f9d557f43d0 r13:0x7f9d557f43b8 r14:0x3be018 r15:0x7f9d55fe89b0 rip:0x7f9d55de06a7 eflags:0x246 cs:0x33 ss:0x2b ds:0x0 es:0x0 fs:0x0 gs:0x0 orig_rax:0xa fs_base:0x7f9d55fce740 gs_base:0x0
}
{
real_time:33704.703801 global_time:126, event:`SYSCALL: mprotect' (state:ENTERING_SYSCALL) tid:18931, ticks:21858
rax:0xffffffffffffffda rbx:0x7f9d553b3000 rcx:0xffffffffffffffff rdx:0x1 rsi:0x1000 rdi:0x7f9d555cb000 rbp:0x7ffc579be000 rsp:0x7ffc579bdec8 r8:0x7f9d55fe89b0 r9:0x7f9d55fe89b0 r10:0x0 r11:0x246 r12:0x0 r13:0x0 r14:0x7f9d553b46f0 r15:0x7f9d55fd14f0 rip:0x7f9d55de06a7 eflags:0x246 cs:0x33 ss:0x2b ds:0x0 es:0x0 fs:0x0 gs:0x0 orig_rax:0xa fs_base:0x7f9d55fce740 gs_base:0x0
}
{
real_time:33704.703814 global_time:127, event:`SYSCALL: mprotect' (state:EXITING_SYSCALL) tid:18931, ticks:21858
rax:0x0 rbx:0x7f9d553b3000 rcx:0xffffffffffffffff rdx:0x1 rsi:0x1000 rdi:0x7f9d555cb000 rbp:0x7ffc579be000 rsp:0x7ffc579bdec8 r8:0x7f9d55fe89b0 r9:0x7f9d55fe89b0 r10:0x0 r11:0x246 r12:0x0 r13:0x0 r14:0x7f9d553b46f0 r15:0x7f9d55fd14f0 rip:0x7f9d55de06a7 eflags:0x246 cs:0x33 ss:0x2b ds:0x0 es:0x0 fs:0x0 gs:0x0 orig_rax:0xa fs_base:0x7f9d55fce740 gs_base:0x0
}
{
real_time:33704.703832 global_time:128, event:`SYSCALL: mprotect' (state:ENTERING_SYSCALL) tid:18931, ticks:23861
rax:0xffffffffffffffda rbx:0x7f9d555d1000 rcx:0xffffffffffffffff rdx:0x1 rsi:0x1000 rdi:0x7f9d557d3000 rbp:0x7ffc579be000 rsp:0x7ffc579bdec8 r8:0x7f9d55fe89b0 r9:0x7f9d55fe89b0 r10:0x0 r11:0x246 r12:0x0 r13:0x0 r14:0x7f9d555d15b8 r15:0x7f9d55fd1000 rip:0x7f9d55de06a7 eflags:0x246 cs:0x33 ss:0x2b ds:0x0 es:0x0 fs:0x0 gs:0x0 orig_rax:0xa fs_base:0x7f9d55fce740 gs_base:0x0
}
{
real_time:33704.703845 global_time:129, event:`SYSCALL: mprotect' (state:EXITING_SYSCALL) tid:18931, ticks:23861
rax:0x0 rbx:0x7f9d555d1000 rcx:0xffffffffffffffff rdx:0x1 rsi:0x1000 rdi:0x7f9d557d3000 rbp:0x7ffc579be000 rsp:0x7ffc579bdec8 r8:0x7f9d55fe89b0 r9:0x7f9d55fe89b0 r10:0x0 r11:0x246 r12:0x0 r13:0x0 r14:0x7f9d555d15b8 r15:0x7f9d55fd1000 rip:0x7f9d55de06a7 eflags:0x246 cs:0x33 ss:0x2b ds:0x0 es:0x0 fs:0x0 gs:0x0 orig_rax:0xa fs_base:0x7f9d55fce740 gs_base:0x0
}
Thanks! Mysterious ... we'll have to take some shots in the dark.
If you do run instead of reverse-continue, you should see that watchpoint being hit several times corresponding to the hits in the ReplayTimeline log. It would be worth getting a stack trace for each of those, with a disassembly around each instruction where the watchpoint was hit.
Another thing to try would be doing rr record --checksum=on-all-events and then rr replay --checksum=on-all-events and see if that produces anything interesting.
If you do
runinstead ofreverse-continue, you should see that watchpoint being hit several times corresponding to the hits in the ReplayTimeline log.
I don't really understand. Doesn't executing run just re-start the execution from the beginning? How will that cause the watchpoint to be hit? I tried doing so and didn't see it get hit at all.
Another thing to try would be doing
rr record --checksum=on-all-eventsand thenrr replay --checksum=on-all-eventsand see if that produces anything interesting.
I tried this, and the run looks the the same as it did without checksums :(
I can reproduce this.
Unfortunately I haven't made much progress here. The only interesting thing I've noticed is that one of the writes to the watchpoint comes from a syscall result (specifically a read that fills a buffer which includes the address that ultimately becomes result). That write is not reported by gdb. It happens some time before things go off the rails and we hit the ticks mismatch though.
Although it may not be being reported because the value did not change.
I am not very familiar with rr/gdb internals, so not sure how much of a help I can be here. If there is anything I can do to help, let me know.
Doesn't executing run just re-start the execution from the beginning?
Yes
How will that cause the watchpoint to be hit?
That memory location will be written a few times during the course of execution.
I tried doing so and didn't see it get hit at all.
Odd. Maybe if you try taking the address of result manually and then watch -l *(int*)<addr>?
@khuey can you verify using checksums that the memory state before the final, fatal replay step is exactly right? Another thing you can do is patch ReplayTimeline so that when we do the final, fatal replay step, we single-step through some number of instructions and then do a continue. Knowing the minimum execution segment that must be executed continuously to trigger the divergence would be helpful.
Is this still an issue with a recent rr?