rr segfaults when recording an execution

Open dnivra opened this issue 4 years ago • 57 comments

I am trying to record a program's execution using rr, and it looks like rr is encountering a segmentation fault. The command to record the execution is something like rr record -n <program> <arguments>. Aside from rr: Saving execution to trace directory <trace_output_dir>., nothing useful is printed to the terminal (all other output is logging from the program). As suggested in the rr chatroom, I tried raising the log level with RR_LOG (something like RR_LOG=warn rr record -n <program> <arguments>), but there was no useful output. I even tried setting the log level to debug, but got nothing.
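
For reference, the invocations look roughly like this (the program and its arguments are placeholders):

# plain recording; -n disables rr's syscall buffering
rr record -n <program> <arguments>

# same recording with rr's log level raised (debug produced nothing useful either)
RR_LOG=warn rr record -n <program> <arguments>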

Since there was no useful log output, I attached gdb to rr and got the following traceback:

#0  0x00007f92b0df455b in kill () at ../sysdeps/unix/syscall-template.S:78
#1  0x00000000005134ac in rr::RecordCommand::run(std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > >&) ()
#2  0x0000000000498163 in main ()
#3  0x00007f92b0dd50b3 in __libc_start_main (main=0x497e10 <main>, argc=0x7, argv=0x7fffc2c759e8, init=<optimized out>, fini=<optimized out>, rtld_fini=<optimized out>,
    stack_end=0x7fffc2c759d8) at ../csu/libc-start.c:308
#4  0x0000000000498579 in _start ()

Unfortunately, I cannot build and run a debug build of rr: some of rr's build dependencies conflict with other essential packages on my system. I did not find an installable debug build of rr online; if there is one, I can try running that. I am using rr 5.4.0 on Ubuntu 20.04.2 LTS with Linux kernel 5.8.0-59. If any other information would be useful, I can provide it.

dnivra avatar Jun 28 '21 00:06 dnivra

I noticed just now that attaching gdb to rr (I do it at a point when the program expects input) and then continuing does not seem to trigger a segmentation fault. Running rr without gdb attached does trigger one: I see the message segmentation fault rr record -n <program> <args>.

dnivra avatar Jun 28 '21 00:06 dnivra

Can you strace -o <path/to/a/file.log> rr ...? Does that still crash?
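
I.e., something like this, reusing your record invocation from above (the log path is just an example):

strace -o /tmp/rr-record.log rr record -n <program> <arguments>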

khuey avatar Jun 28 '21 00:06 khuey

Yes it does crash when being run inside strace.

dnivra avatar Jun 28 '21 00:06 dnivra

Here's some output from the end of the strace log. If you can tell me what else would be interesting, I can search through the strace logs and share more.

rename("/home/dnivra/.local/share/rr/python3-2/incomplete", "/home/dnivra/.local/share/rr/python3-2/version") = 0
close(7)                                = 0
munmap(0x7fc5900a7000, 200704)          = 0
munmap(0x7fc5900d8000, 200704)          = 0
munmap(0x7fc58b5ff000, 10489856)        = 0
munmap(0x7fc590b3e000, 3149824)         = 0
close(12)                               = 0
close(11)                               = 0
close(8)                                = 0
getpid()                                = 15040
openat(AT_FDCWD, "/proc/15040/maps", O_RDONLY) = 3
fstat(3, {st_mode=S_IFREG|0444, st_size=0, ...}) = 0
read(3, "00400000-00801000 r-xp 00000000 "..., 1024) = 1024
read(3, "0-7fc578000000 ---p 00000000 00:"..., 1024) = 1024
read(3, "1 30939243                   /li"..., 1024) = 1024
read(3, "nu/libc-2.31.so\n7fc591035000-7fc"..., 1024) = 1024
read(3, "7fc591076000 r--p 00018000 fd:01"..., 1024) = 1024
read(3, "1219000 r--p 00024000 fd:01 3093"..., 1024) = 674
read(3, "", 1024)                       = 0
close(3)                                = 0
rt_sigaction(SIGSEGV, {sa_handler=SIG_DFL, sa_mask=[SEGV], sa_flags=SA_RESTORER|SA_RESTART, sa_restorer=0x7fc590e8d210}, {sa_handler=SIG_DFL, sa_mask=[], sa_flags=SA_RESTORER, sa_restorer=0x7fc59104e420}, 8) = 0
prctl(PR_SET_DUMPABLE, SUID_DUMP_DISABLE) = 0
getpid()                                = 15040
kill(15040, SIGSEGV)                    = 0
--- SIGSEGV {si_signo=SIGSEGV, si_code=SI_USER, si_pid=15040, si_uid=1000} ---
+++ killed by SIGSEGV +++

dnivra avatar Jun 28 '21 00:06 dnivra

Everything between here and the last ptrace syscall would be interesting.

khuey avatar Jun 28 '21 00:06 khuey

Looks to me like the tracee is dying with a SIGSEGV and rr is echoing that. Does the trace that's generated replay properly? Does it die with a SIGSEGV?
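
To replay, something like the following should work, where <trace_output_dir> is the directory rr printed when the recording started (with no argument, rr replays the most recent trace):

rr replay <trace_output_dir>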

khuey avatar Jun 28 '21 00:06 khuey

Here's everything from the last ptrace to the end of the log:

ptrace(PTRACE_CONT, 15794, NULL, 0)     = 0
--- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_DUMPED, si_pid=15794, si_uid=1000, si_status=SIGSEGV, si_utime=0, si_stime=0} ---
wait4(-1, [{WIFSIGNALED(s) && WTERMSIG(s) == SIGSEGV && WCOREDUMP(s)}], WSTOPPED|__WALL, NULL) = 15794
--- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_KILLED, si_pid=15052, si_uid=1000, si_status=SIGSEGV, si_utime=717, si_stime=312} ---
close(9)                                = 0
close(16)                               = 0
wait4(-1, [{WIFSIGNALED(s) && WTERMSIG(s) == SIGSEGV && WCOREDUMP(s)}], WSTOPPED|__WALL, NULL) = 15052
munmap(0x7fc591219000, 4096)            = 0
close(13)                               = 0
close(10)                               = 0
close(14)                               = 0
futex(0x26cadec, FUTEX_WAKE_PRIVATE, 2147483647) = 1
futex(0x26cad98, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x7fc590b3d9d0, FUTEX_WAIT, 15041, NULL) = 0
close(3)                                = 0
futex(0x26cb02c, FUTEX_WAKE_PRIVATE, 2147483647) = 8
futex(0x26cafd8, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0x26cb028, FUTEX_WAKE_PRIVATE, 2147483647) = 7
futex(0x26cafd8, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x7fc58b5fe9d0, FUTEX_WAIT, 15042, NULL) = 0
munmap(0x7fc59033d000, 8392704)         = 0
futex(0x7fc5895fa9d0, FUTEX_WAIT, 15046, NULL) = 0
munmap(0x7fc58adfe000, 8392704)         = 0
munmap(0x7fc58a5fd000, 8392704)         = 0
munmap(0x7fc589dfc000, 8392704)         = 0
munmap(0x7fc5895fb000, 8392704)         = 0
close(4)                                = 0
futex(0x26cbaec, FUTEX_WAKE_PRIVATE, 2147483647) = 1
futex(0x7fc5875f69d0, FUTEX_WAIT, 15050, NULL) = 0
munmap(0x7fc588dfa000, 8392704)         = 0
close(5)                                = 0
futex(0x26cbd4c, FUTEX_WAKE_PRIVATE, 2147483647) = 1
futex(0x7fc586cdb9d0, FUTEX_WAIT, 15051, NULL) = 0
munmap(0x7fc5885f9000, 8392704)         = 0
close(6)                                = 0
openat(AT_FDCWD, "/dev/urandom", O_RDONLY) = 3
read(3, "D;\360E\262\n\367\262\3057\22\4\2458\224\7", 16) = 16
close(3)                                = 0
write(7, "0D\1P\4\2\21\7\r\1\0370R\2\5\2\1\23\361\4\2021\1\302N\360\377\377\377\377\361\26"..., 854) = 854
rename("/home/dnivra/.local/share/rr/python3-2/incomplete", "/home/dnivra/.local/share/rr/python3-2/version") = 0
close(7)                                = 0
munmap(0x7fc5900a7000, 200704)          = 0
munmap(0x7fc5900d8000, 200704)          = 0
munmap(0x7fc58b5ff000, 10489856)        = 0
munmap(0x7fc590b3e000, 3149824)         = 0
close(12)                               = 0
close(11)                               = 0
close(8)                                = 0
getpid()                                = 15040
openat(AT_FDCWD, "/proc/15040/maps", O_RDONLY) = 3
fstat(3, {st_mode=S_IFREG|0444, st_size=0, ...}) = 0
read(3, "00400000-00801000 r-xp 00000000 "..., 1024) = 1024
read(3, "0-7fc578000000 ---p 00000000 00:"..., 1024) = 1024
read(3, "1 30939243                   /li"..., 1024) = 1024
read(3, "nu/libc-2.31.so\n7fc591035000-7fc"..., 1024) = 1024
read(3, "7fc591076000 r--p 00018000 fd:01"..., 1024) = 1024
read(3, "1219000 r--p 00024000 fd:01 3093"..., 1024) = 674
read(3, "", 1024)                       = 0
close(3)                                = 0
rt_sigaction(SIGSEGV, {sa_handler=SIG_DFL, sa_mask=[SEGV], sa_flags=SA_RESTORER|SA_RESTART, sa_restorer=0x7fc590e8d210}, {sa_handler=SIG_DFL, sa_mask=[], sa_flags=SA_RESTORER, sa_restorer=0x7fc59104e420}, 8) = 0
prctl(PR_SET_DUMPABLE, SUID_DUMP_DISABLE) = 0
getpid()                                = 15040
kill(15040, SIGSEGV)                    = 0
--- SIGSEGV {si_signo=SIGSEGV, si_code=SI_USER, si_pid=15040, si_uid=1000} ---
+++ killed by SIGSEGV +++

dnivra avatar Jun 28 '21 00:06 dnivra

Looks to me like the tracee is dying with a SIGSEGV and rr is echoing that. Does the trace that's generated replay properly? Does it die with a SIGSEGV?

Oh I am sorry I missed writing about this! I mentioned this in the chat but forgot to include it here. When I run the replay, I see the following output:

[FATAL /home/roc/rr/rr/src/ReplaySession.cc:601:enter_syscall()] Expected syscall_bp_vm to be clear but it's 15673's address space with a breakpoint at 0x7ffff7aaf3e9 while we're at 0x7ffff7ab0a8b
=== Start rr backtrace:
rr(_ZN2rr13dump_rr_stackEv+0x28)[0x56d088]
rr(_ZN2rr15notifying_abortEv+0x47)[0x56d107]
rr(_ZN2rr12FatalOstreamD2Ev+0x52)[0x4db232]
rr(_ZN2rr13ReplaySession13enter_syscallEPNS_10ReplayTaskERKNS0_15StepConstraintsE+0x189)[0x506889]
rr(_ZN2rr13ReplaySession18try_one_trace_stepEPNS_10ReplayTaskERKNS0_15StepConstraintsE+0x98)[0x509bc8]
rr(_ZN2rr13ReplaySession11replay_stepERKNS0_15StepConstraintsE+0xc8)[0x509de8]
rr(_ZN2rr14ReplayTimeline19replay_step_forwardENS_10RunCommandEl+0xc3)[0x5a8bc3]
rr(_ZN2rr9GdbServer14debug_one_stepERNS_10GdbRequestE+0x34c)[0x4f6a4c]
rr(_ZN2rr9GdbServer12serve_replayERKNS0_15ConnectionFlagsE+0x6cb)[0x5bdd2b]
rr[0x5720a6]
rr(_ZN2rr13ReplayCommand3runERSt6vectorINSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEESaIS7_EE+0x5f1)[0x55a981]
rr(main+0x353)[0x498163]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf3)[0x7f75e78cc0b3]
rr(_start+0x29)[0x498579]
=== End rr backtrace
Remote connection closed

dnivra avatar Jun 28 '21 00:06 dnivra

I should have included this in the previous message: some part of the program's execution is recorded correctly. I see some of the program's progress logs displayed. After those are replayed, the above backtrace appears: it is at the same point at which the segmentation fault occurred during the recording.

Edit: Sorry, that was wrong: the above backtrace actually appears much earlier than the point at which it segfaults during recording. I noticed that just now when comparing the progress logs from recording and replaying.

dnivra avatar Jun 28 '21 00:06 dnivra

During the recording phase the tracee segfaulted and rr is just echoing that. See line 2:

--- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_DUMPED, si_pid=15794, si_uid=1000, si_status=SIGSEGV, si_utime=0, si_stime=0} ---

The replay crash is different and more disturbing. What CPU are you using?

khuey avatar Jun 28 '21 01:06 khuey

During the recording phase the tracee segfaulted and rr is just echoing that.

Hmm, I see. Yes, I see what you are saying. Then perhaps my initial assessment that rr is segfaulting is wrong (sorry about that!). It looks like the tracee is segfaulting (prematurely?) due to rr. The backtrace displayed during replay made me think it was probably rr that segfaulted.

When I run the tracee outside of rr, it does segfault (that's the bug I set out to debug with rr's help), but at a much later stage than when it's being traced by rr: the tracee prints progress logs, and a lot more of them are displayed when it is not being traced by rr. So I guess rr is somehow changing the tracee's behaviour?

Edit: rr changing the tracee's behaviour is probably not an accurate description. This is perhaps more accurate: somehow the tracee's behaviour is different when it is being traced by rr compared to when it is not.

The replay crash is different more disturbing. What CPU are you using?

I am running an Intel Core i7-8850H CPU.

dnivra avatar Jun 28 '21 01:06 dnivra

It's certainly plausible that running under rr changes the tracee's behavior.

Is the program you're recording something you can share a trace of?

khuey avatar Jun 29 '21 02:06 khuey

Yes, I can share a trace: this is all open-source software. How do I export and share the trace? Is it enough to compress the trace directory rr creates and share the archive? Or should I share instructions and code to reproduce this?

dnivra avatar Jun 29 '21 04:06 dnivra

You should use rr pack on the trace to bundle in all the libraries, and then you can tar and compress it and share that.
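
For example, something like this, guessing the trace directory name from the strace log above:

# copy the files the trace needs (binaries, libraries) into the trace directory
rr pack ~/.local/share/rr/python3-2

# then bundle and compress it for sharing
tar -czf python3-2-trace.tar.gz -C ~/.local/share/rr python3-2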

khuey avatar Jun 29 '21 04:06 khuey

The compressed packed trace exceeds the size limit for attachments on GitHub, so I uploaded it to Google Drive. Since the archive is quite big, here is its SHA256 for verification after downloading: 0d41e7d7fdcf58516f72e4b7578dde4372eefbf1528afadaeca66d517c7fa2a8. I was able to reproduce the issue with a replay of this trace, but in case something is not right, I can take another look. A note about the tracee: it is the Python interpreter running a script which uses a library that is a mix of Python and native code.
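
To verify after downloading, something like this should do (the archive filename is a placeholder for whatever the download ends up being called):

sha256sum <archive.tar.gz>
# should print 0d41e7d7fdcf58516f72e4b7578dde4372eefbf1528afadaeca66d517c7fa2a8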

dnivra avatar Jun 29 '21 05:06 dnivra

Ok, it looks like during the replay we expected to hit a getpid() call but instead we ended up at access("/usr/gnemul/qemu-x86_64/usr/lib/locale/locale-archive", 0)

khuey avatar Jun 29 '21 15:06 khuey

And we blew way past the getpid() call too. We should have hit that at ticks=14493667 but we're getting to this access call at ticks=15629784.

khuey avatar Jun 29 '21 15:06 khuey

When we should be at the getpid() we're instead at

0x00007ffff78e1430 in tcg_temp_new_internal ()
(rr) when-ticks
Current tick: 14493667
(rr) bt
#0  0x00007ffff78e1430 in tcg_temp_new_internal ()
#1  0x00007ffff7828224 in i386_tr_init_disas_context ()
#2  0x00007ffff788abd9 in translator_loop ()
#3  0x00007ffff7842616 in gen_intermediate_code ()
#4  0x00007ffff78ab7e5 in tb_gen_code ()
#5  0x00007ffff788ee6c in cpu_exec ()
#6  0x00007ffff7813570 in cpu_loop ()
#7  0x00007ffff780a4dc in main ()

How reproducible is this? Does this die in the same way every time you make a recording?

khuey avatar Jun 29 '21 16:06 khuey

I have been able to reproduce this multiple times. And yes, each time the recording aborts with the same error at the same point in the execution, and the replays of those recordings also abort in the same manner, with the backtrace I shared above.

dnivra avatar Jun 29 '21 19:06 dnivra

Ok, can you give me steps to reproduce? I think I need to compare the execution outside of rr with the execution inside rr.

khuey avatar Jun 29 '21 19:06 khuey

Sorry about the late response. I was able to reproduce the rr recording segfault reliably, but had issues with the execution of the script without rr. I think I now have a reliable setup in which the script gives the same output and the rr issue can also be reproduced. I am seeing some issues with the script's behaviour (it uses features that are still WIP), but in this setup the script's behaviour seems consistent, so it should be sufficient for debugging rr. I have included log output from the script running with and without rr for reference.

rr-repro-2905.zip has all the files needed to reproduce this. To simplify things, a Docker container is created with everything set up. README.md has instructions for building and running the container. By default, the stable version of rr (currently 5.4.0) is installed. I built the image twice and was able to reproduce the problem both times, so I think this should be reliable enough for debugging rr. If there are any issues reproducing it, please see the Troubleshooting section in README.md. If that doesn't help, I can take a look to see what's going wrong.

dnivra avatar Jun 30 '21 00:06 dnivra

Thanks. I'm on vacation until July 6th so it might be a while before anybody looks at this.

khuey avatar Jun 30 '21 05:06 khuey

When I run the repro script under rr I get a crash in python in thread_func in libz3.so. It looks like pthread_mutex_timedlock didn't resolve correctly.

khuey avatar Jul 07 '21 03:07 khuey

That was probably fixed by 469c22059a4a1798d33a8a224457faf22b2c178c, so I'll need to update rr in the container.

khuey avatar Jul 07 '21 03:07 khuey

Alright, once I do that, when running under rr I see an error message about "Trace failed to synchronize during fast forward? You might want to unhook LinuxLoader". Replaying a trace with that failure works fine.

khuey avatar Jul 07 '21 04:07 khuey

Ah okay. That wasn't the error when I was testing it, but the particular branch has changed recently, so that might explain why the error is different. If the behaviour of the script with and without rr matches, it looks like the issue is already fixed in the development version.

A follow-up question: is a new stable release planned sometime soon, or is there a nightly build I can use?

dnivra avatar Jul 07 '21 05:07 dnivra

The behavior inside and outside rr is different.

khuey avatar Jul 07 '21 12:07 khuey

I was sort of hoping you'd have some idea what "You might want to unhook LinuxLoader" is about.

khuey avatar Jul 07 '21 13:07 khuey

Ah, I see: I misunderstood and thought things were fine when you said replaying a trace with that failure works fine.

"Trace failed to synchronize during fast forward? You might want to unhook LinuxLoader"

This error is from the framework the tracee uses for analyzing executables (in case you're interested, see http://angr.io/). It simulates (most of) the environment the analyzed executable needs to run, including (but not limited to) the loader, so that the executable can be analyzed. The error has nothing to do with the actual loader that starts up the tracee; it does look like rr is causing the tracee to behave differently.

dnivra avatar Jul 07 '21 19:07 dnivra

Right, I can't reproduce the rr crash you've mentioned.

Do you know enough about angr's internals to help me figure out why it's behaving differently under rr? :)

khuey avatar Jul 08 '21 04:07 khuey