debug: replace RtlCaptureStackBackTrace (which was spuriously failing) with a new implementation which uses RtlVirtualUnwind instead
This PR implements std.debug.walkStackWindows as a replacement for RtlCaptureStackBackTrace.
Context
As part of debugging #12703 I was investigating why stack traces weren't appearing, and realized a simple test program that just panicked would not always print the stack trace.
Note that I'm using zig2 here as I can't build stage3 due to #12703.
Example program:
fn bar() void {
@panic("test");
}
fn foo() void {
bar();
}
pub fn main() void {
foo();
}
>zig2 build-exe panic.zig -target native-windows-msvc
>panic.exe
thread 24112 panic: test
>panic.exe
thread 6572 panic: test
c:\cygwin64\home\kcbanner\temp\zig_panic\panic.zig:2:5: 0x7ff7c6e31d7f in bar (panic.exe.obj)
@panic("test");
^
c:\cygwin64\home\kcbanner\temp\zig_panic\panic.zig:6:8: 0x7ff7c6e315ce in foo (panic.exe.obj)
bar();
^
c:\cygwin64\home\kcbanner\temp\zig_panic\panic.zig:10:8: 0x7ff7c6e3128e in main (panic.exe.obj)
foo();
^
C:\cygwin64\home\kcbanner\kit\zig\lib\std\start.zig:345:41: 0x7ff7c6e31247 in WinStartup (panic.exe.obj)
std.debug.maybeEnableSegfaultHandler();
^
???:?:?: 0x7ffac2397033 in ??? (???)
???:?:?: 0x7ffac2ce2650 in ??? (???)
>panic.exe
thread 3864 panic: test
>panic.exe
thread 19984 panic: test
The reason for this is that RtlCaptureStackBackTrace spuriously returns 0. I couldn't determine the root cause of the spurious failures, but it seemed that others had run into this (https://developercommunity.visualstudio.com/t/capturestackbacktrace-randomly-fails-after-initial/1383213) and had gotten around it by unwinding the stack via the RtlLookupFunctionEntry/RtlVirtualUnwind calls.
I referenced these docs for the unwind process:
- https://docs.microsoft.com/en-us/cpp/build/exception-handling-x64?view=msvc-170#unwind-procedure
- http://www.nynaeve.net/?p=101
Supporting changes:
- added
RtlCaptureContext,RtlLookupFunctionEntry,RtlVirtualUnwindand supporting types - fix alignment of
CONTEXTstructs to match winnt.h as required byRtlCaptureContext(fxsaveinstr)- If a
CONTEXTwas put onto the stack and happened to not be aligned to 16, then thefxsaveinstruction would cause an access violation
- If a
- windows aarch64: fix __chkstk being defined twice if libc is not linked on msvc
- I'm not familiar with this system, but I was hitting this compile error with
-target aarch64-windows-msvc
- I'm not familiar with this system, but I was hitting this compile error with
Testing
I'm unsure how to test on windows aarch64. I don't have the hardware - and while the code compiles against -target aarch64-windows-msvc, I can't be sure it functions. It seems like qemu for windows doesn't support the same userspace emulation as linux for running aarch64 binaries.
When running zig2 build test -Dskip-release, I get failures such as:
compiler [51/1153] compile_error_in_inline_fn_call_fixed (stage2, aarch64-macos) [2/2] update [2... error.Unexpected NTSTATUS=0xc0000056
c:\cygwin64\home\kcbanner\kit\zig\lib\std\debug.zig:608:31: 0x7ff77d15ff6f in writeCurrentStackTraceWindows__anon_57536 (test.exe.obj)
const n = walkStackWindows(addr_buf[0..]);
^
c:\cygwin64\home\kcbanner\kit\zig\lib\std\debug.zig:552:45: 0x7ff77d15fe8d in writeCurrentStackTrace__anon_57535 (test.exe.obj)
return writeCurrentStackTraceWindows(out_stream, debug_info, tty_config, start_addr);
^
However, these also seem present on master, just with the old stack trace function. This seems like #12420?
I've made my best efforts to follow the style/contributing guidelines and but please let me know if I've missed something as this is my first PR here. I'm also happy to defer this until I can figure out #12703 and properly run the tests.
When it comes to testing, currently we don't support running stage2 incremental linking tests on Windows, so you will need to pass -Dskip-stage2-tests to your test run. I'm in the process of finishing up enough of the in-house COFF/PE linker that soon this will not be needed so stay tuned!
Thanks for the feedback! I'll take a look at those ReactOS implementations
Ah, good catches - fixed. I'll work on the proposal we talked about for the COFF unwind info once this is merged 👍
I believe something has regressed here with LLVM 15. I was testing this against latest master and the 64 bit path is broken both on master (old method) and with these changes. On this branch, RtlLookupFunctionEntry returns null.
Using a zig re-implementation of RtlVirtualUnwind that I've been working on, I can see that the coff.DirectoryEntry.EXCEPTION directory has a size of zero. This directory in the COFF is supposed to contain entries which have the start/end address of each function, so that you can map an IP back to a function (this is what RtlLookupFunctionEntry does).
My guess is something changed with the LLVM 15 update that caused the linker not to include this directory anymore (and this is a debug build). In LLV14, this section was present.
Regardless, this PR isn't ready to merge until I figure out why this directory isn't there anymore.
This is confirmed by dumpbin:
The LLVM15 compiled version has a missing .pdata section (which corresponds to the exception directory entry, which is also 0).
exe compiled in debug mode with with zig that used LLVM 14.0.6:
Dump of file comments.exe
PE signature found
File Type: EXECUTABLE IMAGE
FILE HEADER VALUES
8664 machine (x64)
7 number of sections
630B005A time date stamp Sun Aug 28 01:42:50 2022
0 file pointer to symbol table
0 number of symbols
F0 size of optional header
22 characteristics
Executable
Application can handle large (>2GB) addresses
OPTIONAL HEADER VALUES
20B magic # (PE32+)
14.00 linker version
72C00 size of code
44200 size of initialized data
0 size of uninitialized data
1230 entry point (0000000140001230) wWinMainCRTStartup
1000 base of code
140000000 image base (0000000140000000 to 00000001400BDFFF)
1000 section alignment
200 file alignment
6.00 operating system version
0.00 image version
6.00 subsystem version
0 Win32 version
BE000 size of image
400 size of headers
0 checksum
3 subsystem (Windows CUI)
8160 DLL characteristics
High Entropy Virtual Addresses
Dynamic base
NX compatible
Terminal Server Aware
1000000 size of stack reserve
1000 size of stack commit
100000 size of heap reserve
1000 size of heap commit
0 loader flags
10 number of directories
0 [ 0] RVA [size] of Export Directory
B2598 [ 3C] RVA [size] of Import Directory
0 [ 0] RVA [size] of Resource Directory
B8000 [ 2610] RVA [size] of Exception Directory
0 [ 0] RVA [size] of Certificates Directory
BD000 [ 170] RVA [size] of Base Relocation Directory
B2538 [ 1C] RVA [size] of Debug Directory
0 [ 0] RVA [size] of Architecture Directory
0 [ 0] RVA [size] of Global Pointer Directory
B2510 [ 28] RVA [size] of Thread Storage Directory
0 [ 0] RVA [size] of Load Configuration Directory
0 [ 0] RVA [size] of Bound Import Directory
B26E0 [ 108] RVA [size] of Import Address Table Directory
0 [ 0] RVA [size] of Delay Import Directory
0 [ 0] RVA [size] of COM Descriptor Directory
0 [ 0] RVA [size] of Reserved Directory
exe compiled with latest zig master, LLVM 15.0.0:
Dump of file comments_llvm15.exe
PE signature found
File Type: EXECUTABLE IMAGE
FILE HEADER VALUES
8664 machine (x64)
6 number of sections
632FCC2B time date stamp Sat Sep 24 23:34:03 2022
0 file pointer to symbol table
0 number of symbols
F0 size of optional header
22 characteristics
Executable
Application can handle large (>2GB) addresses
OPTIONAL HEADER VALUES
20B magic # (PE32+)
14.00 linker version
73800 size of code
43600 size of initialized data
0 size of uninitialized data
1280 entry point (0000000140001280) wWinMainCRTStartup
1000 base of code
140000000 image base (0000000140000000 to 00000001400BCFFF)
1000 section alignment
200 file alignment
6.00 operating system version
0.00 image version
6.00 subsystem version
0 Win32 version
BD000 size of image
400 size of headers
0 checksum
3 subsystem (Windows CUI)
8160 DLL characteristics
High Entropy Virtual Addresses
Dynamic base
NX compatible
Terminal Server Aware
1000000 size of stack reserve
1000 size of stack commit
100000 size of heap reserve
1000 size of heap commit
0 loader flags
10 number of directories
0 [ 0] RVA [size] of Export Directory
B774F [ 3C] RVA [size] of Import Directory
0 [ 0] RVA [size] of Resource Directory
0 [ 0] RVA [size] of Exception Directory
0 [ 0] RVA [size] of Certificates Directory
BC000 [ 18C] RVA [size] of Base Relocation Directory
B76E8 [ 1C] RVA [size] of Debug Directory
0 [ 0] RVA [size] of Architecture Directory
0 [ 0] RVA [size] of Global Pointer Directory
B76C0 [ 28] RVA [size] of Thread Storage Directory
0 [ 0] RVA [size] of Load Configuration Directory
0 [ 0] RVA [size] of Bound Import Directory
B78A8 [ 118] RVA [size] of Import Address Table Directory
0 [ 0] RVA [size] of Delay Import Directory
0 [ 0] RVA [size] of COM Descriptor Directory
0 [ 0] RVA [size] of Reserved Directory
The regression above is fixed with https://github.com/ziglang/zig/pull/12959
First of, I'd like to apologise for taking so long to take another look at your PR! In general, I believe we are ready to merge except the merge conflicts, and so, would you mind rebasing on latest master so that we can follow up with a merge?
No problem, I'll get this updated soon 👍
Ok, rebased onto master 👍
Personally I feel like it is an improvement, because it did solve my own issue, which was that I wasn't getting stack traces reliably. I understand the syscall may be a deal breaker though.
There is some more context in this thread around why I opted to use the kernel32 call: https://discord.com/channels/605571803288698900/1017808096682311740/1018889803128918056
Essentially, I did write an implementation of RtlVirtualUnwind in zig, that parses all the unwind table opcodes from the unwind info to figure out how to walk back up the stack. I referenced the ReactOS implementation as part of that. I managed to implement that completely without any syscalls, except for a critical section call to lock around the list of loaded modules from the PEB.
The implementation (WIP: working on x64, but not cleaned up): https://gist.github.com/kcbanner/61058fba639fca1a6bda13773489a484
So while that does work, you still have to make the kernel32 call, otherwise you risk reading the list while it's being changed. Other than that, it can walk the stack completely without any syscalls. If we are ok with just not locking that critical section, then this could be done entirely in zig without syscalls. There would be potential for crashes if the module list is modified while it's being iterated though. With that in mind (and the use of a bunch of undocumented data structures), I stopped with the custom implementation since we needed a kernel32 call anyway.
I'm happy to explore that further, but we'd still need that kernel32 call to grab the lock. If we ignore the lock then we can't safely iterate that linked list, and if we cache results then there is a TOCTOU issue.
From what I could tell the ReactOS version of RtlCaptureStackBackTrace uses RtlWalkFrameChain (https://doxygen.reactos.org/dd/df8/dll_2ntdll_2rtl_2libsupp_8c.html#a703c006cc016ffffe4d2982c373711d4), which just uses the frame pointer to walk the state. So, that works on x86 and on x64 if there was no extra stack space allocated. However, if there was - then you need to use the unwind info to figure out what the prologue did in order to reverse it.
I think the Microsoft impl is probably different than this, since it did work sometimes. So I could run through the disasm of that to see what's actually going on in order to debug the original spurious failure - but I didn't do that investigation yet.
Thank you for this writeup. I appreciate your patience with me here.
In NtDll.dll, I see the following functions:
-
RtlCaptureContext -
RtlCaptureContext2 -
RtlLookupFunctionEntry -
RtlVirtualUnwind -
RtlVirtualUnwind2 -
RtlEnterCriticalSection -
RtlLeaveCriticalSection
After reading your writeup, and making note of these functions, I'm in favor of your changes, but I still want to emphasize #1840. Our goal is to transition to NtDll being the only dependency of the standard library. It seems compatible with your changes. Does that sound reasonable to you?
No problem, and yes that make sense to me!
I actually had misunderstood and didn't realize kernel32 wrapped ntdll, I thought they were independent, since the MSDN docs always refer to kernel32 for things like RtlVirtualUnwind. Reading https://github.com/ziglang/zig/issues/1840 now everything makes more sense.
Updated to call via ntdll instead of kernel32.
Should I keep the definitions of these functions in kernel32.zig for users to call, or I remove them now that they exist in ntdll.zig as well?
Should I keep the definitions of these functions in kernel32.zig for users to call, or I remove them now that they exist in ntdll.zig as well?
This question can be left for #4426 to solve
Great work! Thanks for the follow ups.