`<stacktrace>`: investigate sporadic failure of `CaptureStackBackTrace`
When trying to solve #3859, I find that stacktrace::current can fail sporadically.
Repro (benchmark)
#include <benchmark/benchmark.h>
#include <iostream>
#include <stacktrace>
using namespace std;
#pragma optimize("", off)
stacktrace get_current(int extra_depth) {
if (extra_depth > 0) {
return get_current(extra_depth - 1);
}
else {
return stacktrace::current(1);
}
}
#pragma optimize("", on)
void BM_bef(benchmark::State& state) {
const int extra_depth = state.range(0);
for (auto _ : state) {
stacktrace cur = get_current(extra_depth);
if (cur.size() < extra_depth) {
cout << extra_depth << "->" << cur.size() << endl;
terminate();
}
}
}
BENCHMARK(BM_bef)->Arg(0)->Arg(100)->Arg(150)->Arg(300);
BENCHMARK_MAIN();
Potential result
After investigation, it turns out the failure is not due to memory exception in current. The function rely on CaptureStackBackTrace to catch frames, which just can fail sporadically:
https://github.com/microsoft/STL/blob/f51733ca5352c712165e46fdbc5d0d395971f0e7/stl/src/stacktrace.cpp#L252
Repro (CaptureStackBackTrace)
#include <Windows.h>
#include <iostream>
void* frames[1024];
#pragma optimize("", off)
long catch_some_frames(int extra_depth) {
if (extra_depth > 0) {
return catch_some_frames(extra_depth - 1);
}
else {
return CaptureStackBackTrace(0, 1024, frames, nullptr);//sporadic crash: "300->0"
}
}
#pragma optimize("", on)
int main() {
for (;;) {
auto caught = catch_some_frames(300);
if (caught < 300) {
std::cout << 300 << "->" << caught << std::endl;
terminate();
}
else {
std::cout << "*";
}
}
}
Potential result
CaptureStackBackTrace is a macro for RtlCaptureStackBackTrace. As to it randomly returning 0, I find these similar issues:
https://github.com/milostosic/MTuner/issues/66
https://developercommunity.visualstudio.com/t/capturestackbacktrace-randomly-fails-after-initial/1383213
And there might be other problems with it:
https://developercommunity.visualstudio.com/t/passing-framestoskip-greater-than-254-to-rtlcaptur/548661
Both the sporadic fail and skip-clamp behavior is not documented in its (documentation) (2).
And its API design is the root case for basic_stacktrace::current's efficiency problem.
We need to investigate what's going on with CaptureStackBackTrace when it returns 0. Also, we might be better off exploring other ways to capture stack frames, including implementing our own capture function in STL library. If possible, this will also give a chance to solve #3859 as a by-catch.
OS-37394205 "RtlCaptureStackBackTrace returns empty back trace" was wontfixed.
Perhaps we need (ugh) retry logic, if it simply returns 0 with a low failure rate.
Test
#define _CRT_SECURE_NO_WARNINGS
#include <Windows.h>
#include <stdio.h>
#pragma optimize("", off)
long catch_some_frames(int extra_depth) {
static void* frames[4096];
if (extra_depth > 0) {
return catch_some_frames(extra_depth - 1);
}
else {
int retry = 0;
while (!CaptureStackBackTrace(0, 4096, frames, nullptr)) {
++retry;
}
return retry;
}
}
#pragma optimize("", on)
int main() {
srand(0);
FILE* fp = fopen("log.txt", "a");
int _0s = 0;
for (int i = 0; i < 2000; i++) {
int retry = catch_some_frames(rand() % 1000);
if (retry) {
if (_0s) {
fprintf(fp, "%d", _0s);
}
fprintf(fp, " (%d) ", retry);
_0s = 0;
}
else {
++_0s;
}
}
if (_0s) {
fprintf(fp, "%d", _0s);
}
fputc('\n', fp);
}
It seems that the failure has a low rate, and typically happens early. I haven't yet seen '2' when testing.
Results of 100 runs (notice that there are a lot of identical/similar results)
4 (1) 10 (1) 1984
18 (1) 1981
1 (1) 2 (1) 10 (1) 1984
1 (1) 2 (1) 10 (1) 1984
18 (1) 1981
1 (1) 1998
15 (1) 2 (1) 1981
18 (1) 1981
4 (1) 10 (1) 2 (1) 1981
1 (1) 2 (1) 10 (1) 1984
1 (1) 2 (1) 10 (1) 1984
18 (1) 1981
4 (1) 10 (1) 2 (1) 1981
18 (1) 1981
4 (1) 10 (1) 2 (1) 1981
15 (1) 2 (1) 1981
4 (1) 10 (1) 2 (1) 1981
1 (1) 2 (1) 10 (1) 1984
1 (1) 2 (1) 10 (1) 1984
4 (1) 10 (1) 1984
15 (1) 2 (1) 1981
4 (1) 10 (1) 1984
4 (1) 10 (1) 2 (1) 1981
18 (1) 1981
4 (1) 10 (1) 2 (1) 1981
1 (1) 2 (1) 10 (1) 1984
1 (1) 2 (1) 10 (1) 1984
1 (1) 2 (1) 1995
18 (1) 1981
1 (1) 2 (1) 10 (1) 1984
4 (1) 10 (1) 2 (1) 1981
1 (1) 2 (1) 10 (1) 1984
1 (1) 1998
1 (1) 1998
1 (1) 1998
1 (1) 1998
16 (1) 1 (1) 1981
1 (1) 1998
1 (1) 1998
1 (1) 2 (1) 10 (1) 1984
4 (1) 10 (1) 2 (1) 1981
4 (1) 10 (1) 2 (1) 1981
4 (1) 10 (1) 2 (1) 1981
18 (1) 1981
1 (1) 1998
1 (1) 2 (1) 10 (1) 1984
1 (1) 1998
4 (1) 10 (1) 2 (1) 1981
1 (1) 1998
1 (1) 2 (1) 10 (1) 1984
4 (1) 10 (1) 1984
4 (1) 10 (1) 2 (1) 1981
4 (1) 10 (1) 1984
1 (1) 2 (1) 1995
1 (1) 2 (1) 1995
1 (1) 1998
1 (1) 2 (1) 10 (1) 1984
4 (1) 10 (1) 2 (1) 1981
4 (1) 10 (1) 2 (1) 1981
4 (1) 10 (1) 1984
1 (1) 2 (1) 10 (1) 1984
4 (1) 10 (1) 2 (1) 1981
1 (1) 2 (1) 10 (1) 1984
4 (1) 10 (1) 2 (1) 1981
4 (1) 10 (1) 1984
18 (1) 1981
4 (1) 10 (1) 1984
4 (1) 10 (1) 1984
1 (1) 1998
1 (1) 2 (1) 10 (1) 1984
1 (1) 2 (1) 10 (1) 1984
18 (1) 1981
1 (1) 2 (1) 10 (1) 1984
18 (1) 1981
1 (1) 2 (1) 10 (1) 1984
4 (1) 10 (1) 2 (1) 1981
4 (1) 10 (1) 2 (1) 1981
18 (1) 1981
1 (1) 2 (1) 10 (1) 1984
1 (1) 2 (1) 10 (1) 1984
15 (1) 2 (1) 1981
18 (1) 1981
1 (1) 2 (1) 10 (1) 1984
1 (1) 1998
4 (1) 10 (1) 2 (1) 1981
1 (1) 1998
1 (1) 1998
4 (1) 10 (1) 1984
18 (1) 1981
16 (1) 1 (1) 1981
18 (1) 1981
4 (1) 10 (1) 1984
1 (1) 1998
1 (1) 2 (1) 10 (1) 1984
1 (1) 2 (1) 1995
4 (1) 10 (1) 2 (1) 1981
1 (1) 2 (1) 10 (1) 1984
1 (1) 1998
4 (1) 10 (1) 2 (1) 1981
18 (1) 1981
(What did they say in OS-37394205? I'm getting 401 - Uh-oh, you do not have access.)
Yeah, that's the Windows-internal bug database, apologies for not mentioning that. The bug report essentially described what we've observed; there was no further discussion when it was wontfixed :crying_cat_face:
So, when RtlCaptureStackBackTrace returns 0, it may due to:
- A sporadic failure happens.
-
FramesToSkipis larger than the number of actual frames. -
FramesToSkipis larger than an undocumented value.
Are there even ways to detect whether the failure happens? (We can set FramesToSkip to 0 to avoid 2 and 3, but this means the hash functionality will no longer be available)
OS-37394205 "RtlCaptureStackBackTrace returns empty back trace" was wontfixed.
Perhaps this needs to be revised as the impact of the issue grows with the STL feature addition, and it cannot be reliably fixed by the caller.
FramesToSkipis larger than an undocumented value.
Notice that this is another, independent problem. It is confirmable by the following code. If the sporadic failure in RtlCaptureStackBackTrace is to be fixed, I think this should be fixed as well.
#undef NDEBUG
#include <Windows.h>
#include <iostream>
#include <cassert>
void* frames[1024];
#pragma optimize("", off)
long catch_some_frames(int extra_depth) {
if (extra_depth > 0) {
return catch_some_frames(extra_depth - 1);
}
else {
return CaptureStackBackTrace(/*skip=*/257, 1024, frames, nullptr);
}
}
#pragma optimize("", on)
int main() {
for (;;) {
auto caught = catch_some_frames(300);
// assert caught != 0 if the sporadic failure didn't happen. however, due to the issue:
assert(caught == 0);
}
}
Reported as one of DevCom-10692305 problems
In case it helps with prioritization: this is affecting me in real shipping applications. I'm going to have to move away from std::stacktrace to some inline assembly, which is a bit disappointing.
It was particularly time-consuming and frustrating to track down as for whatever reason, this does not seem to be a problem when a debugger is attached.