STL icon indicating copy to clipboard operation
STL copied to clipboard

`<stacktrace>`: investigate sporadic failure of `CaptureStackBackTrace`

Open achabense opened this issue 2 years ago • 7 comments

When trying to solve #3859, I find that stacktrace::current can fail sporadically.

Repro (benchmark)
#include <benchmark/benchmark.h>
#include <iostream>
#include <stacktrace>

using namespace std;

#pragma optimize("", off)
stacktrace get_current(int extra_depth) {
    if (extra_depth > 0) {
        return get_current(extra_depth - 1);
    }
    else {
        return stacktrace::current(1);
    }
}
#pragma optimize("", on)

void BM_bef(benchmark::State& state) {
    const int extra_depth = state.range(0);
    for (auto _ : state) {
        stacktrace cur = get_current(extra_depth);
        if (cur.size() < extra_depth) {
            cout << extra_depth << "->" << cur.size() << endl;
            terminate();
        }
    }
}

BENCHMARK(BM_bef)->Arg(0)->Arg(100)->Arg(150)->Arg(300);

BENCHMARK_MAIN();
Potential result

image

image

After investigation, it turns out the failure is not due to memory exception in current. The function rely on CaptureStackBackTrace to catch frames, which just can fail sporadically: https://github.com/microsoft/STL/blob/f51733ca5352c712165e46fdbc5d0d395971f0e7/stl/src/stacktrace.cpp#L252

Repro (CaptureStackBackTrace)
#include <Windows.h>
#include <iostream>

void* frames[1024];

#pragma optimize("", off)
long catch_some_frames(int extra_depth) {
    if (extra_depth > 0) {
        return catch_some_frames(extra_depth - 1);
    }
    else {
        return CaptureStackBackTrace(0, 1024, frames, nullptr);//sporadic crash: "300->0"
    }
}
#pragma optimize("", on)

int main() {
    for (;;) {
        auto caught = catch_some_frames(300);
        if (caught < 300) {
            std::cout << 300 << "->" << caught << std::endl;
            terminate();
        }
        else {
            std::cout << "*";
        }
    }
}
Potential result

image image

CaptureStackBackTrace is a macro for RtlCaptureStackBackTrace. As to it randomly returning 0, I find these similar issues: https://github.com/milostosic/MTuner/issues/66 https://developercommunity.visualstudio.com/t/capturestackbacktrace-randomly-fails-after-initial/1383213 And there might be other problems with it: https://developercommunity.visualstudio.com/t/passing-framestoskip-greater-than-254-to-rtlcaptur/548661 Both the sporadic fail and skip-clamp behavior is not documented in its (documentation) (2). And its API design is the root case for basic_stacktrace::current's efficiency problem.

We need to investigate what's going on with CaptureStackBackTrace when it returns 0. Also, we might be better off exploring other ways to capture stack frames, including implementing our own capture function in STL library. If possible, this will also give a chance to solve #3859 as a by-catch.

achabense avatar Jul 19 '23 16:07 achabense

OS-37394205 "RtlCaptureStackBackTrace returns empty back trace" was wontfixed.

Perhaps we need (ugh) retry logic, if it simply returns 0 with a low failure rate.

StephanTLavavej avatar Jul 27 '23 01:07 StephanTLavavej

Test
#define _CRT_SECURE_NO_WARNINGS
#include <Windows.h>
#include <stdio.h>

#pragma optimize("", off)
long catch_some_frames(int extra_depth) {
    static void* frames[4096];
    if (extra_depth > 0) {
        return catch_some_frames(extra_depth - 1);
    }
    else {
        int retry = 0;
        while (!CaptureStackBackTrace(0, 4096, frames, nullptr)) {
            ++retry;
        }
        return retry;
    }
}
#pragma optimize("", on)

int main() {
    srand(0);
    FILE* fp = fopen("log.txt", "a");
    int _0s = 0;
    for (int i = 0; i < 2000; i++) {
        int retry = catch_some_frames(rand() % 1000);
        if (retry) {
            if (_0s) {
                fprintf(fp, "%d", _0s);
            }
            fprintf(fp, " (%d) ", retry);
            _0s = 0;
        }
        else {
            ++_0s;
        }
    }
    if (_0s) {
        fprintf(fp, "%d", _0s);
    }
    fputc('\n', fp);
}

It seems that the failure has a low rate, and typically happens early. I haven't yet seen '2' when testing.

Results of 100 runs (notice that there are a lot of identical/similar results)
4 (1) 10 (1) 1984
18 (1) 1981
1 (1) 2 (1) 10 (1) 1984
1 (1) 2 (1) 10 (1) 1984
18 (1) 1981
1 (1) 1998
15 (1) 2 (1) 1981
18 (1) 1981
4 (1) 10 (1) 2 (1) 1981
1 (1) 2 (1) 10 (1) 1984
1 (1) 2 (1) 10 (1) 1984
18 (1) 1981
4 (1) 10 (1) 2 (1) 1981
18 (1) 1981
4 (1) 10 (1) 2 (1) 1981
15 (1) 2 (1) 1981
4 (1) 10 (1) 2 (1) 1981
1 (1) 2 (1) 10 (1) 1984
1 (1) 2 (1) 10 (1) 1984
4 (1) 10 (1) 1984
15 (1) 2 (1) 1981
4 (1) 10 (1) 1984
4 (1) 10 (1) 2 (1) 1981
18 (1) 1981
4 (1) 10 (1) 2 (1) 1981
1 (1) 2 (1) 10 (1) 1984
1 (1) 2 (1) 10 (1) 1984
1 (1) 2 (1) 1995
18 (1) 1981
1 (1) 2 (1) 10 (1) 1984
4 (1) 10 (1) 2 (1) 1981
1 (1) 2 (1) 10 (1) 1984
1 (1) 1998
1 (1) 1998
1 (1) 1998
1 (1) 1998
16 (1) 1 (1) 1981
1 (1) 1998
1 (1) 1998
1 (1) 2 (1) 10 (1) 1984
4 (1) 10 (1) 2 (1) 1981
4 (1) 10 (1) 2 (1) 1981
4 (1) 10 (1) 2 (1) 1981
18 (1) 1981
1 (1) 1998
1 (1) 2 (1) 10 (1) 1984
1 (1) 1998
4 (1) 10 (1) 2 (1) 1981
1 (1) 1998
1 (1) 2 (1) 10 (1) 1984
4 (1) 10 (1) 1984
4 (1) 10 (1) 2 (1) 1981
4 (1) 10 (1) 1984
1 (1) 2 (1) 1995
1 (1) 2 (1) 1995
1 (1) 1998
1 (1) 2 (1) 10 (1) 1984
4 (1) 10 (1) 2 (1) 1981
4 (1) 10 (1) 2 (1) 1981
4 (1) 10 (1) 1984
1 (1) 2 (1) 10 (1) 1984
4 (1) 10 (1) 2 (1) 1981
1 (1) 2 (1) 10 (1) 1984
4 (1) 10 (1) 2 (1) 1981
4 (1) 10 (1) 1984
18 (1) 1981
4 (1) 10 (1) 1984
4 (1) 10 (1) 1984
1 (1) 1998
1 (1) 2 (1) 10 (1) 1984
1 (1) 2 (1) 10 (1) 1984
18 (1) 1981
1 (1) 2 (1) 10 (1) 1984
18 (1) 1981
1 (1) 2 (1) 10 (1) 1984
4 (1) 10 (1) 2 (1) 1981
4 (1) 10 (1) 2 (1) 1981
18 (1) 1981
1 (1) 2 (1) 10 (1) 1984
1 (1) 2 (1) 10 (1) 1984
15 (1) 2 (1) 1981
18 (1) 1981
1 (1) 2 (1) 10 (1) 1984
1 (1) 1998
4 (1) 10 (1) 2 (1) 1981
1 (1) 1998
1 (1) 1998
4 (1) 10 (1) 1984
18 (1) 1981
16 (1) 1 (1) 1981
18 (1) 1981
4 (1) 10 (1) 1984
1 (1) 1998
1 (1) 2 (1) 10 (1) 1984
1 (1) 2 (1) 1995
4 (1) 10 (1) 2 (1) 1981
1 (1) 2 (1) 10 (1) 1984
1 (1) 1998
4 (1) 10 (1) 2 (1) 1981
18 (1) 1981

(What did they say in OS-37394205? I'm getting 401 - Uh-oh, you do not have access.)

achabense avatar Jul 27 '23 16:07 achabense

Yeah, that's the Windows-internal bug database, apologies for not mentioning that. The bug report essentially described what we've observed; there was no further discussion when it was wontfixed :crying_cat_face:

StephanTLavavej avatar Jul 27 '23 19:07 StephanTLavavej

So, when RtlCaptureStackBackTrace returns 0, it may due to:

  1. A sporadic failure happens.
  2. FramesToSkip is larger than the number of actual frames.
  3. FramesToSkip is larger than an undocumented value.

Are there even ways to detect whether the failure happens? (We can set FramesToSkip to 0 to avoid 2 and 3, but this means the hash functionality will no longer be available)

achabense avatar Jul 28 '23 10:07 achabense

OS-37394205 "RtlCaptureStackBackTrace returns empty back trace" was wontfixed.

Perhaps this needs to be revised as the impact of the issue grows with the STL feature addition, and it cannot be reliably fixed by the caller.

AlexGuteniev avatar Aug 11 '23 09:08 AlexGuteniev

  1. FramesToSkip is larger than an undocumented value.

Notice that this is another, independent problem. It is confirmable by the following code. If the sporadic failure in RtlCaptureStackBackTrace is to be fixed, I think this should be fixed as well.

#undef NDEBUG
#include <Windows.h>
#include <iostream>
#include <cassert>
void* frames[1024];

#pragma optimize("", off)
long catch_some_frames(int extra_depth) {
    if (extra_depth > 0) {
        return catch_some_frames(extra_depth - 1);
    }
    else {
        return CaptureStackBackTrace(/*skip=*/257, 1024, frames, nullptr);
    }
}
#pragma optimize("", on)

int main() {
    for (;;) {
        auto caught = catch_some_frames(300);
        // assert caught != 0 if the sporadic failure didn't happen. however, due to the issue:
        assert(caught == 0);
    }
}

achabense avatar Aug 12 '23 00:08 achabense

Reported as one of DevCom-10692305 problems

AlexGuteniev avatar Jun 27 '24 07:06 AlexGuteniev

In case it helps with prioritization: this is affecting me in real shipping applications. I'm going to have to move away from std::stacktrace to some inline assembly, which is a bit disappointing.

It was particularly time-consuming and frustrating to track down as for whatever reason, this does not seem to be a problem when a debugger is attached.

fredemmott avatar Jan 16 '25 22:01 fredemmott