sentry-rust icon indicating copy to clipboard operation
sentry-rust copied to clipboard

fix(sentry-tracing): switch sentry spans on enter and exit

Open saiintbrisson opened this issue 1 year ago • 7 comments

I've been facing a bug with Sentry Trace where concurrent spans end up wrongly nested. At first, I thought this was some strange autogrouping bug and event sent a feedback (sorry Sentry support member 😅).

After digging a little bit, I discovered that the current tracing layer implementation sets the sentry span as active as soon as a tracing span is created, which should not be the case, as spans are only active once entered. Following the same logic, it also reverts to the parent span only when a span is closed, and not when it's exited.

A span can be entered and exited multiple times, for example, when a tracing Instrumented future polls and returns Pending. In case of concurrent Instrumented futures, the current impl ends up nesting Futures under the first created one.

sentry-tracing layer (wrong)

Screenshot 2024-09-25 at 14 22 57

tracing-opentelemetry layer (correct)

Screenshot 2024-09-25 at 14 28 11

I'd say that the best idea would be to not use sentry_core::configure_scope(|s| s.get_span()); when creating a new span to get the parent, and get it from the span extensions instead, but the current behavior might be relied upon already by someone.

saiintbrisson avatar Sep 25 '24 17:09 saiintbrisson

In case of concurrent Instrumented futures, the current impl ends up nesting Futures under the first created one.

For those cases, one should usually fork the Sentry Hub in order to have them fully isolated.

While you are right that enter/exit is a better fit, and when someone is manually hooking up spans, this distinction can lead to some confusion. However, Hub isolation is the proper solution to this. Hooking into enter/exit in order to set/unset the active Span in the Sentry scope would not fully fix the problem of concurrently running futures.

Swatinem avatar Sep 30 '24 08:09 Swatinem

Hi @saiintbrisson, are you able to solve this problem with hub isolation?

I would honestly be a bit hesitant to change how we enter and exit Sentry spans, since others might be relying on the current behavior. But, we can look into it if you are unable to find a way to work around this problem.

szokeasaurusrex avatar Oct 01 '24 12:10 szokeasaurusrex

Hi @saiintbrisson, are you able to solve this problem with hub isolation?

I would honestly be a bit hesitant to change how we enter and exit Sentry spans, since others might be relying on the current behavior. But, we can look into it if you are unable to find a way to work around this problem.

I ended up just using my fork at my company. Hub isolation is pretty verbose as I would have to bind it to every future as it currently stand. I already isolate it for my root future (I have a Axum middleware binding a hub to the handler call).

The problem arises on child futures, like join!ing two function futures. If both functions are tracing::instrumented, the first one will create a span when first polled, and when the second one is polled, its span will be considered a child to the first function, because the first span will only end after it returns Poll::Ready, which seems objectively wrong to me, as shown in the prints. That is, tracing::Span::current is not the same one as sentry::configure_scope(|s| s.get_span()), and I find this rather confusing.

Probably manually binding a hub to every call inside my join would work, but I don't want to make my coworkers handle this kind of detail during development. We have a couple of hundreds of mentions of try_join! on our code, and general code is agnostic to our telemetry stack (everything is handled through tracing).

If this happens to be a wont-fix, I'll probably just create my own tracing layer with this impl, but I do think sentry-tracing would benefit from this fix.

saiintbrisson avatar Oct 02 '24 01:10 saiintbrisson

Hub isolation is pretty verbose as I would have to bind it to every future as it currently stand.

Yes that is a bit unfortunate, but ultimately the correct thing to do. Apart from proper span hierarchy, this also relates to Sentrys Scope and other concepts, which could go out of sync if multiple concurrent futures are messing with them.

Swatinem avatar Oct 02 '24 07:10 Swatinem

Hub isolation is pretty verbose as I would have to bind it to every future as it currently stand.

Yes that is a bit unfortunate, but ultimately the correct thing to do. Apart from proper span hierarchy, this also relates to Sentrys Scope and other concepts, which could go out of sync if multiple concurrent futures are messing with them.

I don't think I understand how Scopes are to be used. I currently create a new Hub, create a transaction and set it to the created Hub, that I than use to bind it to the handler call (one scope per request). The way you described it, it seems like every future should have its own Scope?

In any case, I expected sentry-tracing to behave like what tracing-subscriber defines, thus my confusion as someone familiar with other impls (like tracing-opentelemetry). Because I use the tracing layer, I was wishing I could forget sentry Scopes altogether in favor of tracing spans/events/instrumentation. The current behavior should probably be described in the README for the crate: https://github.com/getsentry/sentry-rust/tree/master/sentry-tracing#tracing-spans, it's not clear that running these futures concurrently will provoke unexpected (faulty?) results.

Let me know if you folks come to any conclusion about this PR. For now, I'll create my own layer impl and move the fn get_span_idto #687.

Thanks for your time and keep up the good work, we've been enjoying Sentry a lot around here :).

saiintbrisson avatar Oct 02 '24 16:10 saiintbrisson

All in all, thank you for the great feedback, this is very valuable.

I don't think I understand how Scopes are to be used. I currently create a new Hub, create a transaction and set it to the created Hub, that I than use to bind it to the handler call (one scope per request). The way you described it, it seems like every future should have its own Scope?

I might have added a bit to the confusion here. The Rust SDK still has a couple of concepts that have since been removed from the wider "Sentry unified API". For example, there are no Hubs anymore, and no Transactions either. There is just Scopes and Spans. Though the Rust SDK is unfortunately quite behind on all of this :-(

Anyhow, in essence you want every future to have its own Scope (which in the Rust SDK is still named Hub) as you want to have isolation between them, for example:

  • When you set tags like file_id, you want those to be unique to each concurrently running future, as any error being logged should have the correct file_id.
  • Similarly, log messages which are being converted to breadcrumbs should also be isolated between concurrent futures. Otherwise a log line like "Starting to process file X" would be intermingled with other breadcrumbs unrelated to the operation.

The #[tracing] macro conveniently wraps the resulting future in Instrumented as you noted, which makes sure to push/pop the currently active span to a thread local. Sentry is completely outside of that, so you would have to do the equivalent for the Sentry Hub manually. Which then does the same management of a thread-local with appropriate state.

As I mentioned, the Rust SDK is quite a bit behind in terms of features. By now, the Sentry JS SDK has fully embraced opentelemetry, the SDK and instrumentation side that is. I could maybe imagine Sentry doing something similar in the sense of not duplicating all this span tracking, etc, but fully embracing tracing and build on top of that. Though that would be a ton of work. :-D

Swatinem avatar Oct 03 '24 07:10 Swatinem

I think this is useful even outside of Futures.

My application doesn't do any async but whenever we create a new thread we also create a new tracing span for that thread. The span is created before the new thread (to inherit the current span as the parent) but then only enter it inside the new thread. If I am understanding this issue correctly, currently any events between the creation of the span and the entering of the span will have the wrong span information. While I understand this is a change in behavior I think the current behavior is fairly unexpected and not intuitive.

nrxus avatar Oct 03 '24 20:10 nrxus

I just noticed that my PR https://github.com/getsentry/sentry-rust/pull/700 does the same thing

So I might as well close it, but I would like to point few obvious things with the way sentry does things. Sentry hubs are stored in static variables so that's actual root of the issue for tracing implementation

Moving scope change to enter/exit will solve majority of issues aside from multi threaded environments. To work it around you can also just pin tasks to threads (work stealing runtimes were a mistake) or use bind_hub https://docs.rs/sentry/0.34.0/sentry/trait.SentryFutureExt.html#method.bind_hub

But you cannot make tracing span works without changing scope setting to the on enter/exit

DoumanAsh avatar Oct 30 '24 09:10 DoumanAsh

I just noticed that my PR #700 does the same thing

So I might as well close it, but I would like to point few obvious things with the way sentry does things. Sentry hubs are stored in static variables so that's actual root of the issue for tracing implementation

Moving scope change to enter/exit will solve majority of issues aside from multi threaded environments. To work it around you can also just pin tasks to threads (work stealing runtimes were a mistake) or use bind_hub docs.rs/sentry/0.34.0/sentry/trait.SentryFutureExt.html#method.bind_hub

But you cannot make tracing span works without changing scope setting to the on enter/exit

I'm with you. Though this might not fix everything, it surely fixes most of it. I've been using my fork at my employer and haven't really had any issues up to now.

saiintbrisson avatar Oct 30 '24 11:10 saiintbrisson

I looked over code again and how bind_hub works And I think tracing layer could use the same Switch guard https://github.com/getsentry/sentry-rust/blob/master/sentry-core/src/hub_impl.rs#L24

Hub could be stored together with span and on every enter, you could create guard, and then drop guard on every exit

This should ensure tracing span will retain the same hub across any thread, if I understood its purpose properly

TL;DR Extend SentrySpanData as follows:

pub(super) struct SentrySpanData {
    pub(super) sentry_span: TransactionOrSpan,
    parent_sentry_span: Option<TransactionOrSpan>,
    creation_hub: Arc<sentry_core::Hub>,
    //Set to Some() on enter, set to None in on_exit
    switch_guard: Option<sentry_core::hub_impl::SwitchGuard>
}

DoumanAsh avatar Oct 30 '24 12:10 DoumanAsh

@saiintbrisson Can you maybe rebase your work on current master? I'm thinking to use patched version of sentry and maybe will make PR against yours to add Switch hub guard

DoumanAsh avatar Oct 31 '24 00:10 DoumanAsh

I created prototype of my idea: https://github.com/saiintbrisson/sentry-rust/pull/1

It will capture Hub on span creation and create switch guard on every enter of the span while dropping guard in exit

@Swatinem I believe it should solve all possible concerns with this PR

DoumanAsh avatar Oct 31 '24 03:10 DoumanAsh

Codecov Report

Attention: Patch coverage is 79.41176% with 7 lines in your changes missing coverage. Please review.

Project coverage is 72.82%. Comparing base (605d36d) to head (8f139e3).

Additional details and impacted files
@@            Coverage Diff             @@
##           master     #686      +/-   ##
==========================================
+ Coverage   72.79%   72.82%   +0.02%     
==========================================
  Files          66       66              
  Lines        7804     7830      +26     
==========================================
+ Hits         5681     5702      +21     
- Misses       2123     2128       +5     

codecov[bot] avatar Nov 28 '24 09:11 codecov[bot]

@saiintbrisson Can you apply comments on your PR?

DoumanAsh avatar Nov 28 '24 10:11 DoumanAsh

CI checks are also still unhappy :-(

Swatinem avatar Nov 28 '24 10:11 Swatinem

use of deprecated type alias std::panic::PanicInfo: use PanicHookInfo instead

This is due to Rust team introducing unnecessary lints with each version

Do you want to silence it? Using PanicHookInfo requires 1.81

Nevermind, we need to rebase this branch on master https://github.com/getsentry/sentry-rust/commit/9b82c90505763b253864e937aa6f6f0abae712a9

DoumanAsh avatar Nov 28 '24 10:11 DoumanAsh

Hi! Yeah, I'll push fixes in a few hours, thanks for the patience 😅

saiintbrisson avatar Nov 28 '24 14:11 saiintbrisson

@Swatinem I think this is good to go now. Thanks again for your time and the team for the amazing product.

And thanks @DoumanAsh for the support and getting it to the final solution.

saiintbrisson avatar Nov 28 '24 15:11 saiintbrisson

Thank you for your patience and persistence here :-)

Swatinem avatar Nov 29 '24 08:11 Swatinem