Starfish Tracing Model
This RFC proposes a new tracing model for Sentry's performance product to better address future product improvements. The goal of this model is to allow storing entire traces without gaps, support dynamic sampling, index spans, and extract metrics from spans pre-sampling. This is an evolution of the current transaction-based approach.
I've not been able to review this in depth, but I wanted to escalate the convo we're having in Slack.
I do not believe a session should be a trace, or vice versa. To me these are different constructs. I, as a user, expect a trace to represent a request cycle. That is, when I take an action, a trace occurs, and it ends when that action fully completes. That said, there are problems in all models.
To illustrate what I mean, I want to call out several cases that begin root traces:
- I curl an API endpoint; it starts when the edge receives the request
- I load a SPA; it starts when the browser initiates the request
- I load Slack on my desktop; it starts when I load it
- I start a match in League of Legends - or maybe it's when I open the game client?
When do these traces end?
- The API endpoint is straightforward. Commonly it'll end when the response is returned (unless some async work happens)
- The SPA could end when the rendering completes; however, there may still be things going on. What happens with that instrumentation?
- Slack could end when the app has fully loaded, but what about all the other tasks going on? Should background tasks be their own trace? Then they're sessions again. Should whoever's pushing data to Slack initiate the trace? You lose context.
My POV is that all models are trash. You lose data if you make traces behave as expected, which is what I originally described here. Traces were not designed for long-lived applications and, frankly speaking, they do not work for them.
Should we throw it all away, call it something else, and just go with it? For example, we could call it Session Tracing and explicitly say that a trace will generally live for an entire session. This has its own problems:
I load a SPA, or Slack, and keep it open for days. All those requests are a single trace.
- How do I sample? Sampling an entire session is prohibitively expensive, thus a session still needs bounds.
- How do I make sense of the trace data? There are infinite segments/transactions inside of it. Segments solve for this, but it's still an immense amount of complexity.
- How do I make sense of interactions?
I'm sure there are more things I've forgotten to write up here, but the problem does not seem to have a right answer.
If we're really considering a new model, IMHO this would be a good time to tackle the time synchronization problem. Timestamps aren't enough, because each clock that generates them is synchronized independently. It's all too easy for a "distributed trace" to show nonsensical timestamps when viewed as a whole, such as server responses that complete before the request has been sent. I have some ideas here we could experiment with. Would this be welcome as part of this new model?
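To make the problem concrete, here is a minimal sketch of one possible approach (an NTP-style offset estimate from a single request/response exchange), not necessarily what is being proposed; all names and the timestamp units are assumptions for illustration.

```ts
// Hypothetical sketch: estimate the client/server clock offset from the four
// timestamps of one request/response exchange, then shift server-side span
// timestamps into the client's clock so a "distributed trace" can't show a
// response completing before the request was sent.

interface ExchangeTimestamps {
  clientSend: number;    // t0, client clock (ms since epoch)
  serverReceive: number; // t1, server clock
  serverSend: number;    // t2, server clock
  clientReceive: number; // t3, client clock
}

// offset > 0 means the server clock is ahead of the client clock.
function estimateClockOffset(t: ExchangeTimestamps): number {
  return ((t.serverReceive - t.clientSend) + (t.serverSend - t.clientReceive)) / 2;
}

interface Span {
  startTimestamp: number;
  endTimestamp: number;
}

// Re-express a server span in the client's clock before rendering the trace.
function alignToClientClock(span: Span, offset: number): Span {
  return {
    startTimestamp: span.startTimestamp - offset,
    endTimestamp: span.endTimestamp - offset,
  };
}
```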
Sorry for the rather drive-by comments, but since I have a vested interest in better UI tracing, a couple of thoughts.
I want to note from experience that most SLO initiatives fail to meaningfully improve reliability in organizations. This is generally because SLOs are measured over individual services and endpoints, and that aggregation doesn't connect meaningfully to user experience. That's really the same problem @dcramer is calling out here.
There's a lot of opportunity to build a better model, but I think you'd have to think in somewhat platform-specific terms. If I live just in React in my head (though I think conceptually this fits the whole web platform), the meaningful measurements I would want for my users are:
- Route/url change
- Interactions that load new stuff inline (infinite scroll, drawers, modals, etc)
- Events to server (like form posts)
- Events from server (server sent events)
When you think about it, nearly all of these events will have some 'main' component/DOM element associated with them. Figuring out what that main component is is hard because they nest, but logically:
- router loads
- New suspense or error boundary loads (well-designed modals and drawers will have these; for less well-designed stuff this is tough)
- Button or link clicks/form posts
All mark the start of a new transaction and most likely the end of the previous one. They also likely represent a logical 'feature' in terms an average product manager would understand.
Server-sent events are a wild card because they represent an almost parallel thread to your main UI thread, but I do feel they could almost always be looked at separately, as they are async and at least shouldn't represent real-time interaction (the chat scenario being the weird counterexample that must always exist - but then each chat exchange can likely be thought of as its own transaction as well).
This might be possible through auto-instrumentation if you are willing to go into the frameworks (historically y'all are). I think you end up with sort of a 'transaction type' and a 'transaction boundary', where the 'transaction type' is 'route | suspense | button/form' and the boundary is the name of the associated thing (DOM ID/name, component name, etc.).
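A rough sketch of what that schema could look like, assuming made-up names that are not part of any Sentry SDK:

```ts
// Hypothetical shape for the 'transaction type' + 'transaction boundary' idea
// described above; all identifiers here are illustrative.

type TransactionType = "route" | "suspense" | "interaction"; // route | suspense | button/form

interface UiTransaction {
  type: TransactionType;
  // The "boundary" is the name of the associated thing: a route pattern,
  // a Suspense/error boundary's component name, or a DOM id/name for a
  // button or form.
  boundary: string;
  startTimestamp: number;
  endTimestamp?: number; // open until the next boundary starts, per the comment above
}

// Example: submitting a checkout form would end the current transaction and
// start a new one keyed on the form's DOM name.
const example: UiTransaction = {
  type: "interaction",
  boundary: "form#checkout",
  startTimestamp: Date.now(),
};
```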
You might find the OpenTelemetry Client Instrumentation SIG group interesting: https://docs.google.com/document/d/16Vsdh-DM72AfMg_FIt9yT9ExEWF4A_vRbQ3jRNBe09w/edit#heading=h.yplevr950565
On a quick scan I didn't see Sentry mentioned in it yet. I think it would be great to consider joining the group and solving this together.
@maggiepint You can have composite SLOs and give different weights to the underlying ones; you can set the compositeWeight in an OpenSLO definition, which allows for an end-to-end user journey.
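A minimal illustration of the weighting idea, assuming a simple weighted average of per-step attainment; the names and numbers are made up for this sketch and it is not a literal OpenSLO document:

```ts
// Illustrative only: combine per-step SLO attainment into a composite score
// for an end-to-end user journey, weighting the steps that matter most.

interface StepSlo {
  name: string;
  attainment: number; // measured attainment over the window, e.g. 0.9991
  weight: number;     // relative importance of this step in the journey
}

function compositeAttainment(steps: StepSlo[]): number {
  const totalWeight = steps.reduce((sum, s) => sum + s.weight, 0);
  return steps.reduce((sum, s) => sum + s.attainment * (s.weight / totalWeight), 0);
}

// e.g. a checkout journey where the payment step carries the most weight
const journey = compositeAttainment([
  { name: "browse", attainment: 0.9995, weight: 1 },
  { name: "add-to-cart", attainment: 0.999, weight: 2 },
  { name: "payment", attainment: 0.997, weight: 4 },
]);
```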
What happened to this?
I experimented with visuals, which wound up similar, but not identical, to https://brycemecum.com/2023/03/31/til-mermaid-tracing/, which links here.
This was opened in early 2023 and seems inactive, but doesn't really say why.