opencensus-specs icon indicating copy to clipboard operation
opencensus-specs copied to clipboard

proposal: Better default sampler

Open semistrict opened this issue 7 years ago • 8 comments

I think the default sampler of 1/10k is causing a number of problems.

First, users who want to initially start doing tracing want to be able to see traces without rolling it out to production. In this case, they usually set the sampler to AlwaysSample.

Then, sometimes they commit this and forget to change it leading to unintentionally starting traces for all requests. In a high-traffic system, this can be very costly.

My initial proposal for discussion would be that we change the default sampler so that it is suitable for both development and production. For example, we could rate limit sampling up to 1 per second and thereafter, sample 1 in 1000 (say) up to 5 per second.

This should allow people to see traces during development and testing (all requests will be traces if the service sees less than 1qps) and should still be fine in production since we limit ourselves to 5 per second.

semistrict avatar Aug 09 '18 20:08 semistrict

agree

codefromthecrypt avatar Aug 10 '18 08:08 codefromthecrypt

/cc @bogdandrutu

semistrict avatar Aug 10 '18 17:08 semistrict

Changing the default is a breaking change which we may not be able to do that easily. I understand the problem and I think the current best solution is to add a qps based sampler and suggest that in our examples instead of always sample.

What do you think? @adriancole @ramonza

bogdandrutu avatar Aug 10 '18 18:08 bogdandrutu

Well we have a well-defined mechanism for making breaking API changes in Go. It seems like we should also have a way to make breaking non-API changes like this.

I definitely agree that it needs to be done slowly over time and can't just be changed outright and that the examples are the first thing we should update.

semistrict avatar Aug 10 '18 20:08 semistrict

I like the idea but also think it should be done carefully (no surprises to existing users).

Is there an explicit flag/mode that can enable this behavior for people who're trying to demo or debug? that way we don't surprise the existing users

shahprit avatar Aug 10 '18 21:08 shahprit

We are using adaptive sampling as a default in Application Insights. It adjusts sampling percentage so the rate of produced telemetry will be limited to 5 (default) per second. Since pricing is per volume of telemetry - this approach allows to estimate the cost of monitoring and keep it (virtually) constant on any high load including spikes. It also works great in dev environment.

More dev notes here.

Will this one be a better alternative?

SergeyKanzhelev avatar Sep 25 '18 05:09 SergeyKanzhelev

Is this the same as RateLimit sampling described here: https://github.com/census-instrumentation/opencensus-specs/blob/master/trace/Sampling.md

I agree that something like this is a better default.

semistrict avatar Sep 25 '18 21:09 semistrict

@ramonza idea of adaptive sampling is easier. Adaptive sampling is basically a wrapper on probability sampled that adjusts probability sampler config based on previous time interval. All decisions of adaptive sampler are delayed, but it make the spans easier to analyze. Each time frame sampling probability was constant and you can gain up numbers to estimate the real number of spans were seen in that interval.

SergeyKanzhelev avatar Sep 25 '18 22:09 SergeyKanzhelev