
admission: additional observability

Open irfansharif opened this issue 3 years ago • 5 comments

In order of importance and/or done-ness:

  • [x] #87883;
  • [x] #87424;
  • [ ] #88076;
  • [ ] Metric capturing compaction bandwidth out of L0 (which is used to generate write tokens in admission control);
  • [ ] We could log the {max,min} slot count and {max,min} runnable goroutine count every second, or export metrics for them. In internal experimentation we often find ourselves reaching for these numbers.
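
For the last item, a rough sketch of a per-second {min,max} tracker might look like the following. Everything here is hypothetical: `minMax`, `record` and `flush` are made-up names, and `runtime.NumGoroutine()` (the total goroutine count) is only a stand-in, since the slot count and runnable goroutine count live inside admission control and are not available from the standard library.

```go
package main

import (
	"fmt"
	"runtime"
	"sync"
	"time"
)

// minMax tracks the smallest and largest value recorded since the last flush.
// Hypothetical helper for illustration; not an existing CockroachDB type.
type minMax struct {
	mu     sync.Mutex
	lo, hi int64
	seen   bool
}

// record folds one sample into the current window.
func (m *minMax) record(v int64) {
	m.mu.Lock()
	defer m.mu.Unlock()
	if !m.seen || v < m.lo {
		m.lo = v
	}
	if !m.seen || v > m.hi {
		m.hi = v
	}
	m.seen = true
}

// flush returns the window's min and max and starts a new window.
// ok is false if nothing was recorded since the previous flush.
func (m *minMax) flush() (lo, hi int64, ok bool) {
	m.mu.Lock()
	defer m.mu.Unlock()
	lo, hi, ok = m.lo, m.hi, m.seen
	m.lo, m.hi, m.seen = 0, 0, false
	return lo, hi, ok
}

func main() {
	var goroutines minMax
	sample := time.NewTicker(10 * time.Millisecond) // roughly 100 Hz sampling
	defer sample.Stop()
	report := time.NewTicker(time.Second)
	defer report.Stop()
	for {
		select {
		case <-sample.C:
			// Stand-in signal: total goroutines, NOT runnable goroutines.
			goroutines.record(int64(runtime.NumGoroutine()))
		case <-report.C:
			if lo, hi, ok := goroutines.flush(); ok {
				fmt.Printf("goroutines over last second: min=%d max=%d\n", lo, hi)
			}
			return // one report is enough for the sketch
		}
	}
}
```

The same struct could back either a log line or a pair of gauges; which is preferable is left open here.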

Jira issue: CRDB-16641

irfansharif avatar Jun 10 '22 19:06 irfansharif

Perhaps we should also export Go's /sched/latencies:seconds to get visibility into Go scheduler latencies.

This has proven extremely valuable to do in internal AC-related experiments (re: #75066). https://github.com/irfansharif/cockroach/tree/220614.export-tracing is a prototype that grafts together the prometheus-compatible data from https://github.com/prometheus/client_golang/blob/main/prometheus/go_collector_latest.go, and looks as follows:

image

Through it we were able to correlate foreground latency spikes to Go scheduler latency spikes.

irfansharif avatar Jun 21 '22 17:06 irfansharif

From an internal doc, re: "Information needed from Go runtime":

Runnable info: Minimally, we need the number of runnable goroutines, sampled at
some reasonably high rate (100 Hz?). It would be preferable to get a delta value
of total duration spent in Runnable and Running state since the last sample (or
a cumulative number, from which we can compute the delta). The duration is less
sensitive to brief spikes in runnable goroutines that quickly get scheduled,
which do not necessarily represent scarcity of CPU resources.

IIUC, this is exactly the total sum of everything captured within /sched/latencies:seconds.

irfansharif avatar Jul 04 '22 15:07 irfansharif

Another item: exporting latency histograms segmented by the priority levels seen by admission control, to capture which classes of requests are observing queuing and by how much.

We need this to make sense of mixed workload behavior (e.g. the conversation in https://cockroachlabs.slack.com/archives/C038JEXC5AT/p1658247509643359?thread_ts=1657630075.576439&cid=C038JEXC5AT).

sumeerbhola avatar Jul 19 '22 17:07 sumeerbhola

Adding Andrew here too to pick through the list within the next two weeks; it'll be a good way to get our feet wet.

irfansharif avatar Aug 24 '22 14:08 irfansharif

@andrewbaptist: I'm working on exporting Go's /sched/latencies:seconds as a histogram. Want to take on the remaining items?

irfansharif avatar Sep 12 '22 18:09 irfansharif

Discussed offline with @sumeerbhola. Closing this issue, as the changes we want to make already have separate issues tracked in the backlog with priorities assigned. We don't want to invest time in the rest.

aadityasondhi avatar Dec 12 '23 19:12 aadityasondhi