Key Metrics for Tracing Systems
Hello,
not an issue, but rather a question. Could people share some numbers on what tracing bandwidths they observe, in GBytes/hour peak/average? Is there something like an average/typical span size per application/per server? What are key metrics that really matter for a tracing system? Are there some taxonomies for applications, e.g. lyft/envoy typically creates 12kb/trace while a typical web application generates ~30kb/trace?
I think this is hard to tell. The span size depends on the tags and logs you collect (tagged by users, not collected automatically by the tracing system). In my tracing system (commercial edition), the span size changes from scenario to scenario, even when tracing the same service. Same story for trace size.
I think the major question is how much overhead you are willing to accept when you are doing tracing. In mine, 5%-15% CPU cost is the hard limit, so we did everything we could to make sure we stayed within it.
@wu-sheng - interesting metric. 15% meaning 15% overhead on the application it traces? i.e. if the app uses 60% of 8 CPUs, app + tracing uses 70%?
@lookfwd Yes, that is my story. :) You should choose your own limit based on your demands.
@lookfwd sorry for the delay, I missed this the first time around :-/
There is huge variation on this front. I have seen plenty of "real" production environments with relatively low data volumes, e.g., public companies that generate on the order of 5-10MB/sec of tracing data globally and without sampling. I have also seen plenty of equally "real" production environments that generate vast amounts of tracing data. E.g., Google recently cited (publicly) the fact that, globally, they serve 10s of billions of requests per second. At hundreds of bytes per span, that's well over 1TB/sec of trace data. Yikes.
I usually assume that a Span takes up 100-500 bytes when all is said and done, but that makes lots of assumptions.
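As a back-of-the-envelope check on the numbers above (span size and request rate here are illustrative assumptions taken from this thread, not measurements):

```python
# Rough estimate of global tracing data volume.
# Both constants are illustrative assumptions from the discussion above.
BYTES_PER_SPAN = 300        # mid-range of the 100-500 byte guess
SPANS_PER_SECOND = 10e9     # "10s of billions of requests per second"

bytes_per_second = SPANS_PER_SECOND * BYTES_PER_SPAN
terabytes_per_second = bytes_per_second / 1e12

print(f"{terabytes_per_second:.1f} TB/sec of trace data")  # 3.0 TB/sec
```

So even at the low end of the per-span estimate, an unsampled global trace stream at that request rate lands comfortably over 1TB/sec.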
Interesting. Makes perfect sense! For me, a taxonomy might cover a) capabilities of tracing systems, b) cost of implementation, c) latency, and d) the technologies one can use for the implementation.

The bottom layer is the basic stuff, i.e. a few bytes with ids and timestamps going from one service to another. Those have to be stored reliably and in real time. You don't want to lose that data in case of a crash, since it will certainly help you understand what happened. This means that you likely have to put it on some efficient IPC (socket, pipe, shared memory) ASAP. You can use that data to reconstruct a basic diagram of the trace.
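A sketch of what such a bottom-layer record could look like: a fixed-size, ids-and-timestamps-only span that is cheap to serialize and push over local IPC the moment it closes. The field layout and names are hypothetical, not from any real tracer.

```python
# Fixed-size "bottom layer" span record: ids and timestamps only.
# Layout is a hypothetical example, not any real tracer's wire format.
import struct

# trace_id, span_id, parent_id, start_us, end_us -- five unsigned 64-bit ints
SPAN_FORMAT = struct.Struct("<QQQQQ")  # 40 bytes per span

def encode_span(trace_id, span_id, parent_id, start_us, end_us):
    """Pack a span into a compact binary record, ready for a socket/pipe."""
    return SPAN_FORMAT.pack(trace_id, span_id, parent_id, start_us, end_us)

def decode_span(buf):
    """Unpack a 40-byte record back into its five fields."""
    return SPAN_FORMAT.unpack(buf)

record = encode_span(0xABC, 1, 0, 1_000_000, 1_000_500)
print(len(record))          # 40 -- small enough to push over IPC per-span
print(decode_span(record))  # (2748, 1, 0, 1000000, 1000500)
```

At 40 bytes a span, even writing every record synchronously to a socket or shared-memory ring stays cheap, which is what makes the "never lose the crash trail" property feasible for this layer.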
The middle layer has significantly more data. It includes operation names, get/post, arguments etc., stuff that can be used to aid basic debugging and interesting aggregations in terms of latencies. This data doesn't really need to be real time. It might need to be sampled since there might be lots of it. It can be flushed e.g. every few seconds to a central system via sockets or files. If it's lost in case of a crash, that's bad, but not that bad, since you can likely recover all the data that led to the crash from other nodes.
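The sample-then-batch behavior of this middle layer could look roughly like the following. The sampling rate, flush interval, and class/field names are all illustrative assumptions:

```python
# Sketch of the "middle layer": richer span data, probabilistically
# sampled and flushed in batches every few seconds. Rates and names
# are illustrative, not from any real tracing system.
import random
import time

class MiddleLayerBuffer:
    def __init__(self, sample_rate=0.1, flush_interval=5.0):
        self.sample_rate = sample_rate      # keep ~10% of spans
        self.flush_interval = flush_interval  # seconds between flushes
        self.buffer = []
        self.last_flush = time.monotonic()

    def record(self, span):
        # Drop most spans up front; losing one batch on crash is tolerable here.
        if random.random() < self.sample_rate:
            self.buffer.append(span)
        if time.monotonic() - self.last_flush >= self.flush_interval:
            self.flush()

    def flush(self):
        batch, self.buffer = self.buffer, []
        self.last_flush = time.monotonic()
        # A real system would write the batch to a socket or file;
        # printing stands in for that here.
        print(f"flushing {len(batch)} spans")

buf = MiddleLayerBuffer(sample_rate=0.5)
for i in range(10):
    buf.record({"op": "GET /users", "latency_ms": i})
buf.flush()
```

The key trade-off is the one described above: a few seconds of buffered, sampled data can vanish on a crash, but the bottom-layer ids from other nodes still let you reconstruct what led up to it.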
The top layer might have tons of application-specific data that can be used along with information from all the other layers to do complex predictive analytics, enable automation, and do very detailed debugging. This is similar to logging, just in a way that can be put back together hierarchically to form a full trace. For this layer one very likely needs a "big data" batch processing system to index and analyse the data. Data might be left on individual servers and be recalled on demand. This layer could be implemented on top of something like Kibana or Splunk, and will almost certainly use files. Again, data loss isn't a huge problem for this layer.
Any thoughts?