[Objective] Very Flexible Runtime Data Collection

Open viega opened this issue 2 years ago • 0 comments

In our quest for better observability of software, Chalk currently allows very lightweight, limited data collection in production, at process startup and at a regular interval. However, there are some challenges there:

  1. Noise. Current heartbeat reports send the same metadata on every heartbeat. This needs to change; for instance, @nettrino has proposed an invariant-based system that we are considering, in which you configure keys to be reported only when their values change.
  2. Missing data. Some data that can be worth collecting about an app or its environment tends to be fairly transient, for instance inbound network connections. That should be collectable.
  3. Timeliness. For some data, periodic reporting isn't appropriate, and it would be better to get it quickly as it is produced. For instance, we heard from serverless developers how big a pain it is to get debug logs: a long wait, then lots of grepping.
  4. Post-deployment querying / configuring. A lot of questions people have around their apps can be best served with lightweight querying. Look at the success of OSQuery, despite the fact that it is fairly heavyweight... it works pretty well, with good controls in place to manage performance requirements. Still, it's not appropriate at the application level, especially in containerized or serverless environments.
  5. Deeper introspection into the app. The more observability people get, the more value they see from it. At the system level, eBPF has shown the value of deep (but safe) introspection, but it is not viable in environments like serverless, Fargate, or really any container runtime based on true virtualization. Taking Log4J as an example, everyone saw they had it on images all over the place, but what they really wanted to know was, "Is it used?" That can be cheap and easy to answer at the application level (if you've also solved item 4 above).
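
To make item 1 concrete, here is a minimal sketch of change-only reporting in Python. It is not Chalk's implementation or @nettrino's actual proposal; the `ChangeOnlyReporter` class, its `diff` method, and the key names in the usage note are all hypothetical, chosen only to illustrate the idea of reporting a key when its value differs from the last reported one.

```python
from typing import Any, Dict


class ChangeOnlyReporter:
    """Hypothetical helper: tracks the last reported value per key."""

    def __init__(self) -> None:
        # Last value that was actually reported for each key.
        self._last: Dict[str, Any] = {}

    def diff(self, snapshot: Dict[str, Any]) -> Dict[str, Any]:
        """Return only the keys that are new or whose values changed."""
        changed = {
            key: value
            for key, value in snapshot.items()
            if key not in self._last or self._last[key] != value
        }
        # Remember what was reported so the next heartbeat can skip it.
        self._last.update(changed)
        return changed
```

Under these assumptions, the first heartbeat would report every key, a second heartbeat with identical metadata would report nothing, and a later heartbeat would carry only the keys that changed (say, a hypothetical `open_connections` count). The sketch does not handle keys that disappear between heartbeats; a real design would need to decide whether that counts as a change.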

Of course, in production infrastructure, performance and cost are paramount for all the items above. "Do no harm" should definitely be our mantra, with good controls to give people confidence on those issues.

viega, Oct 02 '23 17:10