apm [APM] Service instance runtime metrics

Summary of the problem (If there are multiple problems or use cases, prioritize them)

Currently APM agents collect various system and runtime metrics, which could help detect resource saturation or configuration issues. Visualizing these metrics for every agent type would make this information actionable during performance issues troubleshooting.

User stories

As App Ops, I need to correlate service performance with system and runtime performance.
As App Ops, I need to be able to identify when a specific instance is performing differently from the majority of other instances.
As App Ops, I need to quickly identify which runtime metrics are trending outside their normal range at the same time as the service is experiencing issues.

List known (technical) restrictions and requirements

Has to work with different agent types and take into account that each runtime has its own specific runtime metrics.

If in doubt, don’t hesitate to reach out to the #observability-design Slack channel.

Jul 21 '20 22:07 alex-fedotyev

Pinging @elastic/observability-design (design)

Jul 21 '20 22:07 elasticmachine

We have three issues for runtime metrics:

Design issue: https://github.com/elastic/apm/issues/301 (this)
Meta issue (?): https://github.com/elastic/apm/issues/224
Implementation issue: https://github.com/elastic/kibana/issues/63573

Are all of them needed? I'm not sure what the purpose of the meta issue is.

Jul 22 '20 09:07 sorenlouv

Visualizing these metrics for every agent type would make this information actionable during performance issues troubleshooting.

What are "these metrics"? Currently we show CPU and memory metrics for each agent (except the Java agent).

Do we want to keep showing metrics as averages across all hosts / VMs / containers, or are we going to show them per container like we do for Java?

Jul 22 '20 09:07 sorenlouv