[APM] Service instance runtime metrics
Summary of the problem (If there are multiple problems or use cases, prioritize them)
Currently, APM agents collect various system and runtime metrics that could help detect resource saturation or configuration issues. Visualizing these metrics for every agent type would make this information actionable when troubleshooting performance issues.
User stories
- As App Ops, I need to correlate service performance with system and runtime performance.
- As App Ops, I need to be able to identify when a specific instance is performing differently from the majority of other instances.
- As App Ops, I need to quickly identify which runtime metrics are trending outside their normal range at the same time as the service is experiencing issues.
List known (technical) restrictions and requirements
It has to work with different agent types and account for the fact that each runtime has its own specific runtime metrics.
If in doubt, don’t hesitate to reach out to the #observability-design Slack channel.
Pinging @elastic/observability-design (design)
We have three issues for runtime metrics:
- Design issue: https://github.com/elastic/apm/issues/301 (this)
- Meta issue (?): https://github.com/elastic/apm/issues/224
- Implementation issue: https://github.com/elastic/kibana/issues/63573
Are all of them needed? I'm not sure what the purpose of the meta issue is.
> Visualizing these metrics for every agent type would make this information actionable when troubleshooting performance issues.
What are "these metrics"? Currently we show CPU and memory metrics for each agent (except the Java agent).
Do we want to keep showing metrics as averages across all hosts, VMs, and containers, or are we going to show them per container like we do for Java?