[Feature] AKS Managed Alerts (CLI/API)
AKS provides a set of preconfigured alerts on the most critical/common problem areas. This feature is already available through Portal today with a limited set of alerts: Recommended alert rules for Kubernetes clusters. This allows you to easily turn on alerts for issues such as node saturation, failed upgrades (WIP), and other scenarios.
We are looking to make the following improvements:
- Allow enablement of the alerts during cluster create and update in the CLI/API (Terraform, Bicep, etc.).
- Add in new alert scenarios to cover a wider range of critical AKS problem areas
I think it would be good to also link to the Prometheus runbook on how to troubleshoot each issue. Or a Microsoft docs page.
@julia-yin have an update?
Engineering work has kicked off for this feature, targeting Preview by August. We will be adding the following alerts scenarios to the current list:
- Cluster/nodepool upgrade failures
- API server overloaded
- ETCD database full
- Disk IOPS saturation
- Load balancer SNAT port exhaustion
I agree. By the way, I’m currently configuring the recommended alerts as IaC based on a GitHub template—Is there any way to hide/dismiss the “Enable recommended alert rules” message and its Enable button shown in the portal? I’m worried I might accidentally enable it and end up with duplicate alert rules.