AKS icon indicating copy to clipboard operation
AKS copied to clipboard

[Feature] AKS Managed Alerts (CLI/API)

Open julia-yin opened this issue 9 months ago • 4 comments

AKS provides a set of preconfigured alerts on the most critical/common problem areas. This feature is already available through Portal today with a limited set of alerts: Recommended alert rules for Kubernetes clusters. This allows you to easily turn on alerts for issues such as node saturation, failed upgrades (WIP), and other scenarios.

We are looking to make the following improvements:

  1. Allow enablement of the alerts during cluster create and update in the CLI/API (Terraform, Bicep, etc.).
  2. Add in new alert scenarios to cover a wider range of critical AKS problem areas

julia-yin avatar Apr 09 '25 16:04 julia-yin

I think it would be good to also link to the Prometheus runbook on how to troubleshoot each issue. Or a Microsoft docs page.

PixelRobots avatar Apr 09 '25 21:04 PixelRobots

@julia-yin have an update?

olsenme avatar May 09 '25 22:05 olsenme

Engineering work has kicked off for this feature, targeting Preview by August. We will be adding the following alerts scenarios to the current list:

  • Cluster/nodepool upgrade failures
  • API server overloaded
  • ETCD database full
  • Disk IOPS saturation
  • Load balancer SNAT port exhaustion

julia-yin avatar May 28 '25 23:05 julia-yin

I agree. By the way, I’m currently configuring the recommended alerts as IaC based on a GitHub template—Is there any way to hide/dismiss the “Enable recommended alert rules” message and its Enable button shown in the portal? I’m worried I might accidentally enable it and end up with duplicate alert rules.

Image

torumakabe avatar Aug 23 '25 02:08 torumakabe