
[BUG] Metrics Server periodically returning service unavailable

Open vikas-rajvanshy opened this issue 1 year ago • 4 comments

Describe the bug
Numerous events stating:

failed to get cpu utilization: unable to get metrics for resource cpu: unable to fetch metrics from resource metrics API: the server is currently unable to handle the request (get pods.metrics.k8s.io) source: component: horizontal-pod-autoscaler

To Reproduce
Steps to reproduce the behavior:

  1. Install an AKS cluster on Kubernetes 1.30.3 with Istio enabled, API Server VNET integration, and Node Autoprovisioning

These events spam the event log. The metrics server appears to be running, yet kubectl top pod fails sporadically with

Error from server (ServiceUnavailable): the server is currently unable to handle the request (get pods.metrics.k8s.io)
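Since the failures are transient, one common workaround for callers that hit the metrics API directly is to retry with exponential backoff. A minimal sketch in Python; the flaky endpoint here is simulated (a real client would go through the Kubernetes API server), and the ~1-in-4 failure rate mirrors what is reported later in this thread:

```python
import random
import time

random.seed(0)  # make the simulated demo run deterministic


class ServiceUnavailable(Exception):
    """Stand-in for an HTTP 503 from the aggregated metrics API."""


def fetch_pod_metrics():
    # Simulated flaky endpoint: roughly 1 in 4 calls fails,
    # matching the sporadic ServiceUnavailable errors reported above.
    if random.random() < 0.25:
        raise ServiceUnavailable(
            "the server is currently unable to handle the request")
    return {"metrics-server-7dddddfd7d-22ddd": {"cpu": "148m", "memory": "88Mi"}}


def fetch_with_retry(fn, attempts=5, base_delay=0.1):
    """Call fn, retrying on ServiceUnavailable with exponential backoff."""
    for attempt in range(attempts):
        try:
            return fn()
        except ServiceUnavailable:
            if attempt == attempts - 1:
                raise  # out of retries; surface the error
            time.sleep(base_delay * (2 ** attempt))


metrics = fetch_with_retry(fetch_pod_metrics)
```

This only masks the symptom for programmatic consumers; the HPA's own calls cannot be wrapped this way, so the underlying availability issue still needs a fix.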

The metrics server appears to be running quite hot for a small cluster with < 10 nodes (using default AKS settings):

metrics-server-7dddddfd7d-22ddd 148m 88Mi
metrics-server-7dddddfd7d-5wkcv 148m 72Mi

Expected behavior
The metrics server does not periodically fail.


Environment (please complete the following information):

  • Kubernetes version 1.30.3

Additional context
kubectl logs -n kube-system deployment/metrics-server

Found 2 pods, using pod/metrics-server-7dddddfd7d-22ddd
Defaulted container "metrics-server-vpa" out of: metrics-server-vpa, metrics-server
I0906 22:40:45.124467 1 pod_nanny.go:86] Invoked by [/pod_nanny --config-dir=/etc/config --cpu=150m --extra-cpu=0.5m --memory=100Mi --extra-memory=4Mi --poll-period=180000 --threshold=5 --deployment=metrics-server --container=metrics-server]
I0906 22:40:45.124570 1 pod_nanny.go:87] Version: 1.8.22
I0906 22:40:45.124594 1 pod_nanny.go:109] Watching namespace: kube-system, pod: metrics-server-7dddddfd7d-22ddd, container: metrics-server.
I0906 22:40:45.124600 1 pod_nanny.go:110] storage: MISSING, extra_storage: 0Gi
I0906 22:40:45.125127 1 pod_nanny.go:214] Failed to read data from config file "/etc/config/NannyConfiguration": open /etc/config/NannyConfiguration: no such file or directory, using default parameters
I0906 22:40:45.125149 1 pod_nanny.go:144] cpu: 150m, extra_cpu: 0.5m, memory: 100Mi, extra_memory: 4Mi
I0906 22:40:45.125159 1 pod_nanny.go:278] Resources: [{Base:{i:{value:150 scale:-3} d:{Dec:} s:150m Format:DecimalSI} ExtraPerResource:{i:{value:5 scale:-4} d:{Dec:} s: Format:DecimalSI} Name:cpu} {Base:{i:{value:104857600 scale:0} d:{Dec:} s:100Mi Format:BinarySI} ExtraPerResource:{i:{value:4194304 scale:0} d:{Dec:} s:4Mi Format:BinarySI} Name:memory}]
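For readers decoding this log: the pod_nanny (addon-resizer) flags imply a per-node sizing formula, request = base + extra_per_node × node_count. A small sketch of that arithmetic, inferred from the flags in the log above rather than from AKS source:

```python
def nanny_request(base, extra_per_node, nodes):
    """Addon-resizer sizing as implied by the pod_nanny flags:
    request = base + extra_per_node * node_count."""
    return base + extra_per_node * nodes


# Values from the log: --cpu=150m --extra-cpu=0.5m --memory=100Mi --extra-memory=4Mi
cpu_millicores = nanny_request(150, 0.5, nodes=10)  # CPU request for a 10-node cluster
memory_mib = nanny_request(100, 4, nodes=10)        # memory request for a 10-node cluster
```

On a small cluster the per-node extras barely move the defaults, which is consistent with the pods sitting at ~148m CPU: they are running close to their 150m request.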

vikas-rajvanshy avatar Sep 10 '24 18:09 vikas-rajvanshy

I attempted to mitigate this by adding more resources to metrics-server per https://learn.microsoft.com/en-us/azure/aks/use-metrics-server-vertical-pod-autoscaler#manually-configure-metrics-server-resource-usage. It doesn't seem to help: approximately 1 in 4 requests still fails with:

Error from server (ServiceUnavailable): the server is currently unable to handle the request (get pods.metrics.k8s.io)
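For anyone trying the same mitigation: the AKS doc linked above configures the nanny through a ConfigMap whose NannyConfiguration key the earlier log reports as missing. A sketch of the shape that doc describes (field names follow the addon-resizer NannyConfiguration type; the values here are illustrative, not recommendations):

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: metrics-server-config
  namespace: kube-system
  labels:
    kubernetes.io/cluster-service: "true"
    addonmanager.kubernetes.io/mode: EnsureExists
data:
  NannyConfiguration: |-
    apiVersion: nannyconfig/v1alpha1
    kind: NannyConfiguration
    baseCPU: 150m          # illustrative: matches the default seen in the log
    cpuPerNode: 1m
    baseMemory: 100Mi
    memoryPerNode: 8Mi
```

Applying this and restarting the metrics-server deployment should make the pod_nanny log line about "/etc/config/NannyConfiguration: no such file or directory" go away, confirming the overrides are being read.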

Maybe there is an incompatibility with metrics-server v0.6.3, the version AKS uses?

vikas-rajvanshy avatar Sep 11 '24 16:09 vikas-rajvanshy

@xiazhan we looked at this from the service-mesh side and do not think this is an issue with Istio. Could you investigate on metrics server side?

SanyaKochhar avatar Dec 12 '24 18:12 SanyaKochhar

I've got the same issue. Any advice on that?

yashak avatar Mar 06 '25 15:03 yashak

Same here on AKS: 47 nodes running, and the HPA sometimes has difficulty reaching metrics-server.

[image attached]

Nillu avatar Mar 28 '25 09:03 Nillu

This issue still occurs

yashak avatar Apr 28 '25 11:04 yashak

This issue has been automatically marked as stale because it has not had any activity for 30 days. It will be closed if no further activity occurs within 7 days of this comment. Please review @xiazhan.

This issue has been automatically marked as stale because it has not had any activity for 30 days. It will be closed if no further activity occurs within 7 days of this comment. Please review @xiazhan, @kthakar1990, @stl327, @huizhifan.

This issue will now be closed because it hasn't had any activity for 7 days after being marked stale. @vikas-rajvanshy feel free to comment again within the next 7 days to reopen, or open a new issue after that time if you still have a question/issue or suggestion.