kiam-agent health check failed: liveness probe failing
Hi, we did load testing on our kube cluster today and brought nodes from 50 to 170 and pods from 300 to around 700. We increased the node count whenever there were too many pending pods, and we noticed that on many nodes kiam-agent started going into CrashLoopBackOff because its liveness probe failed.
Liveness probe failed: Get http://10xxx:8181/ping: dial tcp xxx:8181: connect: connection refused
At the start we thought it might be because too many pods were getting scheduled on the same node, but after a few minutes all the pods were running and average node utilization was around 40%, yet the kiam-agent health check was still failing (or kiam had not even started, but the health check had begun running before that).
So we thought kiam-server might be the bottleneck, as it was running on 3 masters as a DaemonSet. We created a Deployment and started 10 more kiam-server pods, but that didn't help: the agent pods kept failing, and even new pods that came up after I deleted the old ones failed too.
livenessProbe:
  failureThreshold: 3
  httpGet:
    path: /ping
    port: 8181
    scheme: HTTP
  initialDelaySeconds: 3
  periodSeconds: 3
  successThreshold: 1
  timeoutSeconds: 1
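For reference, the probe above amounts to an HTTP GET of /ping with a 1-second timeout, where a refused connection or timeout counts as a failure. A minimal sketch in Python of that behaviour (the handler here is only a stand-in for the agent's real /ping endpoint):

```python
import http.server
import threading
import urllib.error
import urllib.request

# Stand-in for the agent's /ping endpoint (the real agent simply answers "pong").
class PingHandler(http.server.BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/ping":
            self.send_response(200)
            self.end_headers()
            self.wfile.write(b"pong")
        else:
            self.send_response(404)
            self.end_headers()

    def log_message(self, *args):  # silence per-request logging
        pass

server = http.server.HTTPServer(("127.0.0.1", 0), PingHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
port = server.server_address[1]

def probe(port, timeout=1.0):
    """Roughly what kubelet does for the probe above: GET /ping with a 1s timeout."""
    try:
        with urllib.request.urlopen(f"http://127.0.0.1:{port}/ping", timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # "connection refused" (agent not listening yet) and timeouts end up here
        return False

print(probe(port))  # → True
```

This also shows why an `initialDelaySeconds` of 3 is tight: if the agent takes longer than that to bind port 8181, the probe sees "connection refused" exactly as in the error above.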
We are running kiam 3.0.
It's hard to diagnose further without any log or metric data. Could you share output from the server and agents while this was happening, log output from the health checks, etc.?
The numbers you quote aren't high, so it feels unrelated to load, but without more data it's hard to suggest more.
@pingles The health check is currently pretty weak: it just returns "pong" in response. Maybe we need to extend the agent's health check to at least ping the kiam server. What might be happening is that some issue with the underlying CNI provider is hampering the agent <-> server communication.
Also, by my definition, I'd call a kiam agent alive only if it can communicate with the kiam server; otherwise it's as good as dead to me.
Please correct me if I'm wrong.
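A deeper health check along these lines could gate /ping on server reachability. A minimal sketch in Python (the names and address here are hypothetical; the real agent is written in Go, gets the server address from its --server-address flag, and speaks gRPC rather than raw TCP):

```python
import http.server
import socket

# Hypothetical server address, for illustration only.
KIAM_SERVER = ("127.0.0.1", 443)

def server_reachable(addr, timeout=1.0):
    """Cheap reachability check: can we open a TCP connection to the server?"""
    try:
        with socket.create_connection(addr, timeout=timeout):
            return True
    except OSError:
        return False

class DeepPingHandler(http.server.BaseHTTPRequestHandler):
    """Answers /ping with 200 only when the kiam server is reachable, else 503."""
    def do_GET(self):
        if self.path == "/ping" and server_reachable(KIAM_SERVER):
            self.send_response(200)
            self.end_headers()
            self.wfile.write(b"pong")
        else:
            self.send_response(503)
            self.end_headers()

    def log_message(self, *args):
        pass
```

The trade-off is the one raised below: with one agent per node and automatic reconnection, failing the agent's liveness probe on a transient server outage would restart agents cluster-wide without actually fixing anything.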
It's a good question. Currently the agent will attempt to re-initiate its connection to the server in the event of networking or other transient errors. The usual deployment topology (an agent running on each node) also means there's not much value in having the agent flap when it can't reach the server. As long as the agent process is healthy (which the health check confirms), it will keep attempting to re-establish a connection to any server.
@uswitch/cloud ?
@thejasbabu from your comment above, does that mean you've figured out what the problem in your original issue is?
We've had the kiam-agent terminated for failing its liveness probe too. May I suggest adding CPU and memory requests in agent.yaml? Absent those, the default Kubernetes behaviour is to give each container a CPU share of 0.2%, so if your machine does get busy the agent will be starved out. (ref #222)
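For example, the agent container spec could include something like the following (the values here are illustrative only; size them to your own nodes and workload):

```yaml
resources:
  requests:
    cpu: 50m
    memory: 32Mi
```

With an explicit CPU request, the scheduler accounts for the agent when placing pods and the kubelet gives it proportionally more CPU shares under contention, so a busy node is less likely to starve it into missing its 1-second probe timeout.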
any news on this issue?