
Connecting to a standby leader should forward admin page

Open fire opened this issue 8 years ago • 12 comments

What is the feasibility of forwarding the admin page when connecting to a standby leader?

The use case is a load balancer pointed at two leaders: it cannot tell which one is the current (active) leader, so it sometimes routes to an invalid admin page.

fire avatar Feb 18 '18 12:02 fire

@fire Can't the load balancer check which of the two is available? Only one of the host/port combo would be reachable at a time.

sumwale avatar Feb 18 '18 19:02 sumwale

For me, port 5050 is reachable on both leaders.

I use it to determine whether the leader is up, but both the active and the standby leader listen on port 5050.

(Screenshots: Invalid #1, Invalid #2)

This is what the standby leader admin page looks like.

fire avatar Feb 18 '18 20:02 fire

For comparison this is a regular admin page.

(Screenshot: Example #3)

fire avatar Feb 18 '18 20:02 fire

Oh, ok. This can be improved so that the Spark UI port is not opened on the standby leader. But can't you check the jobserver port (default 8090)? That should certainly not be open on the standby leader.

sumwale avatar Feb 18 '18 21:02 sumwale

Not opening the Spark UI port on the standby would let the load balancer reach the correct leader. That is the proper solution.

Kubernetes doesn't allow me to do that: it only lets me pick one port, and it must be the same port on each pod (since they run the same image and the same config).

fire avatar Feb 19 '18 00:02 fire

Sorry, still confused. Why can't that port be 8090 on all pods?

sumwale avatar Feb 19 '18 05:02 sumwale

@fire You may add a readiness probe using httpGet for port 8090 of leader pods. This should fail for the standby leader.
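A minimal sketch of what that could look like in the lead pod spec. Port 8090 is the default jobserver port mentioned above; the container name and probe path are hypothetical, so adjust them to your deployment:

```yaml
# Hypothetical fragment of the lead StatefulSet's pod template.
# The jobserver (default port 8090) answers only on the active lead,
# so the standby pod stays "not ready" and the Service stops routing
# traffic to it, while the pod itself is left running (no restarts,
# unlike a liveness probe).
containers:
  - name: snappydata-lead        # assumed container name
    readinessProbe:
      httpGet:
        path: /                  # assumed path; any endpoint the jobserver serves works
        port: 8090
      initialDelaySeconds: 30
      periodSeconds: 10
```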

dshirish avatar Feb 19 '18 10:02 dshirish

From the Kubernetes docs: "The kubelet uses liveness probes to know when to restart a Container. For example, liveness probes could catch a deadlock, where an application is running, but unable to make progress. Restarting a Container in such a state can help to make the application more available despite bugs."

In this Kubernetes setup, both the leader and the hot-standby leader restart until they are able to start successfully. A liveness probe tests whether a pod is a "zombie" and, if so, restarts it.

However, the standby leader is not a zombie; it is continuously checking whether the active leader is alive. Setting the liveness probe to an httpGet on port 8090 means the standby leader restarts constantly, because the probe keeps failing. Kubernetes applies an exponential back-off delay before restarting a StatefulSet pod, so the exact time at which the standby leader would be rescheduled is indeterminate. This makes the leader failover time indeterminate.

I'm not sure what defines an active vs. inactive vs. standby leader. There doesn't seem to be a common property I can test for.

@dshirish mentioned that:

  • active: port 5050 + port 8090
  • standby: port 5050 only
  • inactive: no ports
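For reference, the three states above can be distinguished by probing the two ports. A minimal sketch, assuming the default ports from this thread (5050 = Spark UI, 8090 = jobserver); the helper names are hypothetical, not part of any SnappyData API:

```python
# Classify a lead node by which of its ports answer:
# active = 5050 + 8090, standby = 5050 only, down = neither.
import socket


def port_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def lead_role(host: str) -> str:
    """Classify a lead node from the reachability of its ports."""
    spark_ui = port_open(host, 5050)   # open on active and standby
    jobserver = port_open(host, 8090)  # open only on the active lead
    if spark_ui and jobserver:
        return "active"
    if spark_ui:
        return "standby"
    return "down"
```

This is essentially what a readiness probe does declaratively: port 8090 alone separates the active lead from the standby.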

Currently I've disabled liveness probes.

Reference

https://kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-probes/

fire avatar Feb 19 '18 13:02 fire

@fire What @dshirish mentioned is a readiness probe, not a liveness probe. An HTTP readiness probe ensures the pod receives no traffic but does not kill the application. Isn't that what is required?

sumwale avatar Feb 19 '18 15:02 sumwale

@fire We will track disabling the Spark UI on the standby, but for now a readiness probe should work for your case.

sumwale avatar Feb 19 '18 15:02 sumwale

I understand what you mean: the pod is actually functioning but isn't ready, so it doesn't restart, yet it also receives no traffic. Thank you.

fire avatar Feb 19 '18 15:02 fire

Why not simply have a single lead (pod) and let k8s restart the pod if it fails? Have a liveness probe to detect such conditions better. With images cached, the pod restart should be very quick.

Our lead nodes are not active-active and don't provide HA for jobs, which fail when the lead departs; you have to resubmit them.



jramnara avatar Feb 19 '18 16:02 jramnara