ringpop-go icon indicating copy to clipboard operation
ringpop-go copied to clipboard

Leader election

Open kellabyte opened this issue 9 years ago • 10 comments

The Ringpop documentation has a section about implementing leader election but it's empty. Any chance you could shed some light on how to do so with Ringpop?

kellabyte avatar Oct 09 '16 03:10 kellabyte

Hi @kellabyte,

There is an example implementation of leader election in the ringpop-node repo: https://github.com/uber/ringpop-node/blob/master/examples/leader_election.js

This should be easily portable to Golang. 😄

dansimau avatar Oct 10 '16 08:10 dansimau

I'm not familiar with node. Could you explain how this works please? :) Does this code run on every single node? How do they compete for leadership and ensure safety and stable leadership?

Or is this an application that runs once and just selects a leader among all the nodes?

kellabyte avatar Oct 10 '16 15:10 kellabyte

The linked example shows an implementation of leadership election on top of Ringpop. You would build this into your app.

The way it works is that it hooks into Ringpop's ringChanged event. When the ring changes (e.g. nodes come or go) the leadership election code runs, in each node of your app.

Each node uses a pre-shared arbitrary key (https://github.com/uber/ringpop-node/blob/master/examples/leader_election.js#L27). That key will hash to a single node. Whichever node it hashes to is designated the leader.

However, be aware that SWIM and Ringpop are "weakly-consistent". When a ring change occurs, there will be a period of ring disruption: it takes a few hundred milliseconds to a few seconds for the ring to be stable, depending on the size of the ring.

During this time, the leader will recomputed on each node many times and the leader could change multiple times.

If this is a problem, you could delay the leader re-election by buffering ring change events and only firing the re-election code once when the ring is stable. We use this technique in production. This comes at the cost of the "availability" (i.e. there will be a delay before the new leader is decided).

Hope this helps.

dansimau avatar Oct 10 '16 15:10 dansimau

Thank you for this description! I have a few more questions :)

Can you define what a ring change means? Does it mean a node joins or expires? Also isn't SWIM an epidemic protocol meaning there's multiple perspectives of the truth at any one time and that ring changes aren't globally consistent and that the hashing of the leader may result in multiple leaders for a period of time?

kellabyte avatar Oct 11 '16 02:10 kellabyte

Can you define what a ring change means? Does it mean a node joins or expires?

Ring change is basically a node joining or leaving (i.e. being marked as faulty).

Also isn't SWIM an epidemic protocol meaning there's multiple perspectives of the truth at any one time and that ring changes aren't globally consistent and that the hashing of the leader may result in multiple leaders for a period of time?

Exactly. That's what I meant by a "period of ring disruption". Under normal operating conditions, you should expect the the ring to stabilise (i.e. settle on a single view of the world) within a few seconds.

dansimau avatar Oct 11 '16 09:10 dansimau

Can you stack or combine ring topology updates? What if you want to add a bunch of nodes, remove some, and assign different tokens to others? Does each of those operations need to be processed one after the other, or you can combine them as a single operation ?

markpapadakis avatar Oct 11 '16 09:10 markpapadakis

Can you stack or combine ring topology updates? What if you want to add a bunch of nodes, remove some, and assign different tokens to others? Does each of those operations need to be processed one after the other, or you can combine them as a single operation ?

Each change to the ring will trigger an event. You can use a delay to coalesce a series of ring change events into a single action.

In terms of making changes to the ring: ring changes occur automatically in the system itself. An operator does not have have fine-grained control over setting the membership of the ring, apart from being able to add and remove nodes. The system takes care of itself, if that makes sense.

dansimau avatar Oct 11 '16 09:10 dansimau

Each change to the ring will trigger an event. You can use a delay to coalesce a series of ring change events into a single action.

We have had some far from optimal results from this kind of design in the past. The problem is, if you don't consider multiple updates as a single "atomic" operation, then you get to potentially migrate(data motion) a lot more data than you should. Instead, if you consider:

  1. The current topology
  2. The segments all distinct nodes involved in the transition current replicate
  3. The segments all distinct nodes involved in the transition will replicate after the updates are updated (final ring)

you can compute the exact list of segments that need to be moved (segment => target, list of sources nodes that can provide those segments), and this can be extremely important for reducing cost, overhead, and time to accomplish the transition.

You may want to consider a future upgrade where you don't emit an event or otherwise broadcast that the ring topology has changed with every change because this can result in a lot of unnecessary data transfers and can potentially confuse the cluster. Ring topology rarely changes IMO, and when it does, it's when you want to introduce more nodes(say, 5 more because of demand) or shrink (say by 2 because demand subsided), or when some node(s) are now more or less powerful and can handle more or less work / data (so you can adjust their vnodes ownership in the ring). So, it's probably a good idea to account for that.

( I haven't looked into your codebase yet though, so you may be doing all that already, I just arrived here via Kelly's tweet :)

markpapadakis avatar Oct 11 '16 09:10 markpapadakis

Each change to the ring will trigger an event. You can use a delay to coalesce a series of ring change events into a single action.

But this leader election isn't a safe guarantee based on such a weakly consistent system, is it? You could get a split brain that is longer than the coalescing timeout. Wouldn't 2 leaders get elected via the hash selection on each side of the partition?

kellabyte avatar Oct 11 '16 12:10 kellabyte

Correct. This is a property of a weakly consistent system.

dansimau avatar Oct 11 '16 12:10 dansimau