
Revisit consistent hashing implementation for shard distribution

Open · shaqque opened this issue on Jul 17, 2019 · 4 comments

Pilosa currently uses the Jump Consistent Hashing Algorithm to distribute shards to nodes in the cluster.

The main strength of a consistent hash is that it minimizes the amount of data transferred when a node is added to or removed from the cluster. The Jump hash in particular outperforms other consistent hashing algorithms by restricting the nodes to an ordered sequence, so nodes can only be added or removed at the end of that sequence.
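For reference, the core of the algorithm from the Lamping & Veach paper ("A Fast, Minimal Memory, Consistent Hash Algorithm") is only a few lines; a Go transcription (not Pilosa's actual code) looks roughly like this:

```go
// jumpHash maps a 64-bit key to a bucket in [0, numBuckets).
// Transcribed from the Lamping & Veach paper.
func jumpHash(key uint64, numBuckets int) int {
	var b, j int64 = -1, 0
	for j < int64(numBuckets) {
		b = j
		key = key*2862933555777941757 + 1
		j = int64(float64(b+1) * (float64(int64(1)<<31) / float64((key>>33)+1)))
	}
	return int(b)
}
```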

However, a node in Pilosa can be added or removed even if it is in the middle. As an example, if we have node0, node1, node2, adding a node3 under Jump hash moves roughly a quarter of the data from each of the first three nodes onto node3, which is the minimal amount of movement you'd expect. However, removing node1 makes node2 the new "second node", so all the data originally in node1 (before node3 was even added) goes to node2, and all the data previously in node2 goes to node3. This defeats the purpose of a consistent hash.
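To make that concrete, here is a hypothetical helper (reusing the jumpHash sketch above; the node names are just for illustration) that counts how many shards change owner between two node lists:

```go
// countMoved reports how many of numShards shards change owner when the
// node list changes from before to after, assuming shard i is owned by
// nodes[jumpHash(i, len(nodes))]. Hypothetical helper, for illustration.
func countMoved(numShards int, before, after []string) int {
	moved := 0
	for i := 0; i < numShards; i++ {
		if before[jumpHash(uint64(i), len(before))] != after[jumpHash(uint64(i), len(after))] {
			moved++
		}
	}
	return moved
}
```

Comparing `countMoved(1000, []string{"node0", "node1", "node2", "node3"}, []string{"node0", "node2", "node3"})` with removing the last node instead shows the ordinal shift: dropping a middle node moves far more than that node's own share of the data.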

There are a few ways we can fix this. One is to explicitly map each ordinal to a node. In the above example, we can map the value 2 to node2, so that when node1 is removed, the node mapped to the highest value (node3) takes over value 1. This might be too complicated to implement, though. Moreover, we're currently seeing poor balance from the jump hash: roughly a 130% ratio of max to average shards per node (for 1,000-10,000 shards on 10 nodes).
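One way to picture that ordinal-to-node mapping (a minimal sketch with hypothetical names, again reusing jumpHash; not a proposal for the actual implementation):

```go
// bucketMap keeps a stable mapping from jump-hash bucket ordinal to node,
// so a node can leave from the middle without shifting every later node's
// ordinal. Hypothetical sketch.
type bucketMap struct {
	nodes []string // index = bucket ordinal
}

// owner returns the node responsible for a shard.
func (m *bucketMap) owner(shard uint64) string {
	return m.nodes[jumpHash(shard, len(m.nodes))]
}

// remove drops a node by moving the node at the highest ordinal into the
// vacated slot. Only the removed node's shards and the last ordinal's
// shards have to move; everything else stays put.
func (m *bucketMap) remove(node string) {
	for i, n := range m.nodes {
		if n == node {
			last := len(m.nodes) - 1
			m.nodes[i] = m.nodes[last]
			m.nodes = m.nodes[:last]
			return
		}
	}
}
```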

Another way is to use a different hashing algorithm altogether: either a new consistent hashing algorithm that still minimizes data transfer but does not depend on node order, or a regular hash that does not minimize data transfer on node add/leave events but balances the number of shards per node reasonably well for large numbers of shards and nodes.
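One candidate in the first category is rendezvous (highest-random-weight) hashing: every node gets a score from hash(node, shard), the highest score wins, node order never matters, and a departing node only gives up its own shards. A rough Go sketch (hypothetical function names; FNV is used purely for illustration):

```go
import (
	"hash/fnv"
	"strconv"
)

// rendezvousOwner returns the node with the highest hash score for the
// shard. Removing a node only reassigns the shards that node owned,
// regardless of its position in the list.
func rendezvousOwner(shard uint64, nodes []string) string {
	var best string
	var bestScore uint64
	for _, n := range nodes {
		h := fnv.New64a()
		h.Write([]byte(n))
		h.Write([]byte("/" + strconv.FormatUint(shard, 10)))
		if score := h.Sum64(); best == "" || score > bestScore {
			best, bestScore = n, score
		}
	}
	return best
}
```

The trade-off is an O(number of nodes) scan per lookup, which is probably acceptable at the cluster sizes discussed here.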

shaqque avatar Jul 17 '19 23:07 shaqque

Did some testing to confirm that with the current implementation, removing a random node that is not at the end causes a lot of data shuffling.

In particular, for 1000 shards spread across 6 nodes, removing the first node changes the data content per node by

min  max  ave
145  178  159.4

which is effectively reshuffling all the data in each node.
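For anyone who wants to reproduce this kind of measurement, a rough harness (not the actual test code; it reuses the jumpHash sketch from above, the node names are hypothetical, and the original test may have computed its per-node statistic slightly differently):

```go
// measureRemoval counts, for each remaining node, how many shards arrive
// on it that it did not hold before the first node was removed.
func measureRemoval(numShards int) map[string]int {
	before := []string{"node0", "node1", "node2", "node3", "node4", "node5"}
	after := before[1:] // drop the first node

	moved := make(map[string]int)
	for i := 0; i < numShards; i++ {
		oldOwner := before[jumpHash(uint64(i), len(before))]
		newOwner := after[jumpHash(uint64(i), len(after))]
		if oldOwner != newOwner {
			moved[newOwner]++
		}
	}
	return moved
}
```

Looking at the min/max/average of those per-node counts gives a sense of how much of the cluster gets reshuffled by a single removal.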

The current goal is to implement a new consistent hashing algorithm that (1) balances the shard number in each node, and (2) minimizes data reshuffling on node add/leave events.

shaqque avatar Jul 18 '19 17:07 shaqque

I think we can figure out a way to make new nodes always join at the end of the list. When nodes leave, it's either because the cluster is being downsized, in which case we can arrange for the last node in the cluster to be removed, or because a node actually got sick and died, which probably means it should be replaced rather than permanently removed.

So let's not be too concerned about nodes having to join at the end of the list; getting the data spread out evenly and minimizing data movement on node join are both priorities.

jaffee avatar Jul 18 '19 17:07 jaffee

Think of it as a ring, not a list.

tgruben avatar Jul 18 '19 17:07 tgruben

Jump hash doesn't do too well on spreading the shards out evenly, regardless of node join/leave events. It also treats the nodes as a list, not a ring.

So I think we're all in agreement to implement a different consistent hashing algorithm altogether.
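For reference, the classic ring construction hashes every node to many points ("virtual nodes") on a ring and assigns each shard to the first node point clockwise from the shard's hash, so a join or leave only moves the shards between adjacent points. A rough Go sketch (hypothetical names; FNV and the replica count are just for illustration, not a proposed implementation):

```go
import (
	"hash/fnv"
	"sort"
	"strconv"
)

// ring is a minimal consistent-hash ring with virtual nodes: each node is
// hashed to `replicas` points, and a shard is owned by the first point at
// or after the shard's hash, wrapping around.
type ring struct {
	points []uint64          // sorted virtual-node hashes
	owner  map[uint64]string // virtual-node hash -> node
}

func hash64(s string) uint64 {
	h := fnv.New64a()
	h.Write([]byte(s))
	return h.Sum64()
}

func newRing(nodes []string, replicas int) *ring {
	r := &ring{owner: make(map[uint64]string)}
	for _, n := range nodes {
		for i := 0; i < replicas; i++ {
			p := hash64(n + "#" + strconv.Itoa(i))
			r.points = append(r.points, p)
			r.owner[p] = n
		}
	}
	sort.Slice(r.points, func(i, j int) bool { return r.points[i] < r.points[j] })
	return r
}

// node returns the owner of a shard: the first ring point at or after the
// shard's hash, wrapping to the first point if necessary.
func (r *ring) node(shard uint64) string {
	h := hash64("shard" + strconv.FormatUint(shard, 10))
	i := sort.Search(len(r.points), func(i int) bool { return r.points[i] >= h })
	if i == len(r.points) {
		i = 0
	}
	return r.owner[r.points[i]]
}
```

How evenly the shards spread with this scheme depends mostly on how many virtual nodes each physical node gets.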

shaqque avatar Jul 18 '19 17:07 shaqque