can't create cluster over localhost:7777 tunneled connection
I think the "peer already known" logic needs to take the port into account as well as the host, or perhaps it just needs to treat localhost specially. I set up an SSH tunnel (using ssh -L 7777:localhost:7481 remotehost) between machines in EC2 to run some benchmarks, but I can't seem to form a cluster over the tunnel:
$ summitdb -join localhost:7777
24510:M 23 Jan 06:17:45.894 * summitdb 0.3.2
24510:N 23 Jan 06:17:45.897 * Node at :7481 [Follower] entering Follower state (Leader: "")
24510:N 23 Jan 06:17:45.898 # failed to join node at localhost:7777: peer already known
$
Hmm... actually, upon further investigation, this error seems to be coming from the vendored raft here: https://github.com/tidwall/summitdb/blob/master/vendor/github.com/hashicorp/raft/raft.go#L1101
I will continue to investigate. Ideas about how to approach this and workaround thoughts welcome.
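One possible reading of the error, sketched under assumptions: the joining node identifies itself as :7481 (see the "Node at :7481" log line), and the tunnel also targets port 7481 on the remote side, so the join request may carry an address the remote node already treats as its own. The alreadyKnown helper below is hypothetical and is not the vendored raft code; it only illustrates a membership check that compares full address strings.

```go
package main

import "fmt"

// alreadyKnown is a hypothetical illustration, NOT the hashicorp/raft code.
// It reports whether candidate matches the node's own address or any
// address in its current peer list, using plain string comparison.
func alreadyKnown(localAddr string, peers []string, candidate string) bool {
	if candidate == localAddr {
		return true
	}
	for _, p := range peers {
		if p == candidate {
			return true
		}
	}
	return false
}

func main() {
	// Both nodes on the default port: the joiner looks identical to the node itself.
	fmt.Println(alreadyKnown(":7481", nil, ":7481")) // true
	// Joiner started on another port: the address is new.
	fmt.Println(alreadyKnown(":7481", nil, ":7480")) // false
}
```

If that reading is right, starting the second node on a different port should at least get past this particular check.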
I tried giving the 2nd peer -p 7480 to start it on a different port. Better, but still no luck:
106692:N 23 Jan 06:31:52.857 # Election timeout reached, restarting election
106692:N 23 Jan 06:31:52.857 * Node at :7481 [Candidate] entering Candidate state
106692:N 23 Jan 06:31:52.858 # Failed to make RequestVote RPC to :7480: dial tcp :7480: getsockopt: connection refused
106692:N 23 Jan 06:31:54.124 # Election timeout reached, restarting election
106692:N 23 Jan 06:31:54.124 * Node at :7481 [Candidate] entering Candidate state
106692:N 23 Jan 06:31:54.125 # Failed to make RequestVote RPC to :7480: dial tcp :7480: getsockopt: connection refused
... more of the same...
The first peer seems to want to dial via tcp directly, rather than re-using the existing (tunnelled) connection to the 7480 peer.
Interestingly, even removing the 2nd peer does not work, and the one viable node never elects itself leader:
127.0.0.1:7481> raftremovepeer ":7480"
(error) ERR leader not known
127.0.0.1:7481>
1st node continues to say:
106692:N 23 Jan 06:54:41.831 # Failed to make RequestVote RPC to :7480: dial tcp :7480: getsockopt: connection refused
106692:N 23 Jan 06:54:43.429 # Election timeout reached, restarting election
106692:N 23 Jan 06:54:43.429 * Node at :7481 [Candidate] entering Candidate state
106692:N 23 Jan 06:54:43.431 # Failed to make RequestVote RPC to :7480: dial tcp :7480: getsockopt: connection refused
106692:N 23 Jan 06:54:45.124 # Election timeout reached, restarting election
106692:N 23 Jan 06:54:45.125 * Node at :7481 [Candidate] entering Candidate state
106692:N 23 Jan 06:54:45.126 # Failed to make RequestVote RPC to :7480: dial tcp :7480: getsockopt: connection refused
I would prefer that "raftremovepeer" be a bit more aggressive here, so as to restore the cluster to a functioning state.
(I do realize this is all the underlying raft implementation, and little to do with summitdb proper.)
I haven't played too much with running raft over SSH tunnels, so I'm trying to catch up. I'll have to investigate further to fully wrap my head around it.
Regarding the raft implementation, as I understand it, all the peers must be able to reach each other using the same host:port combination. Would it help to create entries in the hosts file to alias localhost?
I didn't set up symmetric tunnels, so it's my bad.
I'm sure it simplifies the raft code to assume full peer-to-peer connectivity, with every node acting as both client and server.
It does end up simulating split-brain pretty well though. I wonder why hashicorp raft has such a difficult time recovering from it. Might be because I never got to 3 nodes, only 1 and then 1.5