Avoid some probably unnecessary watchdog-reboots with pacemaker_remote by using the knowledge of the cib most recently received before a connection-loss
This is related to and actually needs: https://github.com/ClusterLabs/pacemaker/pull/1130
I've observed that with pacemaker-remote and a watchdog (without a shared block-device), many test-cases end in a watchdog-reboot of the remote-node. In cases where there are no active resources on the remote-node, running into a suicide is probably not needed. Below are the test-cases I tried to drive to an at least improved outcome, as seen from my point of view:
Without Cluster-Watcher:
sbd is satisfied as long as pacemaker_remoted is running --> need to enable Pacemaker-Watcher
Behaviour with Pacemaker-Watcher before change in sbd and pacemaker:
1. the node running the remote-node-resource gets lost (e.g. virsh destroy) --> Watchdog-Reboot
(timeout on the proxy connection that just died; takes way too long to retry on new connection)
2. graceful shutdown of pacemaker on the cluster-nodes one by one --> Watchdog-Reboot
(when last node goes down although resources got shut down in a clean way)
3. pcs resource disable {remote_node} --> Watchdog-Reboot
(loses the cib connection although all resources are actually shut down in a clean way)
4. all cluster nodes are lost at once --> Watchdog-Reboot
(yes, this is the one we want to happen)
5. all cluster-nodes but the one running the remote-node-resource are lost --> Watchdog-Reboot
(would expect graceful shutdown of resources running on partial cluster without quorum)
Behaviour with Pacemaker-Patch setting TCP_USER_TIMEOUT to 1/2 of SBD-Watchdog-Timeout:
1. fixed as long as the remote-node-resource is taken over by another cluster-node quickly enough
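
For illustration only, the timeout relation described above can be sketched as follows. This is not the code from the Pacemaker patch (Pacemaker is written in C); the function names `user_timeout_ms` and `apply_user_timeout` are hypothetical, and the only facts taken from the description are the use of TCP_USER_TIMEOUT and the factor of 1/2 relative to the SBD watchdog timeout. The socket option takes its value in milliseconds and is Linux-specific.

```python
import socket


def user_timeout_ms(watchdog_timeout_s: float) -> int:
    """Return half of the SBD watchdog timeout, in milliseconds.

    TCP_USER_TIMEOUT is specified in milliseconds, so the watchdog
    timeout (assumed here to be given in seconds) is converted first.
    """
    return int(watchdog_timeout_s * 1000 / 2)


def apply_user_timeout(sock: socket.socket, watchdog_timeout_s: float) -> bool:
    """Set TCP_USER_TIMEOUT on sock; return False if the platform lacks it."""
    opt = getattr(socket, "TCP_USER_TIMEOUT", None)  # Linux-only option
    if opt is None:
        return False
    sock.setsockopt(socket.IPPROTO_TCP, opt, user_timeout_ms(watchdog_timeout_s))
    return True
```

With a watchdog timeout of 5 seconds this yields 2500 ms, so a dead proxy connection is detected while there is still time to retry on a new connection before the watchdog triggers.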
Behaviour with Pacemaker-Patch + SBD-Patch checking remaining cib-info for running resources on remote-node on cib-connection-lost:
1. fixed as above
2. fixed: by the time the cib-connection is finally lost, all resources on the
remote-node have been brought down gracefully as well
3. fixed as resources on remote-node are brought down gracefully before the connection is cut
4. still a Watchdog-Reboot, as wanted, since the cib-connection is cut while resources
are running on the remote-node and no other cluster node is taking over
5. fixed as long as sbd-watchdog-timeout is long enough for the remote-node-resource
to be shut down properly before the watchdog triggers
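
The decision behind cases 2 to 5 can be sketched roughly as below. This is a simplified illustration, not the logic of the SBD patch: it assumes the last received cib is available as an XML string, and it uses a crude heuristic (the operation with the highest `call-id` per resource) to guess whether a resource is still active, where a real implementation would rely on Pacemaker's own status evaluation. The function names `active_resources` and `should_self_fence` are hypothetical.

```python
import xml.etree.ElementTree as ET


def active_resources(cib_xml: str, node_name: str) -> list:
    """List resource ids that look active on node_name in the given cib.

    Heuristic (an assumption of this sketch): a resource counts as
    inactive if its most recent recorded operation is a successful stop
    (rc-code 0) or a probe/monitor reporting "not running" (rc-code 7).
    """
    root = ET.fromstring(cib_xml)
    active = []
    for node_state in root.iter("node_state"):
        if node_state.get("uname") != node_name:
            continue
        for rsc in node_state.iter("lrm_resource"):
            ops = rsc.findall("lrm_rsc_op")
            if not ops:
                continue
            last = max(ops, key=lambda op: int(op.get("call-id", "-1")))
            stopped = last.get("operation") == "stop" and last.get("rc-code") == "0"
            not_running = last.get("operation") == "monitor" and last.get("rc-code") == "7"
            if not (stopped or not_running):
                active.append(rsc.get("id"))
    return active


def should_self_fence(last_cib_xml, node_name: str) -> bool:
    """Decide what to do once the cib connection is lost.

    Without any cib information the safe choice is to let the watchdog
    trigger; otherwise self-fence only if resources still look active.
    """
    if last_cib_xml is None:
        return True
    return bool(active_resources(last_cib_xml, node_name))
```

In case 3, for example, the last cib received before the connection is cut would show only stopped resources on the remote-node, so `should_self_fence` returns `False`; in case 4 the resources still appear as started and the watchdog-reboot goes ahead.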