
avoid some probably unnecessary watchdog-reboots with pacemaker_remot…

Open wenningerk opened this issue 9 years ago • 0 comments

…e by using the knowledge of the cib most recently received before a connection-loss

This is related to and actually needs: https://github.com/ClusterLabs/pacemaker/pull/1130

With pacemaker-remote and a watchdog (without a shared block-device), I've observed that a lot of test-cases lead to a watchdog-reboot of the remote-node. But when there are no active resources on the remote-node, a suicide is probably not needed. I put together a couple of test-cases that I've tried to drive to an at least improved outcome, as seen from my point of view:

without Cluster-Watcher:

sbd is satisfied as long as pacemaker_remoted is running --> need to enable Pacemaker-Watcher

Behaviour with Pacemaker-Watcher before the changes in sbd and pacemaker:

1. the node running the remote-node-resource gets lost (e.g. virsh destroy) --> Watchdog-Reboot
    (timeout on the proxy connection that just died; takes way too long to retry on new connection)
2. graceful shutdown of pacemaker on the cluster-nodes one by one --> Watchdog-Reboot
    (when the last node goes down, although resources were shut down in a clean way)
3. pcs resource disable {remote_node} --> Watchdog-Reboot
    (loses cib connection, but actually all resources were shut down in a clean way)
4. all cluster nodes are lost at once --> Watchdog-Reboot
    (yesss that is the one we want to happen)
5. all cluster-nodes but the one running the remote-node-resource are lost --> Watchdog-Reboot
    (would expect graceful shutdown of resources running on partial cluster without quorum)

Behaviour with Pacemaker-Patch setting TCP_USER_TIMEOUT to 1/2 of SBD-Watchdog-Timeout:

1. fixed as long as the remote-node-resource is taken over by another cluster-node quickly enough
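
As a rough illustration of the timeout relation described here (not the code from the linked Pacemaker PR), the sketch below derives a `TCP_USER_TIMEOUT` value as half of the watchdog timeout and applies it to a socket. The function names `user_timeout_ms` and `apply_user_timeout` are made up for this example; `TCP_USER_TIMEOUT` is a Linux socket option that takes milliseconds, so the sketch skips the `setsockopt` call where the constant is not available.

```python
import socket


def user_timeout_ms(watchdog_timeout_s: float) -> int:
    """Return half of the watchdog timeout, converted to milliseconds.

    Hypothetical helper: the 1/2 factor comes from the issue text, the
    function itself is only an illustration.
    """
    return int(watchdog_timeout_s * 1000 / 2)


def apply_user_timeout(sock: socket.socket, watchdog_timeout_s: float) -> int:
    """Set TCP_USER_TIMEOUT on sock (if the platform offers it) and return the value."""
    timeout_ms = user_timeout_ms(watchdog_timeout_s)
    # TCP_USER_TIMEOUT is Linux-specific; look it up defensively.
    opt = getattr(socket, "TCP_USER_TIMEOUT", None)
    if opt is not None:
        # The kernel expects the timeout in milliseconds.
        sock.setsockopt(socket.IPPROTO_TCP, opt, timeout_ms)
    return timeout_ms
```

With a watchdog timeout of 10 seconds, a dead connection would then be reported after about 5 seconds of unacknowledged data, which leaves the other half of the watchdog window for reconnecting through another cluster-node.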

Behaviour with Pacemaker-Patch + SBD-Patch checking remaining cib-info for running resources on remote-node on cib-connection-lost:

1. fixed as above
2. fixed, as by the time the cib-connection is finally lost all resources have been brought
    down gracefully on the remote-node as well
3. fixed as resources on remote-node are brought down gracefully before the connection is cut
4. still a wanted Watchdog-Reboot as cib-connection is cut while resources are running on
    remote-node and no other cluster node is taking over
5. fixed as long as sbd-watchdog-timeout is long enough that remote-node-resource is
    shut down properly before watchdog
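
The check behind the SBD-Patch can be sketched roughly as follows. This is only a model of the decision, not the actual sbd code: `ResourceState` and `must_self_fence` are hypothetical names, and treating "no cib ever received" as a reason to reboot is an assumption made here to stay on the safe side.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ResourceState:
    """Hypothetical, simplified view of one resource entry in the last known cib."""
    name: str
    node: str
    active: bool


def must_self_fence(last_cib: Optional[List[ResourceState]], local_node: str) -> bool:
    """Decide on cib-connection-loss whether a watchdog-reboot is still wanted.

    Returns True if the most recently received cib shows resources running
    on local_node, or if no cib was ever received (assumption: be conservative).
    """
    if last_cib is None:
        return True
    return any(r.active and r.node == local_node for r in last_cib)


# Case 3 above: resources were stopped cleanly before the connection was cut.
stopped = [ResourceState("db", "remote1", False)]
# Case 4 above: connection cut while a resource is still running on the remote-node.
running = [ResourceState("db", "remote1", True)]
print(must_self_fence(stopped, "remote1"), must_self_fence(running, "remote1"))
```

Resources active on other nodes do not matter for the decision, only what the last cib says about the local remote-node.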

wenningerk, Aug 24 '16 16:08