
VIP stays up when Patroni is down/not reachable

Open wolbernd opened this issue 4 months ago • 0 comments

Steps to Reproduce

  • Two servers (serverA and serverB), each with Patroni and vip-manager installed and configured
  • dcs-type is set to patroni; all other trigger-related options are left at their defaults
  • serverA is currently the leader and holds the VIP
  • Stop Patroni on serverA (systemctl stop patroni)

Expected Behaviour

  • serverB becomes the database leader
  • vip-manager on serverB takes the VIP
  • vip-manager on serverA releases the VIP

Current Behaviour (vip-manager 4.0.0)

  • serverB becomes the leader
  • vip-manager on serverB activates the VIP
  • vip-manager on serverA does not release the VIP and even tries to take it back, although its DCS backend (Patroni) is not reachable
  • The VIP then switches back and forth between serverA and serverB because both believe they must hold it, which makes database connections unreliable

Logs

vip-manager on serverA:

Sep 30 13:22:18 serverA vip-manager[803251]: 2025-09-30T13:22:18.668+0200        ERROR        patroni REST API error:Get "http://127.0.0.1:8008//leader": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
Sep 30 13:22:18 serverA vip-manager[803251]: github.com/cybertec-postgresql/vip-manager/checker.(*PatroniLeaderChecker).GetChangeNotificationStream
Sep 30 13:22:18 serverA vip-manager[803251]:         /home/runner/work/vip-manager/vip-manager/checker/patroni_leader_checker.go:52
Sep 30 13:22:18 serverA vip-manager[803251]: main.main.func3
Sep 30 13:22:18 serverA vip-manager[803251]:         /home/runner/work/vip-manager/vip-manager/main.go:65
Sep 30 13:22:19 serverA vip-manager[803251]: 2025-09-30T13:22:19.669+0200        ERROR        patroni REST API error:Get "http://127.0.0.1:8008//leader": dial tcp 127.0.0.1:8008: connect: connection refused
[...]
Sep 30 13:22:29 serverA vip-manager[803251]: 2025-09-30T13:22:29.681+0200        ERROR        patroni REST API error:Get "http://127.0.0.1:8008//leader": dial tcp 127.0.0.1:8008: connect: connection refused
Sep 30 13:22:29 serverA vip-manager[803251]: github.com/cybertec-postgresql/vip-manager/checker.(*PatroniLeaderChecker).GetChangeNotificationStream
Sep 30 13:22:29 serverA vip-manager[803251]:         /home/runner/work/vip-manager/vip-manager/checker/patroni_leader_checker.go:52
Sep 30 13:22:29 serverA vip-manager[803251]: main.main.func3
Sep 30 13:22:29 serverA vip-manager[803251]:         /home/runner/work/vip-manager/vip-manager/main.go:65
Sep 30 13:22:29 serverA vip-manager[803251]: 2025-09-30T13:22:29.967+0200        INFO        IP address 10.0.99.64/24 is up, must be up

vip-manager on serverB:

Sep 30 13:21:49 serverB vip-manager[501796]: 2025-09-30T13:21:49.685+0200        INFO        IP address 10.0.99.64/24 is down, must be down
Sep 30 13:21:59 serverB vip-manager[501796]: 2025-09-30T13:21:59.685+0200        INFO        IP address 10.0.99.64/24 is down, must be down
Sep 30 13:22:09 serverB vip-manager[501796]: 2025-09-30T13:22:09.686+0200        INFO        IP address 10.0.99.64/24 is down, must be down
Sep 30 13:22:19 serverB vip-manager[501796]: 2025-09-30T13:22:19.592+0200        INFO        IP address 10.0.99.64/24 is down, must be up
Sep 30 13:22:19 serverB vip-manager[501796]: 2025-09-30T13:22:19.592+0200        INFO        Configuring address 10.0.99.64/24 on enp3s0
Sep 30 13:22:29 serverB vip-manager[501796]: 2025-09-30T13:22:29.603+0200        INFO        IP address 10.0.99.64/24 is up, must be up
Sep 30 13:22:39 serverB vip-manager[501796]: 2025-09-30T13:22:39.604+0200        INFO        IP address 10.0.99.64/24 is up, must be up

Possible Solution

One possible workaround would be to amend the systemd unit of vip-manager so that it starts and stops together with patroni:

[Unit]
Description=Manages Virtual IP for Patroni
After=network-online.target
Before=patroni.service
PartOf=patroni.service

[Service]
Type=simple

ExecStart=/usr/bin/vip-manager --config=/etc/default/vip-manager.yml

Restart=on-failure

[Install]
WantedBy=multi-user.target
WantedBy=patroni.service

However, this workaround only takes effect when the patroni systemd unit is stopped (either by a user or by systemd itself after the main process crashes). It would not trigger if the Patroni process hangs for some reason.

A better solution would be for vip-manager to release the VIP whenever the DCS endpoint is not reachable, since the leader role will probably not be on a server where Patroni is not running.
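
To make the proposal concrete, here is a minimal sketch in Go of the "release when unreachable" rule. It is not vip-manager's actual code: the names leaderState, probeLeader, vipDecider and maxFailures are hypothetical and exist only for illustration. It assumes that Patroni's /leader endpoint answers 200 only on the node holding the leader lock, and that a few consecutive failed probes should be tolerated before giving up the VIP so that a single timeout does not cause a release.

```go
package main

import (
	"context"
	"fmt"
	"net/http"
)

// leaderState is the outcome of one probe of the local Patroni REST API.
// (Hypothetical type, not part of vip-manager.)
type leaderState int

const (
	isLeader    leaderState = iota // endpoint answered 200
	notLeader                      // endpoint answered with another status
	unreachable                    // timeout, connection refused, etc.
)

// probeLeader queries the given /leader URL once and classifies the result.
// Any transport error is reported as unreachable instead of being ignored.
func probeLeader(ctx context.Context, client *http.Client, url string) leaderState {
	req, err := http.NewRequestWithContext(ctx, http.MethodGet, url, nil)
	if err != nil {
		return unreachable
	}
	resp, err := client.Do(req)
	if err != nil {
		return unreachable
	}
	defer resp.Body.Close()
	if resp.StatusCode == http.StatusOK {
		return isLeader
	}
	return notLeader
}

// vipDecider remembers the last decision and counts consecutive
// unreachable probes.
type vipDecider struct {
	maxFailures int  // unreachable probes tolerated before releasing the VIP
	failures    int  // current run of unreachable probes
	wantVIP     bool // whether this node should currently hold the VIP
}

// observe feeds one probe result in and returns whether the VIP should be up.
func (d *vipDecider) observe(s leaderState) bool {
	switch s {
	case isLeader:
		d.failures = 0
		d.wantVIP = true
	case notLeader:
		d.failures = 0
		d.wantVIP = false
	case unreachable:
		d.failures++
		if d.failures >= d.maxFailures {
			// Patroni has been silent for too long: stop claiming the VIP.
			d.wantVIP = false
		}
	}
	return d.wantVIP
}

func main() {
	d := &vipDecider{maxFailures: 3}
	// Simulated probe results: leader, then Patroni stops responding.
	for _, s := range []leaderState{isLeader, unreachable, unreachable, unreachable} {
		fmt.Println("VIP should be up:", d.observe(s))
	}
}
```

With a rule like this, serverA in the scenario above would drop the VIP after a bounded number of failed probes instead of logging "is up, must be up" while Patroni is unreachable. The threshold is a trade-off: a lower value releases the VIP sooner, a higher value tolerates short REST API hiccups.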

wolbernd, Sep 30 '25 11:09