vip-manager
VIP stays up when Patroni is down/not reachable
Steps to Reproduce
- two servers (serverA and serverB) each with patroni and vip-manager installed and configured
- dcs-type is set to patroni. All other trigger-related options are set to default
- Currently serverA is Leader and has the VIP
- Stop patroni on serverA (systemctl stop patroni)
Expected Behaviour
- serverB becomes db leader
- vip-manager on serverB takes VIP
- vip-manager on serverA releases VIP
Current Behaviour (vip-manager 4.0.0)
- serverB becomes the leader
- vip-manager on serverB activates the VIP
- vip-manager on serverA does not release the VIP and keeps trying to reclaim it, even though its DCS backend (patroni) is not reachable
- The VIP flips back and forth between serverA and serverB because both nodes believe they should hold it, which makes database connections unreliable
Logs
vip-manager on serverA:
Sep 30 13:22:18 serverA vip-manager[803251]: 2025-09-30T13:22:18.668+0200 ERROR patroni REST API error:Get "http://127.0.0.1:8008//leader": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
Sep 30 13:22:18 serverA vip-manager[803251]: github.com/cybertec-postgresql/vip-manager/checker.(*PatroniLeaderChecker).GetChangeNotificationStream
Sep 30 13:22:18 serverA vip-manager[803251]: /home/runner/work/vip-manager/vip-manager/checker/patroni_leader_checker.go:52
Sep 30 13:22:18 serverA vip-manager[803251]: main.main.func3
Sep 30 13:22:18 serverA vip-manager[803251]: /home/runner/work/vip-manager/vip-manager/main.go:65
Sep 30 13:22:19 serverA vip-manager[803251]: 2025-09-30T13:22:19.669+0200 ERROR patroni REST API error:Get "http://127.0.0.1:8008//leader": dial tcp 127.0.0.1:8008: connect: connection refused
[...]
Sep 30 13:22:29 serverA vip-manager[803251]: 2025-09-30T13:22:29.681+0200 ERROR patroni REST API error:Get "http://127.0.0.1:8008//leader": dial tcp 127.0.0.1:8008: connect: connection refused
Sep 30 13:22:29 serverA vip-manager[803251]: github.com/cybertec-postgresql/vip-manager/checker.(*PatroniLeaderChecker).GetChangeNotificationStream
Sep 30 13:22:29 serverA vip-manager[803251]: /home/runner/work/vip-manager/vip-manager/checker/patroni_leader_checker.go:52
Sep 30 13:22:29 serverA vip-manager[803251]: main.main.func3
Sep 30 13:22:29 serverA vip-manager[803251]: /home/runner/work/vip-manager/vip-manager/main.go:65
Sep 30 13:22:29 serverA vip-manager[803251]: 2025-09-30T13:22:29.967+0200 INFO IP address 10.0.99.64/24 is up, must be up
vip-manager on serverB:
Sep 30 13:21:49 serverB vip-manager[501796]: 2025-09-30T13:21:49.685+0200 INFO IP address 10.0.99.64/24 is down, must be down
Sep 30 13:21:59 serverB vip-manager[501796]: 2025-09-30T13:21:59.685+0200 INFO IP address 10.0.99.64/24 is down, must be down
Sep 30 13:22:09 serverB vip-manager[501796]: 2025-09-30T13:22:09.686+0200 INFO IP address 10.0.99.64/24 is down, must be down
Sep 30 13:22:19 serverB vip-manager[501796]: 2025-09-30T13:22:19.592+0200 INFO IP address 10.0.99.64/24 is down, must be up
Sep 30 13:22:19 serverB vip-manager[501796]: 2025-09-30T13:22:19.592+0200 INFO Configuring address 10.0.99.64/24 on enp3s0
Sep 30 13:22:29 serverB vip-manager[501796]: 2025-09-30T13:22:29.603+0200 INFO IP address 10.0.99.64/24 is up, must be up
Sep 30 13:22:39 serverB vip-manager[501796]: 2025-09-30T13:22:39.604+0200 INFO IP address 10.0.99.64/24 is up, must be up
Possible Solution
One possible workaround would be to amend the systemd unit of vip-manager so that it starts and stops together with patroni:
[Unit]
Description=Manages Virtual IP for Patroni
After=network-online.target
Before=patroni.service
PartOf=patroni.service
[Service]
Type=simple
ExecStart=/usr/bin/vip-manager --config=/etc/default/vip-manager.yml
Restart=on-failure
[Install]
WantedBy=multi-user.target
WantedBy=patroni.service
However, this workaround only helps when the systemd unit is stopped (either by a user or by systemd itself if the main process crashes). It would not trigger if the patroni process hangs for some reason.
A better solution would be to release the VIP whenever the DCS endpoint is not reachable, since the leader role will most likely not be on a server where patroni is not running.
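To make the proposal concrete, here is a minimal Go sketch of such a fail-safe policy. It is not taken from the vip-manager code base: the names pollLeader, vipDecider and maxFailures are hypothetical, and the httptest server is only a stand-in for Patroni's REST API. It relies on Patroni's /leader endpoint answering 200 only on the node that holds the leader lock, and it assumes that tolerating a small number of failed polls before releasing the VIP is acceptable.

```go
package main

import (
	"context"
	"fmt"
	"net/http"
	"net/http/httptest"
	"time"
)

// leaderState is the outcome of a single poll of the leader endpoint.
type leaderState int

const (
	notLeader leaderState = iota
	isLeader
	unreachable
)

// pollLeader (hypothetical helper) queries a Patroni-style /leader URL once.
// 200 means this node is the leader; any other status means it is not;
// a transport error (timeout, connection refused) is reported as unreachable.
func pollLeader(ctx context.Context, client *http.Client, url string) leaderState {
	req, err := http.NewRequestWithContext(ctx, http.MethodGet, url, nil)
	if err != nil {
		return unreachable
	}
	resp, err := client.Do(req)
	if err != nil {
		return unreachable
	}
	defer resp.Body.Close()
	if resp.StatusCode == http.StatusOK {
		return isLeader
	}
	return notLeader
}

// vipDecider (hypothetical) tracks whether this node should hold the VIP.
// maxFailures is an assumed knob: the number of consecutive unreachable
// polls after which the VIP is released.
type vipDecider struct {
	maxFailures int
	failures    int
	holdVIP     bool
}

// observe feeds one poll result into the decider and returns the new decision.
func (d *vipDecider) observe(s leaderState) bool {
	switch s {
	case isLeader:
		d.failures, d.holdVIP = 0, true
	case notLeader:
		d.failures, d.holdVIP = 0, false
	case unreachable:
		d.failures++
		if d.failures >= d.maxFailures {
			d.holdVIP = false // fail safe: give up the VIP instead of keeping stale state
		}
	}
	return d.holdVIP
}

func main() {
	// Stand-in for a Patroni node that currently holds the leader lock.
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(http.StatusOK)
	}))
	client := &http.Client{Timeout: time.Second}
	d := &vipDecider{maxFailures: 3}
	ctx := context.Background()

	fmt.Println("hold VIP:", d.observe(pollLeader(ctx, client, srv.URL+"/leader")))
	srv.Close() // roughly what "systemctl stop patroni" looks like to the checker
	for i := 0; i < 3; i++ {
		fmt.Println("hold VIP:", d.observe(pollLeader(ctx, client, srv.URL+"/leader")))
	}
}
```

With maxFailures set to 3, the decider keeps the VIP through the first two failed polls and drops it on the third. A threshold of 1 would release it immediately, at the cost of reacting to a single transient timeout. Unlike the systemd workaround, this kind of check would also cover a hung patroni process, because a request timeout is treated the same way as a refused connection.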