
panic on rolling out SWARM

universam1 opened this issue 3 years ago

Describe the bug

Upon rolling out deployment changes enabling swarm mode, the new pods error out with:

[APP]time="2022-05-09T12:46:57Z" level=error msg="SWARM: failed to parse the ip "
[APP]time="2022-05-09T12:46:57Z" level=error msg="SWARM: failed to parse the ip "
[APP]time="2022-05-09T12:46:57Z" level=error msg="SWARM: failed to parse the ip "
[APP]time="2022-05-09T12:46:57Z" level=error msg="SWARM: failed to parse the ip "
[APP]time="2022-05-09T12:46:57Z" level=error msg="SWARM: failed to parse the ip "
[APP]time="2022-05-09T12:46:57Z" level=error msg="SWARM: failed to parse the ip "
[APP]time="2022-05-09T12:46:57Z" level=info msg="SWARM: <nil> is going to join swarm of 9 nodes ([NodeInfo{name: skipper-ingress-5cc58d7889-c4zwf, 172.16.7.27:9990} NodeInfo{name: skipper-ingress-5cc58d7889-n58f4, 172.16.12.158:9990} NodeInfo{name: skipper-ingress-5cc58d7889-qbzt5, 172.16.15.209:9990} NodeInfo{name: skipper-ingress-5cc58d7889-qw6zk, 172.16.7.77:9990} NodeInfo{name: skipper-ingress-5cc58d7889-v828t, 172.16.23.212:9990} NodeInfo{name: skipper-ingress-6cfb658b7d-9n8vg, 172.16.7.133:9990} NodeInfo{name: skipper-ingress-6cfb658b7d-pjc6r, 172.16.20.170:9990} NodeInfo{name: skipper-ingress-6cfb658b7d-q98fx, 172.16.23.96:9990} NodeInfo{name: skipper-ingress-6cfb658b7d-vt5nl, 172.16.12.240:9990}])"
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x8 pc=0xe9e9cf]

goroutine 1 [running]:
github.com/zalando/skipper/swarm.Join({0x0, 0x400000, 0x12a05f200, 0x2706, 0xc00008d4a0, 0x0, 0x0, {0x0, 0x0}, 0x0}, ...)
    /workspace/swarm/swarm.go:209 +0x14f
github.com/zalando/skipper/swarm.Start({0x0, 0x400000, 0x12a05f200, 0x2706, 0xc00008d4a0, 0x0, 0x0, {0x0, 0x0}, 0x0})
    /workspace/swarm/swarm.go:197 +0x18e
github.com/zalando/skipper/swarm.newKubernetesSwarm({0x0, 0x400000, 0x12a05f200, 0x2706, 0xc00008d4a0, 0x0, 0x0, {0x0, 0x0}, 0x0})
    /workspace/swarm/swarm.go:189 +0x3e5
github.com/zalando/skipper/swarm.NewSwarm(0xc0000a6000?)
    /workspace/swarm/swarm.go:130 +0x1d8
github.com/zalando/skipper.run({0xa7a358200, {0x0, 0x0, 0x0}, {0x0, 0x0, 0x0}, {0x7ffd3100bb2b, 0x5}, 0x0, ...}, ...)
    /workspace/skipper.go:1421 +0x50e5
github.com/zalando/skipper.Run(...)
    /workspace/skipper.go:1720
main.main()
    /workspace/cmd/skipper/main.go:47 +0x12e
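
The two error lines right before the panic (an empty value after "failed to parse the ip", followed by "<nil> is going to join swarm") suggest that the local node descriptor is never built when the pod's own IP cannot be parsed. A minimal, self-contained sketch of that failure mode, using a toy node type and not Skipper's actual swarm types:

```go
package main

import (
	"fmt"
	"net"
)

// Toy stand-in for the local node descriptor built by the swarm code;
// the real type in skipper/swarm differs, this only shows the failure mode.
type localNode struct{ name string }

func main() {
	// An empty (or otherwise invalid) address makes net.ParseIP return nil,
	// matching the repeated `failed to parse the ip ` error lines above.
	if ip := net.ParseIP(""); ip == nil {
		fmt.Println(`level=error msg="SWARM: failed to parse the ip "`)
	}

	// The local node is consequently never constructed and stays nil,
	// which is why the next log line prints "<nil> is going to join swarm".
	var self *localNode
	fmt.Printf("SWARM: %v is going to join swarm of 9 nodes\n", self)

	// The first field access on the nil pointer then triggers the
	// "invalid memory address or nil pointer dereference" panic.
	fmt.Println(self.name)
}
```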

To Reproduce

  1. existing Skipper deployment
  2. update the deployment to enable the SWIM-based swarm: https://opensource.zalando.com/skipper/kubernetes/ingress-controller/#swim-based
  3. pods crash until the second-to-last pod has been rolled

Expected behavior

Skipper pods that do not yet have swarm enabled should be excluded from the SWARM node list.

universam1 · May 09 '22 14:05

This was answered in chat; for other people's benefit, I am posting the message here:

 We don't use it and for now have no time to check it really in depth. The nil pointer dereference you posted in the issue is something I never observed in my tests. I tried for a while to make the algorithm produce consistent results, but I think I have to write a test tool to understand whether we are using the data sharing wrong or something else is off.
 Redis is a quite reliable setup at around 1 CPU of usage, and you can use sharding to scale out. One Redis instance can handle up to 40k rps, but it does not scale linearly by a factor of 1, rather something like 0.7 (more feeling than data). In AWS EC2 the isolation seems to be a bit too weak for scaling linearly by a factor of 1.
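 To put that 0.7 factor into rough numbers (an interpretation, not measured data): if one Redis handles ~40k rps, two shards give you roughly 40k + 0.7 * 40k ≈ 68k rps, not the 80k that perfectly linear scaling would suggest.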

thanks for filing the issue.

szuecs · May 11 '22 20:05

So self is nil, and we should return an error in swarm.Join().
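
A minimal sketch of that kind of guard, with a toy join function and made-up type and error names rather than the real swarm.Join signature or the eventual fix:

```go
package main

import (
	"errors"
	"fmt"
	"net"
)

// nodeInfo loosely mirrors the NodeInfo entries from the log above;
// the real Skipper type differs.
type nodeInfo struct {
	name string
	addr net.IP
	port uint16
}

var errNoSelf = errors.New("swarm: local node (self) is unknown, refusing to join")

// join returns an error when self is nil instead of dereferencing it,
// so a misconfigured pod logs an error rather than crashing with SIGSEGV.
func join(self *nodeInfo, peers []*nodeInfo) error {
	if self == nil {
		return errNoSelf
	}
	fmt.Printf("SWARM: %s is going to join swarm of %d nodes\n", self.name, len(peers))
	return nil
}

func main() {
	peers := []*nodeInfo{
		{name: "skipper-ingress-5cc58d7889-c4zwf", addr: net.ParseIP("172.16.7.27"), port: 9990},
	}
	// self is nil here, e.g. because the pod's own IP could not be parsed.
	if err := join(nil, peers); err != nil {
		fmt.Println("error:", err)
	}
}
```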

szuecs · May 11 '22 20:05

Created a PR to fix this minor bug.

demonCoder95 · Oct 26 '22 21:10

@szuecs this can be marked as closed now.

demonCoder95 · Oct 27 '22 16:10