skipper
panic on rolling out SWARM
Describe the bug
Upon rolling out deployment changes that enable swarm mode, the new pods fail with:
[APP]time="2022-05-09T12:46:57Z" level=error msg="SWARM: failed to parse the ip "
[APP]time="2022-05-09T12:46:57Z" level=error msg="SWARM: failed to parse the ip "
[APP]time="2022-05-09T12:46:57Z" level=error msg="SWARM: failed to parse the ip "
[APP]time="2022-05-09T12:46:57Z" level=error msg="SWARM: failed to parse the ip "
[APP]time="2022-05-09T12:46:57Z" level=error msg="SWARM: failed to parse the ip "
[APP]time="2022-05-09T12:46:57Z" level=error msg="SWARM: failed to parse the ip "
[APP]time="2022-05-09T12:46:57Z" level=info msg="SWARM: <nil> is going to join swarm of 9 nodes ([NodeInfo{name: skipper-ingress-5cc58d7889-c4zwf, 172.16.7.27:9990} NodeInfo{name: skipper-ingress-5cc58d7889-n58f4, 172.16.12.158:9990} NodeInfo{name: skipper-ingress-5cc58d7889-qbzt5, 172.16.15.209:9990} NodeInfo{name: skipper-ingress-5cc58d7889-qw6zk, 172.16.7.77:9990} NodeInfo{name: skipper-ingress-5cc58d7889-v828t, 172.16.23.212:9990} NodeInfo{name: skipper-ingress-6cfb658b7d-9n8vg, 172.16.7.133:9990} NodeInfo{name: skipper-ingress-6cfb658b7d-pjc6r, 172.16.20.170:9990} NodeInfo{name: skipper-ingress-6cfb658b7d-q98fx, 172.16.23.96:9990} NodeInfo{name: skipper-ingress-6cfb658b7d-vt5nl, 172.16.12.240:9990}])"
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x8 pc=0xe9e9cf]
goroutine 1 [running]:
github.com/zalando/skipper/swarm.Join({0x0, 0x400000, 0x12a05f200, 0x2706, 0xc00008d4a0, 0x0, 0x0, {0x0, 0x0}, 0x0}, ...)
/workspace/swarm/swarm.go:209 +0x14f
github.com/zalando/skipper/swarm.Start({0x0, 0x400000, 0x12a05f200, 0x2706, 0xc00008d4a0, 0x0, 0x0, {0x0, 0x0}, 0x0})
/workspace/swarm/swarm.go:197 +0x18e
github.com/zalando/skipper/swarm.newKubernetesSwarm({0x0, 0x400000, 0x12a05f200, 0x2706, 0xc00008d4a0, 0x0, 0x0, {0x0, 0x0}, 0x0})
/workspace/swarm/swarm.go:189 +0x3e5
github.com/zalando/skipper/swarm.NewSwarm(0xc0000a6000?)
/workspace/swarm/swarm.go:130 +0x1d8
github.com/zalando/skipper.run({0xa7a358200, {0x0, 0x0, 0x0}, {0x0, 0x0, 0x0}, {0x7ffd3100bb2b, 0x5}, 0x0, ...}, ...)
/workspace/skipper.go:1421 +0x50e5
github.com/zalando/skipper.Run(...)
/workspace/skipper.go:1720
main.main()
/workspace/cmd/skipper/main.go:47 +0x12e
To Reproduce
- existing Skipper deployment
- update deployment to enable SWIM: https://opensource.zalando.com/skipper/kubernetes/ingress-controller/#swim-based
- pods crash until the second-to-last pod is rolled
Expected behavior
Skipper pods that do not yet have swarm enabled should be excluded from the SWARM list.
This was answered in chat; I am posting the message here for other people:
We don't use it, and for now we have no time to check it in depth. I never observed the nil pointer dereference you posted in the issue in my tests. I tried for a while to make the algorithm produce consistent results, but I think I have to write a test tool to understand whether we use the data sharing incorrectly or something else is wrong.
Redis is a quite reliable setup: run each instance with 1 CPU and use sharding to scale out. One Redis instance can handle up to 40k rps, but it does not scale linearly by a factor of 1; the factor is more like 0.7 (more feeling than data). In AWS EC2 the isolation seems to be a bit too weak for scaling linearly by a factor of 1.
Thanks for filing the issue.
So self is nil (the log line above prints "SWARM: <nil> is going to join swarm"), and we should return an error in swarm.Join() instead of dereferencing it.
Created a PR to fix this minor bug.
@szuecs this can be marked as closed now.