
panic on rolling out SWARM

universam1 opened this issue 3 years ago

Describe the bug

Upon rolling out deployment changes enabling swarm mode, the new pods error out with:

[APP]time="2022-05-09T12:46:57Z" level=error msg="SWARM: failed to parse the ip "
[APP]time="2022-05-09T12:46:57Z" level=error msg="SWARM: failed to parse the ip "
[APP]time="2022-05-09T12:46:57Z" level=error msg="SWARM: failed to parse the ip "
[APP]time="2022-05-09T12:46:57Z" level=error msg="SWARM: failed to parse the ip "
[APP]time="2022-05-09T12:46:57Z" level=error msg="SWARM: failed to parse the ip "
[APP]time="2022-05-09T12:46:57Z" level=error msg="SWARM: failed to parse the ip "
[APP]time="2022-05-09T12:46:57Z" level=info msg="SWARM: <nil> is going to join swarm of 9 nodes ([NodeInfo{name: skipper-ingress-5cc58d7889-c4zwf, 172.16.7.27:9990} NodeInfo{name: skipper-ingress-5cc58d7889-n58f4, 172.16.12.158:9990} NodeInfo{name: skipper-ingress-5cc58d7889-qbzt5, 172.16.15.209:9990} NodeInfo{name: skipper-ingress-5cc58d7889-qw6zk, 172.16.7.77:9990} NodeInfo{name: skipper-ingress-5cc58d7889-v828t, 172.16.23.212:9990} NodeInfo{name: skipper-ingress-6cfb658b7d-9n8vg, 172.16.7.133:9990} NodeInfo{name: skipper-ingress-6cfb658b7d-pjc6r, 172.16.20.170:9990} NodeInfo{name: skipper-ingress-6cfb658b7d-q98fx, 172.16.23.96:9990} NodeInfo{name: skipper-ingress-6cfb658b7d-vt5nl, 172.16.12.240:9990}])"
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x8 pc=0xe9e9cf]

goroutine 1 [running]:
github.com/zalando/skipper/swarm.Join({0x0, 0x400000, 0x12a05f200, 0x2706, 0xc00008d4a0, 0x0, 0x0, {0x0, 0x0}, 0x0}, ...)
    /workspace/swarm/swarm.go:209 +0x14f
github.com/zalando/skipper/swarm.Start({0x0, 0x400000, 0x12a05f200, 0x2706, 0xc00008d4a0, 0x0, 0x0, {0x0, 0x0}, 0x0})
    /workspace/swarm/swarm.go:197 +0x18e
github.com/zalando/skipper/swarm.newKubernetesSwarm({0x0, 0x400000, 0x12a05f200, 0x2706, 0xc00008d4a0, 0x0, 0x0, {0x0, 0x0}, 0x0})
    /workspace/swarm/swarm.go:189 +0x3e5
github.com/zalando/skipper/swarm.NewSwarm(0xc0000a6000?)
    /workspace/swarm/swarm.go:130 +0x1d8
github.com/zalando/skipper.run({0xa7a358200, {0x0, 0x0, 0x0}, {0x0, 0x0, 0x0}, {0x7ffd3100bb2b, 0x5}, 0x0, ...}, ...)
    /workspace/skipper.go:1421 +0x50e5
github.com/zalando/skipper.Run(...)
    /workspace/skipper.go:1720
main.main()
    /workspace/cmd/skipper/main.go:47 +0x12e
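
The two error lines right before the panic (an empty value after "failed to parse the ip", followed by "<nil> is going to join swarm") suggest that the local node descriptor is never built when the pod's own IP cannot be parsed. A minimal, self-contained sketch of that failure mode, using a toy node type and not Skipper's actual swarm types:

```go
package main

import (
	"fmt"
	"net"
)

// Toy stand-in for the local node descriptor built by the swarm code;
// the real type in skipper/swarm differs, this only shows the failure mode.
type localNode struct{ name string }

func main() {
	// An empty (or otherwise invalid) address makes net.ParseIP return nil,
	// matching the repeated `failed to parse the ip ` error lines above.
	if ip := net.ParseIP(""); ip == nil {
		fmt.Println(`level=error msg="SWARM: failed to parse the ip "`)
	}

	// The local node is consequently never constructed and stays nil,
	// which is why the next log line prints "<nil> is going to join swarm".
	var self *localNode
	fmt.Printf("SWARM: %v is going to join swarm of 9 nodes\n", self)

	// The first field access on the nil pointer then triggers the
	// "invalid memory address or nil pointer dereference" panic.
	fmt.Println(self.name)
}
```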

To Reproduce

  1. existing Skipper deployment
  2. update the deployment to enable the SWIM-based swarm: https://opensource.zalando.com/skipper/kubernetes/ingress-controller/#swim-based
  3. pods crash until the second-to-last pod has been rolled

Expected behavior

Skipper pods that do not yet have swarm enabled should be excluded from the SWARM node list.

universam1 · May 09 '22 14:05

This was answered in chat; for other people's benefit, I am posting the message here:

 We don't use it and for now have no time to check it really in depth. The nil pointer dereference you posted in the issue is something I never observed in my tests. I tried for a while to make the algorithm produce consistent results, but I think I have to write a test tool to understand whether we are using the data sharing wrong or something else is off.
 Redis is a quite reliable setup at around 1 CPU of usage, and you can use sharding to scale out. One Redis instance can handle up to 40k rps, but it does not scale linearly by a factor of 1, rather something like 0.7 (more feeling than data). In AWS EC2 the isolation seems to be a bit too weak for scaling linearly by a factor of 1.
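 To put that 0.7 factor into rough numbers (an interpretation, not measured data): if one Redis handles ~40k rps, two shards give you roughly 40k + 0.7 * 40k ≈ 68k rps, not the 80k that perfectly linear scaling would suggest.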

thanks for filing the issue.

szuecs · May 11 '22 20:05

So self is nil, and we should return an error in swarm.Join().
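
A minimal sketch of that kind of guard, with a toy join function and made-up type and error names rather than the real swarm.Join signature or the eventual fix:

```go
package main

import (
	"errors"
	"fmt"
	"net"
)

// nodeInfo loosely mirrors the NodeInfo entries from the log above;
// the real Skipper type differs.
type nodeInfo struct {
	name string
	addr net.IP
	port uint16
}

var errNoSelf = errors.New("swarm: local node (self) is unknown, refusing to join")

// join returns an error when self is nil instead of dereferencing it,
// so a misconfigured pod logs an error rather than crashing with SIGSEGV.
func join(self *nodeInfo, peers []*nodeInfo) error {
	if self == nil {
		return errNoSelf
	}
	fmt.Printf("SWARM: %s is going to join swarm of %d nodes\n", self.name, len(peers))
	return nil
}

func main() {
	peers := []*nodeInfo{
		{name: "skipper-ingress-5cc58d7889-c4zwf", addr: net.ParseIP("172.16.7.27"), port: 9990},
	}
	// self is nil here, e.g. because the pod's own IP could not be parsed.
	if err := join(nil, peers); err != nil {
		fmt.Println("error:", err)
	}
}
```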

szuecs · May 11 '22 20:05

Created a PR to fix this minor bug.

demonCoder95 · Oct 26 '22 21:10

@szuecs this can be marked as closed now.

demonCoder95 · Oct 27 '22 16:10