dispatch icon indicating copy to clipboard operation
dispatch copied to clipboard

Image manager gets restarted on 1vCPU k8s nodes

Open kars7e opened this issue 7 years ago • 1 comments

This was discovered when working on #213. My initial guess was that it's related to docker pull errors, but it turned out the root cause is different. When running on 1vCPU k8s node, the sync operation starves other goroutines, including the healthz handler, causing the entire pod to be restarted by kubernetes. this in turns causes weird error messages, and fails e2e tests. The workaround is to use larger k8s nodes, but it's worth investigating if we can make it work reliably on 1vCPU as well.

  • [x] Bug
  • [ ] Feature
  • [ ] Enhancement

Detailed Description

I was seeing a lot of issues and couldn't get a stable e2e run on GKE until I increased the size of cluster nodes from 1vCPU to 2vCPUs. after that, everything works flawlessly.

On 1vCPU, I was seeing very weird errors, including an error in image manager. image manager pod was getting restarted because of health check failure. It seems like when sync was in progress, the healthz endpoint was not responding - which would indicate issues with concurrency on 1vCPU. Seems like go scheduler gets GOMAXPROCS set to 1, and because of that the healthz handler goroutine will not get scheduled until sync is completed (and it does not get completed because it gets killed by k8s). We might want to investigate it further and see if we can improve the situation on 1vCPU-based nodes (e.g. using Gosched()).

Here is image manager log:

time="2018-02-10T18:17:01Z" level=debug msg="[BEGIN] [github.com/vmware/dispatch/pkg/controller.(*DefaultController).run:205]"
time="2018-02-10T18:17:01Z" level=debug msg="[BEGIN] [github.com/vmware/dispatch/pkg/controller.(*DefaultController).run.func1:214]"
2018/02/10 18:17:01 Serving image manager at http://[::]:80
2018/02/10 18:17:01 Serving image manager at https://[::]:443
time="2018-02-10T18:17:11Z" level=debug msg="[BEGIN] [github.com/vmware/dispatch/pkg/controller.(*DefaultController).run.func2.1:236]"
time="2018-02-10T18:17:11Z" level=info msg="periodic syncing with the underlying driver"
time="2018-02-10T18:17:11Z" level=debug msg="[BEGIN] [github.com/vmware/dispatch/pkg/controller.(*DefaultController).sync:186]"
time="2018-02-10T18:17:11Z" level=debug msg="[BEGIN] [github.com/vmware/dispatch/pkg/image-manager.(*baseImageEntityHandler).Sync:77]"
time="2018-02-10T18:17:11Z" level=debug msg="[BEGIN] [github.com/vmware/dispatch/pkg/image-manager.(*BaseImageBuilder).baseImageStatus:176]"
time="2018-02-10T18:17:11Z" level=debug msg="[END  ] [github.com/vmware/dispatch/pkg/image-manager.(*BaseImageBuilder).baseImageStatus:176] [8.262664ms] "
time="2018-02-10T18:17:11Z" level=debug msg="[END  ] [github.com/vmware/dispatch/pkg/image-manager.(*baseImageEntityHandler).Sync:77] [8.347663ms] "
time="2018-02-10T18:17:11Z" level=debug msg="[BEGIN] [github.com/vmware/dispatch/pkg/image-manager.(*imageEntityHandler).Sync:148]"
time="2018-02-10T18:17:11Z" level=debug msg="[BEGIN] [github.com/vmware/dispatch/pkg/image-manager.DockerImageStatus:125]"
time="2018-02-10T18:17:11Z" level=debug msg="[END  ] [github.com/vmware/dispatch/pkg/image-manager.DockerImageStatus:125] [1.731903ms] "
time="2018-02-10T18:17:11Z" level=debug msg="[END  ] [github.com/vmware/dispatch/pkg/image-manager.(*imageEntityHandler).Sync:148] [3.746689ms] "
time="2018-02-10T18:17:11Z" level=debug msg="[END  ] [github.com/vmware/dispatch/pkg/controller.(*DefaultController).sync:186] [12.220319ms] "
time="2018-02-10T18:17:11Z" level=debug msg="[END  ] [github.com/vmware/dispatch/pkg/controller.(*DefaultController).run.func2.1:236] [12.601615ms] "
2018/02/10 18:17:19 shutdown initiated
2018/02/10 18:17:19 shutdown initiated
2018/02/10 18:17:19 Stopped serving image manager at https://[::]:443
time="2018-02-10T18:17:21Z" level=debug msg="[BEGIN] [github.com/vmware/dispatch/pkg/controller.(*DefaultController).run.func2.1:236]"
time="2018-02-10T18:17:21Z" level=info msg="periodic syncing with the underlying driver"
time="2018-02-10T18:17:21Z" level=debug msg="[BEGIN] [github.com/vmware/dispatch/pkg/controller.(*DefaultController).sync:186]"
time="2018-02-10T18:17:21Z" level=debug msg="[BEGIN] [github.com/vmware/dispatch/pkg/image-manager.(*baseImageEntityHandler).Sync:77]"
time="2018-02-10T18:17:21Z" level=debug msg="[BEGIN] [github.com/vmware/dispatch/pkg/image-manager.(*BaseImageBuilder).baseImageStatus:176]"
time="2018-02-10T18:17:21Z" level=debug msg="[END  ] [github.com/vmware/dispatch/pkg/image-manager.(*BaseImageBuilder).baseImageStatus:176] [5.171318ms] "
time="2018-02-10T18:17:21Z" level=debug msg="[END  ] [github.com/vmware/dispatch/pkg/image-manager.(*baseImageEntityHandler).Sync:77] [5.267919ms] "
time="2018-02-10T18:17:21Z" level=debug msg="[BEGIN] [github.com/vmware/dispatch/pkg/image-manager.(*imageEntityHandler).Sync:148]"
time="2018-02-10T18:17:21Z" level=debug msg="[BEGIN] [github.com/vmware/dispatch/pkg/image-manager.DockerImageStatus:125]"
time="2018-02-10T18:17:21Z" level=debug msg="[END  ] [github.com/vmware/dispatch/pkg/image-manager.DockerImageStatus:125] [905.919µs] "
time="2018-02-10T18:17:21Z" level=debug msg="[END  ] [github.com/vmware/dispatch/pkg/image-manager.(*imageEntityHandler).Sync:148] [4.644785ms] "
time="2018-02-10T18:17:21Z" level=debug msg="[END  ] [github.com/vmware/dispatch/pkg/controller.(*DefaultController).sync:186] [9.976817ms] "
time="2018-02-10T18:17:21Z" level=debug msg="[END  ] [github.com/vmware/dispatch/pkg/controller.(*DefaultController).run.func2.1:236] [14.139172ms] "
2018/02/10 18:17:29 Stopped serving image manager at http://[::]:80
time="2018-02-10T18:17:29Z" level=debug msg="[BEGIN] [github.com/vmware/dispatch/pkg/controller.(*DefaultController).Shutdown:103]"
time="2018-02-10T18:17:29Z" level=debug msg="[END  ] [github.com/vmware/dispatch/pkg/controller.(*DefaultController).Shutdown:103] [114.016µs] "

kars7e avatar Feb 10 '18 01:02 kars7e

/subscribe

alexellis avatar May 17 '18 19:05 alexellis