worker icon indicating copy to clipboard operation
worker copied to clipboard

feat(k8s): Make RunContainer detect/report more error conditions like image pull errors

Open cognifloyd opened this issue 3 years ago • 2 comments

This is an alternate implementation of #279 using the SharedInformer infra to watch events introduced in #519. Closes #279

Overview

The Docker Runtime surfaces image pull errors naturally because of the blocking synchronous requests to pull an image or start a container.

The Kubernetes Runtime has to watch for events that show any errors with pulling images.

RunContainer changes in k8s runtime

Wait until one of these conditions before returning:

  • build canceled
  • container is running
  • got an image pull error

WaitContainer changes in k8s runtime

Wait until one of these conditions before returning:

  • build canceled (new)
  • container is terminated (was already doing this)

containerTracker additions

I added signal channels for Running and ImagePulled. I also added a channel for receiving image pull errors in RunContainer.

podTracker changes

I need pod Name and Namespace in more places, so I copy those into the podTracker now.

I was surprised that the SharedInformerFactory I used to create the Informer/Lister for watching the build Pod did NOT work for watching events. Apparently, events don't get labeled, so I had to add a separate eventInformerFactory that uses a FieldSelector instead of a LabelSelector when making the list/watch API calls.

I did not rename informerFactory to podInformerFactory because it can easily be used for other k8s resources in the future (if needed), just not for Events.

I also added the Event event handler funcs (wow that was a mouth full - events about Events).

podTracker.inspectContainerEvent() (called by the Event event handlers) is responsible for

  • sending the image pull events to RunContainer
  • closing the ImagePulled signal channel when an event indicates that the pull was successful

I also adjusted podTracker.inspectContainerStatuses() so that it can also:

  • surface image pull errors (in case the error is visible there, but not through the Events API - which is unlikely, but there doesn't seem to be any kind of guarantees about event availability, so this covers our bases)
  • close the Running signal channel when the container status shows that it is running.

TODO:

  • [x] merge #302
  • [x] rebase
  • [ ] write tests

cognifloyd avatar Apr 30 '22 00:04 cognifloyd

Codecov Report

Merging #331 (65cbe59) into master (a9cbe48) will increase coverage by 1.01%. The diff coverage is 100.00%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #331      +/-   ##
==========================================
+ Coverage   81.18%   82.20%   +1.01%     
==========================================
  Files          68       68              
  Lines        5273     5472     +199     
==========================================
+ Hits         4281     4498     +217     
+ Misses        848      831      -17     
+ Partials      144      143       -1     
Impacted Files Coverage Δ
runtime/kubernetes/build.go 88.60% <100.00%> (+1.55%) :arrow_up:
runtime/kubernetes/container.go 91.64% <100.00%> (+9.61%) :arrow_up:
runtime/kubernetes/pod_tracker.go 97.37% <100.00%> (+1.51%) :arrow_up:

codecov[bot] avatar May 03 '22 20:05 codecov[bot]

I'm running into some issues with this, so I'm moving it back to draft for now.

cognifloyd avatar May 07 '22 17:05 cognifloyd