Allow default configuration override with annotations
It would be nice to be able to override default pod-reaper settings with annotations.
For example if MAX_DURATION=1d, but a pod had the annotation pod-reaper/maxduration: 12h, then that pod would be reaped in 12 hours.
@JordanSussman Would you propose that the pod defined label(s) should always override the pod-reaper defined behavior? Feels like a minimum the default behavior should be to ignore pod defined labels unless a configuration option has been defined on pod-reaper. Suppose you could also go as granular as to define certain "whitelist" of options that pods are allowed to override.
For simplicity, I think annotations should always override the default configuration if present.
A few thoughts I had on this before I dive into implementation.
There are two kinds of environment variables used to configure pod-reaper. One set is specific to the reaper itself (how often it looks at pods, what namespaces it looks at, etc). The other set defines the rules -- these are the things where we can have pods specify their own overrides.
Overriding rules on a per pod basis means that those rules wouldn't know anything about the first set of rules (this is particularly important for things like random chance -- where checking more/less often matters)
There's another kind of interesting case. Suppose we have a reaper checking every 1 hour for anything that's been running in a specific namespace for > 3 days -- but a pod in that namespace using an annotation to specify an unrelated rule -- like container status is in ImagePullBackoff. That's not really an override, but a declaration of another rule. Would we want this reaper to honor that "new" rule?
Right now, I'm thinking of the following:
- add a configurable environment variable/rule to honor rules specified in pod annotations
- make sure that rules can load values from pod annotations
Now for some fun stuff!
What do we want the behaviors to be in the following situations?
- A rule is defined on the reaper, and the same rule on a specific pods Assuming that we're honoring pod annotations, this feels like like the pod should override
- A rule is defined on the reaper that is not defined on the pod Default behavior of the reaper
- a rule is defined on the pod that is not defined in the reaper
This is where things get dicey. Should a reaper honor a rule specification on a pod when it doesn't have a configuration for that rule? To potentially complicate it even more -- if it should honor them, then do we
andthe pod specific rules with the other rules defined by the reaper?
So let's say we have pod-reaper/maxduration: 12h annotation on a pod, but the reaper itself isn't configured with any value for MAX_DURATION -- should the reaper honor the annotation? If the answer is "no", then I think this much more simple, but also less powerful.
Additional thought: this really brings to a light a problem with the chaos chance rule because if it's specified as a pod annotation, then there's a separation between where the chance is defined and how often that chance is applied (the reaper's scheduling) which is just begging for trouble. @andrewwipf and I actually did some white-boarding a couple solutions to that problem (that I think I'd like to implement independently of this feature).
Should a reaper honor a rule specification on a pod when it doesn't have a configuration for that rule?
My initial idea was “no” to keep things simple. However that will probably lead to confusion.
I think the following pseudo code will handle all three cases:
rules := loadAllRules()
foreach rule in rules {
pods := getAllPodsInNs()
foreach pod in pods{
setting := getEnv(“DEFAULT_SETTING”)
annotation = getAnnotation(“key”)
if annotation != nil
setting = annotation.value
if setting = nil
continue
shouldDelete := evaluateRule(setting)
if shouldDelete
delete(pod)
}
}
I agree with that overall approach! Weirdly enough, it's more refactoring than I would have originally thought -- but I'm game! I'll work on a PR for this (hopefully over the weekend here)!
Worked on this a bit. Doing this in a few steps (in part because I'm a lunatic that likes refactoring). I'll be updating a rule-refactor branch as I go.
Originally, with rules only configured at the reaper level, only the rules that were configured were "loaded". This separated the configuration of the reaper (which happened exactly once at startup) from the actual checking of the rules. If we want to enable overrides from things configured outside the reaper itself, rule configuration needs to be more dynamic. To accomplish that, I want to remove the loading phase of this altogether, and instead always be checking each rule -- letting the rule determine if it should be applied or not.
This leads to a ternary situation. There are now three cases for a rule evaluation:
- reap the pod
- spare the pod
- ignore this rule (it's not configured)
We avoided case 3 originally because rules that weren't configured were never loaded. Here's the priority chain to keep everything consistent:
- if any rule saying "spare" --> the pod will not be reaped
- else if there are any number rules saying "reap" --> the pod will be reaped
- if NO rules say "reap" the pod is spared
So far, I think this is going well. Just taking me a longer than I wanted. I really convinced myself that this could be a fun feature when I thought the use case where rather than overriding settings, instead there was a generic reaper running with no-environment variable defined rules. Instead it's an entirely "opt-in with your own per pod rules".
One more minor side note: chaos chance does NOT work well with a pod level override. It really needs a time factor to be configured WITH the chance in order to make sense. I have a couple ideas about how to do that, but it'll be separate from this work!
update: have a working reaper where rules are evaluated on the ternary logic of {reap, spare, ignore} as well as having implemented one rule in this way. From here the next step is really simple: tweaking the rules to match the changed interface.
After all this, adding pod-level overrides will be straightforward!
Okay. Feeling pretty good about refactoring how rules work in order to help accommodate this request. For right now, the rework should have functional feature parity (I have changed log messages a bit).
There are still a few things I want to test out with this, but it should be a total drop-in-place of an older version.
Docker container: brianberzins/pod-reaper:alpha-02212020
From here, the request to add pod-level overrides should be MUCH easier than before.
I'm also a pretty big fan of how the rework simplified what a "rule" is away from being a structure and instead being type rule func(v1.Pod) (result, string)
#45 has a work-in-progress of a rule refactor, but I feel like I'm hitting a point where "it doesn't feel right" again.
For a rules to have multiple sources of values leaves me with a lot of questions about what pod-reaper looks like it should be doing from the multiple perspectives. So I'm going to try to write out a super specific situation and see what people think. Using the example above!
Suppose we have the following:
Pod reaper is configured with:
-
MAX_DURATION=1d
A specific pod has the following annotations:
-
pod-reaper/max_duration: 12h -
pod-reaper/pod_status: Evicted
What do we think it should do when it encounters a pod that has been, and is still running (not Evicted) at 12h1m?
If I only saw the pod-reaper configuration, I would describe the reaper as "kills pods that have been running for more than 1 day" so I would say "don't kill the pod".
If I only saw the pod annotations, I would describe the kill conditions as "the pod is at least 12 hours old AND has been evicted" so I would say "don't kill the pod"
If we honor overrides, but only for rules that have been configured (or if we opted to do something like ENABLE_RULE=max_duration and something like HONOR_POD_ANNOTATION_OVERRIDES=true then I would say "only max duration is enabled, ignore the pod_status annotation, and kill the pod". You would really need to see both sides (reaper and pod configuration) for this to make sense.
Maybe that "AND" there is what's really causing my confusion here.
If you want to "OR" two rules, you can just make two reapers (one with each rule) and then if either one says reap, the pod is killed. Pod annotations don't really make sense in this context, as they'd apply to any/all reapers that are looking for them.
Maybe the solution here is to have two specific configuration options here:
- explicitly enabling rules, and treating pod annotations as overrides to those specific values
- ignore all local rules, and honor the set of annotations on a pod as a collection of rules
At this point it sounds like I'm overthinking things... but what I've done for #45 doesn't really make sense for the first of these (which I think is more in line with the topic of this issue). It also doesn't really make sense of the second of these -- but for different reasons. As much as I like the idea of the second case, it really feels like a super specific use case that I shouldn't consider right now.
Going to keep thinking about this.
In general, I like the direction of the refactor. The rules returning a named result instead of a boolean, makes things much clearer what the expected behavior is without having to jump around in the code.
I liked
- if any rule saying "spare" --> the pod will not be reaped
- else if there are any number rules saying "reap" --> the pod will be reaped
- if NO rules say "reap" the pod is spared
I would add 4. Pod annotations will enable that rule on a pod and override global configuration (if any).
For the Duration example above, since there is an annotation on the pod, I would expect it to be evicted at 12h1m
Consider the reverse scenario:
-
MAX_DURATION=12h
Pod annotations:
-
pod-reaper/max_duration: 1d
In this case, I would expect the pod to be killed in 1 day, even though the default is 12h.
I do think explicitly enabling rules would probably lead to the least amount of confusion, but would break backward compatibility with the current configuration. I would release it as v3.0 to indicate that there are breaking changes, and clearly document it.
Alright. Way more thought than I'd have liked later, here's my plan.
- move forward with #45 (after any cleanup people would like to see done)
- add in the ability to explicitly define which rules are used (if unset, "enabling all" wouldn't break backwards comparability -- all for discussion there)
- implement pod level overrides as simply as I can think of.
For implementing pod level overrides.
- add a flag to honor pod level annotations
- annotations act as overrides (or initial configuration) for any/all enabled rule
- add in some level of abstraction for getting the pod-reaper configuration key (env variable), override annotation, and name (for use when explicitly enabling rules)
The goal of the last piece is to reduce code duplication and make it easier to work. I'm not exactly sure how that'll look right now, but they idea would be start with a list of all rules, filter out anything not enabled, then provide common methods for getting their configuration value.
Something like func(r *rule) getConfigValue(p v1.Pod) (configured cool, value string)
which can be generalized for all rules.
@brianberzins Have you made any more progress on annotations? I just ran into a situation where they would be very useful. I'm willing to lend a hand getting this implemented.
I need dive back into this! It's looking like I was moving towards a very different setup with #45 that would make this a lot easier, but there have been a couple of things since that I'd need to make sure still work as expected.
I had an immediate need, so I went ahead and implemented this in #64. I'm open to changes.
I owe you a code review! Hoping I can get to it this evening!
On Wed, Jan 20, 2021, 9:32 AM Jason Harmon [email protected] wrote:
I had an immediate need, so I went ahead and implemented this in #64 https://github.com/target/pod-reaper/pull/64. I'm open to changes.
— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/target/pod-reaper/issues/44#issuecomment-763712519, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAS2AGHVS6GQ54N5QTFHH3TS23ZRJANCNFSM4KUTKXSQ .