Automatic cleaning of *Run objects
Feature request
Users are wondering how Tekton can automatically clean up old TaskRun/PipelineRun objects. Currently they stay in the cluster, and it's up to the users to clean them up.
https://github.com/tektoncd/pipeline/issues/2856 https://github.com/tektoncd/pipeline/issues/1334
The consensus is to wait for this Kubernetes feature, hopefully 🤞🏽 coming in 1.21:
https://github.com/kubernetes/enhancements/issues/592
There is a workaround: use the --keep option of the tkn delete command:
```shell
% tkn pipelinerun delete --help | grep keep
      --keep int   Keep n most recent number of PipelineRuns
```
Run it periodically with a CronJob.
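As a minimal sketch of that workaround, a Kubernetes CronJob could invoke tkn on a schedule. The ServiceAccount name, image, and schedule below are illustrative assumptions; the ServiceAccount would need permission to list and delete PipelineRuns in the namespace:

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: pipelinerun-pruner        # illustrative name
spec:
  schedule: "0 3 * * *"           # daily at 03:00
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: tekton-pruner       # assumed SA with delete rights on PipelineRuns
          restartPolicy: Never
          containers:
            - name: prune
              image: example.com/tkn:latest       # assumed image bundling the tkn CLI
              command:
                - tkn
                - pipelinerun
                - delete
                - --keep=5
                - --force       # skip the interactive confirmation prompt
```

The drawback, as noted below, is that this object has to be created in every namespace that contains Tekton runs.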
Another approach is a proposed (not yet merged) Task:
https://github.com/tektoncd/catalog/pull/572
that can be installed by the namespace admin to clean up objects by age or by count.
The biggest disadvantage of these solutions (unless TTL expiration on objects can be set up in the Tekton controllers) is that an admin needs to set them up in every namespace containing Tekton objects.
I would propose to have this feature on the Operator.
The idea would be :
- Let the admin configure an expiration globally, either by age (e.g. keep only the last five days of runs) or by count (e.g. keep only the last five runs).
- Let a ConfigMap in a namespace override the global policy.
The operator would take care of cleaning the old runs belonging to a pipeline according to that policy.
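To make the override semantics concrete, here is a minimal Python sketch of how the operator might resolve the effective policy for a namespace. The key names (`keep`, `keep-since-days`) and the ConfigMap convention are illustrative assumptions, not an existing Tekton API:

```python
# Cluster-wide defaults the admin configures on the operator (assumed shape).
GLOBAL_POLICY = {'keep': 5, 'keep-since-days': None}


def effective_policy(namespace_configmap_data, global_policy=GLOBAL_POLICY):
    """Resolve the pruning policy for one namespace.

    A namespace-level ConfigMap (passed here as its .data dict, or None when
    absent) overrides the global policy key by key; unknown keys are ignored.
    """
    policy = dict(global_policy)
    if namespace_configmap_data:
        for key in policy:
            if key in namespace_configmap_data:
                # ConfigMap values are strings; coerce to int for the operator.
                policy[key] = int(namespace_configmap_data[key])
    return policy
```

With this shape, a namespace that only sets `keep` still inherits the global `keep-since-days`, which keeps per-namespace overrides small.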
Use case
As a user, I have too many Runs piling up on a cluster and need to clean them up. Setting up crons in every namespace where I have a Pipeline or Task can get tedious; I want to automate this for all my users and be able to change the policy easily at the cluster level.
Implementation ideas
There are probably a few ways to achieve this that we may want to discuss:
- Install a CronJob in every namespace that contains Tekton objects. The cron checks whether there is a configuration in the current namespace or globally, then runs the cleanup task (which could be something like https://github.com/tektoncd/catalog/pull/572) according to those values.
- Run a global cron in the operator namespace that looks across all namespaces for Pipelines/Tasks, checks for a configuration in each namespace (falling back to the global one), and applies a cleanup task to that namespace accordingly.
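The control flow of the second option could be sketched like this. Every `client` method here is a hypothetical placeholder standing in for real Kubernetes API calls, not an actual client library API:

```python
def sweep(client, global_policy):
    """One cluster-level pass: prune runs in every namespace with Tekton objects.

    `client` is assumed to expose:
      - list_namespaces()           -> iterable of namespace names
      - has_tekton_objects(ns)      -> bool
      - read_policy_configmap(ns)   -> policy dict, or None when absent
      - run_cleanup_task(ns, pol)   -> triggers the cleanup task in ns
    """
    for namespace in client.list_namespaces():
        if not client.has_tekton_objects(namespace):
            continue
        # Per-namespace ConfigMap wins; otherwise fall back to the global policy.
        policy = client.read_policy_configmap(namespace) or global_policy
        client.run_cleanup_task(namespace, policy)
```

Compared to the first option, this keeps all pruning machinery in the operator namespace, at the cost of the operator needing cluster-wide list/delete permissions.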
There is now a related TEP : TEP-0052 /cc @wlynch
Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale with a justification.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close with a justification.
If this issue should be exempted, mark the issue as frozen with /lifecycle frozen with a justification.
/lifecycle stale
Send feedback to tektoncd/plumbing.
/lifecycle frozen
/assign @pradeepitm12
Just to add my two cents. There is already a useful TTL-mechanism operator: https://github.com/hjacobs/kube-janitor. We tried it, but the point is that TTL is a pretty bad helper here: even if I set the TTL to 2-3 days, my cluster gets flooded very, very fast, because I have thousands of pipelines and they can run every hour. My favorite strategy for this situation is the way OpenShift BuildConfigs do it: you can set .spec.successfulBuildsHistoryLimit and .spec.failedBuildsHistoryLimit and always keep the important run info alive, regardless of how many builds you have and how frequently they run. We already have an "operator" written in Python that keeps just the last x successful/failed runs for each pipeline, but I would like this to be implemented as part of Tekton.
Also, I would like an option like .spec.cleanRunObjectsOnDelete, to be sure all run objects are removed on pipeline deletion (that could be implemented by setting ownerReferences on run objects, pointing to the original Task/Pipeline: https://kubernetes.io/docs/concepts/workloads/controllers/garbage-collection/#owners-and-dependents). We already have this feature implemented in our own mutating webhook, but in my opinion this is something the Tekton operator should also do itself.
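To illustrate the ownerReference idea: if each run carries an ownerReference pointing at its Pipeline, Kubernetes garbage collection deletes the run automatically when the Pipeline is deleted. A minimal sketch of the metadata patch a webhook or controller might apply; the helper name is an assumption, and actually submitting the patch to the API server is out of scope:

```python
def owner_reference_patch(pipeline_object):
    """Build a metadata patch that makes `pipeline_object` the owner of a run.

    apiVersion, kind, name, and uid are the fields Kubernetes requires in an
    ownerReference; the uid must come from the live Pipeline object so the
    reference survives delete-and-recreate of a same-named Pipeline.
    """
    return {
        'metadata': {
            'ownerReferences': [{
                'apiVersion': pipeline_object['apiVersion'],
                'kind': pipeline_object['kind'],
                'name': pipeline_object['metadata']['name'],
                'uid': pipeline_object['metadata']['uid'],
            }]
        }
    }
```

Note that ownerReferences only work within a single namespace, which fits here since runs live alongside their Pipeline.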
> We already have some "operator" written in python

@dbazhal could you share it with us? It would be a great reference for us.
@pradeepitm12 @vdemeester
> We already have some "operator" written in python. @dbazhal could you share it with us? It would be a great reference for us.

Not sure if I can do that: it's corporate, it's in async Python, and it's tied to our inner CRDs.
The main logic is something like this:
```python
from collections import defaultdict
from threading import Lock
from typing import Dict

from dateutil import parser
import jmespath


class GC:
    def __init__(self, kube_client):
        # pipeline key ("namespace/pipeline") -> run name -> run object
        self.runs_per_pipeline: Dict[str, Dict[str, dict]] = defaultdict(dict)
        self.client = kube_client
        self.pipeline_run_resource = kube_client.resources.get(
            api_version='tekton.dev/v1alpha1', kind='PipelineRun'
        )
        self.struct_lock = Lock()
        self.success_limit = 3
        self.failed_limit = 1

    def run(self):
        for event in self.pipeline_run_resource.watch(namespace=''):
            if event['type'] not in ('MODIFIED', 'ADDED'):
                continue
            run_object = event['object']
            pipeline_name = run_object['spec'].get('pipelineRef', {}).get('name')
            if not pipeline_name:
                continue
            namespace = run_object['metadata']['namespace']
            pipeline_key = '{}/{}'.format(namespace, pipeline_name)
            run_name = run_object['metadata']['name']
            # Lock is itself a context manager; clean() takes the lock too,
            # so it must be called after this block releases it.
            with self.struct_lock:
                self.runs_per_pipeline[pipeline_key][run_name] = run_object
            self.clean(namespace, pipeline_name)

    def clean(self, namespace, pipeline_name):
        to_delete = list()
        succeeded = list()
        failed = list()
        pipeline_key = '{}/{}'.format(namespace, pipeline_name)
        run_objects = list(self.runs_per_pipeline[pipeline_key].values())
        for single_run in run_objects:
            if jmespath.search("status.conditions[?type == 'Succeeded']|[?status == 'False']", single_run):
                failed.append(single_run)
            elif jmespath.search("status.conditions[?type == 'Succeeded']|[?status == 'True']", single_run):
                succeeded.append(single_run)
        # Keep only the newest N of each bucket; everything older gets deleted.
        if len(succeeded) > self.success_limit:
            to_delete.extend(
                sorted(
                    succeeded, key=lambda one_run: parser.isoparse(one_run['metadata']['creationTimestamp'])
                )[:-self.success_limit]
            )
        if len(failed) > self.failed_limit:
            to_delete.extend(
                sorted(
                    failed, key=lambda one_run: parser.isoparse(one_run['metadata']['creationTimestamp'])
                )[:-self.failed_limit]
            )
        for to_delete_run in to_delete:
            run_name = to_delete_run['metadata']['name']
            with self.struct_lock:
                del self.runs_per_pipeline[pipeline_key][run_name]
            self.pipeline_run_resource.delete(namespace=namespace, name=run_name)
```
Not sure if that can help.
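As a side note, the jmespath expressions in `clean()` above can be replaced with plain Python if you want one dependency fewer. A minimal equivalent of the failed-run check, assuming the standard shape of the `Succeeded` condition on a PipelineRun status:

```python
def is_failed(run_object):
    """Plain-Python equivalent of
    jmespath "status.conditions[?type == 'Succeeded']|[?status == 'False']":
    true when a Succeeded condition exists with status "False".
    """
    conditions = run_object.get('status', {}).get('conditions', [])
    return any(
        c.get('type') == 'Succeeded' and c.get('status') == 'False'
        for c in conditions
    )
```

The succeeded check is the same with `'True'` in place of `'False'`; runs with no `Succeeded` condition yet fall into neither bucket, which matches the original code's behavior of ignoring in-flight runs.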
Also, I remembered a nice example of how run object pruning should be handled: the OpenShift build operator handling build cleanup for BuildConfigs (a direct analogy to Pipelines and PipelineRuns). https://docs.okd.io/latest/cicd/builds/advanced-build-operations.html#builds-build-pruning_advanced-build-operations
This is how the build (run) controller prunes its objects: https://github.com/openshift/openshift-controller-manager/blob/461fe64e30847a5ae9c361500d7434d2f1756de2/pkg/build/controller/common/util.go#L40
And here is where it's triggered:
- on build (run) handling: https://github.com/openshift/openshift-controller-manager/blob/461fe64e30847a5ae9c361500d7434d2f1756de2/pkg/build/controller/build/build_controller.go#L573
- on build (run) completion: https://github.com/openshift/openshift-controller-manager/blob/461fe64e30847a5ae9c361500d7434d2f1756de2/pkg/build/controller/build/build_controller.go#L1560
- on buildconfig (pipeline/task) processing: https://github.com/openshift/openshift-controller-manager/blob/461fe64e30847a5ae9c361500d7434d2f1756de2/pkg/build/controller/buildconfig/buildconfig_controller.go#L102
I also have many PipelineRuns that need to be cleaned up. Do I still have to clean them up manually every time for now?
We implemented a feature called pruner a while ago; I believe the requirement from this issue is addressed there. Please feel free to reopen the issue if something still needs to be addressed.