
Automatic cleaning of *Run objects

Open chmouel opened this issue 5 years ago • 11 comments

Feature request

Users are wondering how Tekton can automatically clean up old TaskRun/PipelineRun objects. Currently they stay inside the cluster and it is up to the users to clean them up.

https://github.com/tektoncd/pipeline/issues/2856 https://github.com/tektoncd/pipeline/issues/1334

The consensus is to wait for this Kubernetes feature, hopefully 🤞🏽 coming in 1.21:

https://github.com/kubernetes/enhancements/issues/592

There is a workaround: use the --keep option of the tkn delete command:

% tkn pipelinerun delete --help|grep keep
      --keep int                      Keep n most recent number of PipelineRuns

Run this periodically with a CronJob.
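For illustration, a minimal sketch of such a CronJob. The names, namespace, service account, and image below are assumptions, not part of any Tekton release; adjust the tkn flags to your tkn version (see the --help output above).

apiVersion: batch/v1              # batch/v1beta1 on clusters older than 1.21
kind: CronJob
metadata:
  name: pipelinerun-cleaner       # hypothetical name
  namespace: my-pipelines         # hypothetical namespace containing Tekton objects
spec:
  schedule: "0 3 * * *"           # once a day at 03:00
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: pipelinerun-cleaner   # needs RBAC to list/delete PipelineRuns
          restartPolicy: Never
          containers:
            - name: tkn
              image: my-registry/tkn:latest         # assumed: any image that contains the tkn CLI
              command: ["tkn", "pipelinerun", "delete", "--keep=5", "-f"]

The hypothetical pipelinerun-cleaner service account would need a Role/RoleBinding allowing it to list and delete PipelineRuns in that namespace.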

Another approach is a proposed (not yet merged) Task:

https://github.com/tektoncd/catalog/pull/572

It can be installed by the namespace admin to clean up objects by age or by number.

The biggest disadvantage of those solutions (unless TTL-based expiry of objects can be set up in the Tekton controllers) is that an admin needs to set them up in every namespace containing Tekton objects.

I would propose having this feature in the Operator.

The idea would be:

  • Let the admin configure expiration globally, by age (e.g. keep only the last five days) or by number (e.g. keep only the last five runs).
  • Let a ConfigMap in a namespace override the global policy (see the sketch below).

The Operator would take care of cleaning up the old runs belonging to a pipeline according to that policy.
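For illustration, the per-namespace override could be as simple as a ConfigMap the Operator looks for. The name and keys below are hypothetical, not an existing API:

apiVersion: v1
kind: ConfigMap
metadata:
  name: tekton-prune-config       # hypothetical name the Operator would watch for
  namespace: team-a               # overrides the global policy for this namespace only
data:
  keep: "5"                       # keep only the five most recent runs per pipeline
  # alternatively, expire by age rather than by count:
  # keep-since: "120h"            # keep only runs younger than five days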

Use case

As a user, I have too many runs piling up on a cluster and I need to clean them up. Setting up cron jobs in every namespace where I have a pipeline or task can get tedious; I want to automate this for all my users and easily change the policy at the cluster level.

Implementation ideas

There are probably a few ways to achieve this that we may want to discuss:

  1. Install a CronJob in every namespace that has a Tekton object. The cron checks whether there is a configuration in the current namespace or globally, then runs the cleanup task (which could be something like https://github.com/tektoncd/catalog/pull/572) according to those values.

  2. Run a global cron in the operator namespace that iterates over all namespaces looking for pipelines/tasks, checks for a configuration in each namespace (or falls back to the global configuration), and applies a cleanup task to that namespace accordingly.

chmouel avatar Jan 08 '21 13:01 chmouel

There is now a related TEP: TEP-0052. /cc @wlynch

vdemeester avatar Feb 25 '21 05:02 vdemeester

Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale with a justification. Stale issues rot after an additional 30d of inactivity and eventually close. If this issue is safe to close now please do so with /close with a justification. If this issue should be exempted, mark the issue as frozen with /lifecycle frozen with a justification.

/lifecycle stale

Send feedback to tektoncd/plumbing.

tekton-robot avatar May 26 '21 06:05 tekton-robot

/lifecycle frozen

nikhil-thomas avatar May 26 '21 06:05 nikhil-thomas

/assign @pradeepitm12

nikhil-thomas avatar May 26 '21 06:05 nikhil-thomas

Just to add my two cents: there is already a useful TTL-based operator, https://github.com/hjacobs/kube-janitor, and we tried it, but TTL is a pretty bad helper here. Even if I set the TTL to 2-3 days, my cluster gets flooded very fast: I have thousands of pipelines, and they can run every hour. My favorite strategy for this situation is the way OpenShift BuildConfigs do it: you can set .spec.successfulBuildsHistoryLimit and .spec.failedBuildsHistoryLimit and keep all the important run information alive, regardless of how many builds you have and how frequently they run. We already have an "operator" written in Python that keeps just the last x successful/failed runs for each pipeline, but I would like this to be implemented as part of Tekton.
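For reference, this is roughly how those history limits look on an OpenShift BuildConfig (only the relevant fields shown; the name is an example):

apiVersion: build.openshift.io/v1
kind: BuildConfig
metadata:
  name: my-app                        # example name
spec:
  successfulBuildsHistoryLimit: 5     # keep the 5 most recent successful builds
  failedBuildsHistoryLimit: 2         # keep the 2 most recent failed builds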

dbazhal avatar Jun 01 '21 13:06 dbazhal

Also, I would like an option like ".spec.cleanRunObjectsOnDelete", to be sure all run objects are removed on pipeline deletion. This could be implemented by setting ownerReferences on run objects pointing to the original task/pipeline (https://kubernetes.io/docs/concepts/workloads/controllers/garbage-collection/#owners-and-dependents). We already have such a feature implemented in our own mutating webhook, but in my opinion this is something the Tekton operator should do itself.
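A minimal sketch of that ownerReferences approach on a PipelineRun, assuming a webhook or operator fills in the owner's UID (all values here are illustrative):

apiVersion: tekton.dev/v1beta1
kind: PipelineRun
metadata:
  name: my-pipeline-run-x7k2q          # illustrative
  ownerReferences:
    - apiVersion: tekton.dev/v1beta1
      kind: Pipeline
      name: my-pipeline                # the Pipeline this run was created from
      uid: 1b2c3d4e-0000-0000-0000-000000000000   # must match the Pipeline's actual UID
      controller: false
      blockOwnerDeletion: false
spec:
  pipelineRef:
    name: my-pipeline

With that in place, deleting the Pipeline lets the Kubernetes garbage collector cascade-delete its runs.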

dbazhal avatar Jun 01 '21 13:06 dbazhal

@dbazhal you mentioned you already have an "operator" written in Python. Could you share it with us? It would be a great reference for us.

nikhil-thomas avatar Jun 03 '21 09:06 nikhil-thomas

@pradeepitm12 @vdemeester

nikhil-thomas avatar Jun 03 '21 09:06 nikhil-thomas

Regarding sharing the "operator" we have written in Python: I'm not sure I can do that. It's corporate, it's written in async Python, and it's tied to our internal CRDs.

The main logic is something like this:

from collections import defaultdict
from threading import Lock
from typing import Dict
from dateutil import parser

import jmespath


class GC:
    def __init__(self, kube_client):
        # Runs indexed as {"<namespace>/<pipeline name>": {"<run name>": run_object}}.
        self.runs_per_pipeline: Dict[str, Dict[str, dict]] = defaultdict(dict)
        self.client = kube_client
        self.pipeline_run_resource = kube_client.resources.get(api_version='tekton.dev/v1alpha1', kind='PipelineRun')
        self.struct_lock = Lock()

        self.success_limit = 3  # keep the 3 most recent successful runs per pipeline
        self.failed_limit = 1   # keep the most recent failed run per pipeline

    def run(self):
        # Watch PipelineRun events cluster-wide, index them per namespace/pipeline,
        # then prune that pipeline's history.
        for event in self.pipeline_run_resource.watch(namespace=''):
            if event['type'] not in ('MODIFIED', 'ADDED'):
                continue
            run_object = event['object']
            pipeline_name = run_object['spec'].get('pipelineRef', {}).get('name')
            if not pipeline_name:
                continue
            namespace = run_object['metadata']['namespace']
            pipeline_key = '{}/{}'.format(namespace, pipeline_name)
            run_name = run_object['metadata']['name']
            with self.struct_lock:
                self.runs_per_pipeline[pipeline_key][run_name] = run_object
            self.clean(namespace, pipeline_name)

    def clean(self, namespace, pipeline_name):
        # Keep only the most recent successful/failed runs for this pipeline and
        # queue the rest for deletion, oldest first.
        to_delete = list()
        succeeded = list()
        failed = list()
        pipeline_key = '{}/{}'.format(namespace, pipeline_name)
        run_objects = list(self.runs_per_pipeline[pipeline_key].values())
        for single_run in run_objects:
            # Classify each run by its "Succeeded" condition.
            if jmespath.search("status.conditions[?type == 'Succeeded']|[?status == 'False']", single_run):
                failed.append(single_run)
            elif jmespath.search("status.conditions[?type == 'Succeeded']|[?status == 'True']", single_run):
                succeeded.append(single_run)

        if len(succeeded) > self.success_limit:
            to_delete.extend(
                sorted(
                    succeeded, key=lambda one_run: parser.isoparse(one_run['metadata']['creationTimestamp'])
                )[:-self.success_limit]
            )
        if len(failed) > self.failed_limit:
            to_delete.extend(
                sorted(
                    failed, key=lambda one_run: parser.isoparse(one_run['metadata']['creationTimestamp'])
                )[:-self.failed_limit]
            )

        for to_delete_run in to_delete:
            run_name = to_delete_run['metadata']['name']
            with self.struct_lock:
                del self.runs_per_pipeline[pipeline_key][run_name]
            self.pipeline_run_resource.delete(namespace=namespace, name=run_name)

Not sure if that helps.

dbazhal avatar Jun 10 '21 13:06 dbazhal

Also, I remembered a nice example of how pruning of run objects should be handled: the OpenShift build controller cleaning up Builds for BuildConfigs (a straight analogy to pipelineruns and pipelines). https://docs.okd.io/latest/cicd/builds/advanced-build-operations.html#builds-build-pruning_advanced-build-operations

This is how the builds (runs) belonging to a config are pruned: https://github.com/openshift/openshift-controller-manager/blob/461fe64e30847a5ae9c361500d7434d2f1756de2/pkg/build/controller/common/util.go#L40

And here is where it is triggered:

on build (run) handling: https://github.com/openshift/openshift-controller-manager/blob/461fe64e30847a5ae9c361500d7434d2f1756de2/pkg/build/controller/build/build_controller.go#L573

on build (run) completion: https://github.com/openshift/openshift-controller-manager/blob/461fe64e30847a5ae9c361500d7434d2f1756de2/pkg/build/controller/build/build_controller.go#L1560

on buildconfig (pipeline/task) processing: https://github.com/openshift/openshift-controller-manager/blob/461fe64e30847a5ae9c361500d7434d2f1756de2/pkg/build/controller/buildconfig/buildconfig_controller.go#L102

dbazhal avatar Jul 08 '21 12:07 dbazhal

I also have many PipelineRuns that need to be cleaned up. Do I still need to clean them up manually every time?

linmuqin avatar Mar 17 '22 10:03 linmuqin

We implemented a feature called pruner a while ago, and I believe the requirement from this issue is addressed there. Please feel free to reopen the issue if something still needs to be addressed here.
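For anyone landing here later, a sketch of that pruner configuration on the TektonConfig custom resource (field names may vary between Operator versions, so check the Operator documentation for the current shape):

apiVersion: operator.tekton.dev/v1alpha1
kind: TektonConfig
metadata:
  name: config
spec:
  pruner:
    resources:
      - pipelinerun
      - taskrun
    keep: 100                 # keep the 100 most recent runs
    schedule: "0 8 * * *"     # run the prune job daily at 08:00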

jkandasa avatar Jun 06 '23 15:06 jkandasa