
Triggering long-running background jobs when events occur

Open pierDipi opened this issue 1 year ago • 9 comments

Problem

Event processing combined with a Knative Service is usually expected to complete in a relatively short period of time (minutes), because the HTTP connection must stay open for the duration of the processing; otherwise the service is scaled down. Keeping long-running connections open also increases the chance of failure, in which case the request is retried and the processing must restart.

This limitation is not ideal, so providing a resource (JobSink) that triggers a long-running Job when an event occurs could be a good alternative.

Example JobSink API:

apiVersion: sinks.knative.dev/v1alpha1
kind: JobSink
metadata:
  name: job-sink-success
spec:
  job:
    apiVersion: batch/v1
    kind: Job
    spec:
      completions: 12
      parallelism: 3
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: main
              image: docker.io/library/bash:5
              command: [ "bash" ] 
              args:
                - -c
                - echo "Hello world!" && sleep 5
      backoffLimit: 6
      podFailurePolicy:
        rules:
          - action: FailJob
            onExitCodes:
              containerName: main      # optional
              operator: In             # one of: In, NotIn
              values: [ 42 ]
          - action: Ignore             # one of: Ignore, FailJob, Count
            onPodConditions:
              - type: DisruptionTarget   # indicates Pod disruption

Possible Client UX

We could provide a kn client UX that takes a Kubernetes Job and turns it into a JobSink, since creating a Job might be easier with existing IDE tooling:

kn jobsink create --from-job-file=job.yaml -oyaml > jobsink.yaml
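The transformation itself is mechanical. A minimal sketch in Python, assuming the manifests are already parsed into dicts and following the example JobSink API shown above (the real kn implementation may validate and normalize further):

```python
def job_to_jobsink(job_manifest: dict, name: str) -> dict:
    # Wrap a parsed Kubernetes Job manifest into a JobSink manifest,
    # mirroring the example JobSink API above. Sketch only.
    return {
        "apiVersion": "sinks.knative.dev/v1alpha1",
        "kind": "JobSink",
        "metadata": {"name": name},
        "spec": {
            "job": {
                "apiVersion": job_manifest.get("apiVersion", "batch/v1"),
                "kind": job_manifest.get("kind", "Job"),
                "spec": job_manifest["spec"],
            }
        },
    }
```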

Persona: Which persona is this feature for?

Developers

Exit Criteria A measurable (binary) test that would indicate that the problem has been resolved.

Time Estimate (optional): How many developer-days do you think this may take to resolve?

10

Additional context (optional) Add any other context about the feature request here.

pierDipi avatar Feb 28 '24 11:02 pierDipi

/assign

fsedano avatar Mar 22 '24 02:03 fsedano

@pierDipi as discussed in Kubecon -- This is my main use case, so I'd like to work on solving this issue.

fsedano avatar Mar 22 '24 02:03 fsedano

Thanks @fsedano feel free to go ahead, here is my PoC branch on my fork https://github.com/pierDipi/eventing/tree/jobsink

pierDipi avatar Mar 25 '24 10:03 pierDipi

/triage accepted

pierDipi avatar Mar 25 '24 10:03 pierDipi

Before working on this, I think we should fix an underlying issue being discussed here:

https://github.com/knative/serving/issues/13075

Because of this, the premise "as it requires the HTTP connection to stay open as otherwise the service is scaled down" is not 100% true -- keeping the HTTP connection open won't guarantee the pod is not killed.

fsedano avatar Apr 09 '24 17:04 fsedano

@fsedano I think the idea of this proposal was to create a k8s Job resource, instead of a Knative service (correct me if I'm wrong @pierDipi ). That way, this limitation would not be an issue

Cali0707 avatar Apr 11 '24 19:04 Cali0707

Yes, that's what the PoC does as well: it creates a Job for each received event. The received event is mounted into each Job as a JSON file volume.
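From the Job container's perspective, consuming the mounted event is just reading a JSON file. A minimal sketch; the mount path below is hypothetical, as the actual location is decided by the JobSink implementation:

```python
import json

# Hypothetical mount path; the actual location is decided by the
# JobSink implementation, not by this sketch.
EVENT_PATH = "/etc/jobsink-event/event"

def read_event(path: str = EVENT_PATH) -> dict:
    # Parse the CloudEvent that was mounted into the Job's pod
    # as a JSON file volume.
    with open(path) as f:
        return json.load(f)

if __name__ == "__main__":
    event = read_event()
    print(event.get("type"), event.get("id"))
```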

pierDipi avatar Apr 12 '24 14:04 pierDipi

@fsedano, Are you still actively working on this feature? If so, I'd like to know your timeframe.

I see a lot of value in this feature and am trying to determine whether it is worth waiting for its release or writing a one-off job handler for a current use case.

mroberts91 avatar May 20 '24 13:05 mroberts91

@mroberts91 before working on this I want to handle the regular UX described here: https://github.com/knative/serving/issues/13075 (which is impacting my use case)

Would that also help you? Or do you need the job route as discussed here?

fsedano avatar May 22 '24 08:05 fsedano

I've added JobSink in the PR here https://github.com/knative/eventing/pull/7954 and it will be available in the next release

pierDipi avatar Jun 04 '24 11:06 pierDipi

Documentation PR: https://github.com/knative/docs/pull/6005

@fsedano @mroberts91 it would be great if you can review the JobSink documentation PR as it can help us to get early feedback, thanks!

pierDipi avatar Jun 04 '24 16:06 pierDipi

@pierDipi @fsedano great work!

Two questions I had:

  1. It doesn't seem like there is functionality to set min_replicas? I understand this is mimicking the Job primitive, so min replicas doesn't really exist, but it would still be useful for jobs to start straight away, since it takes time to instantiate them
  2. Do you have any idea when the pre-release will go out? Releases seem to happen roughly every 2 weeks

milo157 avatar Jun 12 '24 13:06 milo157

Hi @milo157

It doesn't seem like there is functionality to set min_replicas? I understand this is mimicking the Job primitive, so min replicas doesn't really exist, but it would still be useful for jobs to start straight away, since it takes time to instantiate them

This is interesting, would you mind opening an issue with more details? I'm not sure I fully follow your use case or where you see the latency you've hinted at with

it takes time to instantiate

The way JobSink works in the first release is that there is an always-running JobSink dispatcher (in the knative-eventing namespace) that receives all requests for all JobSinks and creates a Job as soon as an event is received, so I don't expect very high latency

Do you have any idea when the pre-release will go out? Seems releases happened ~every 2 weeks

We just released 1.15, which includes JobSink. We're currently working on the release notes and the announcement, but you can follow the regular installation instructions using the 1.15 artifacts to try it out in the meantime

pierDipi avatar Jul 24 '24 07:07 pierDipi

Hi @pierDipi

Sure, I can make a new issue, however before I do that I just want to confirm the functionality is not available.

The use case is a long-running task with a high initialisation time (think loading Python imports, a large GPU-based model, etc.). In order to avoid the high initialisation time and get feedback from your long-running task sooner, you want a pod ready and waiting that has already initialised, i.e. min_replica: 1.

Let me know if I need to clarify further.

milo157 avatar Jul 24 '24 12:07 milo157