
File Descriptor Leak in PingSource

Open liuyuxuan0723 opened this issue 1 year ago • 4 comments

Problem Description:

I followed the documentation https://knative.dev/docs/eventing/sources/ping-source/ to deploy and create several PingSources to trigger my functions. After running for a few days, all triggers stopped working. The logs from pingsource-mt-adapter are as follows:

2024-09-09 10:03:03	
{"level":"error","ts":"2024-09-09T02:03:03.737Z","logger":"pingsource-mt-adapter","caller":"mtping/runner.go:157","msg":"failed to send cloudevent result: {result 26 0  Post \"http://router.fission-system.svc.cluster.local/fission-function/faasprod-dev/dev-gen-task\": dial tcp 172.16.53.129:80: connect: cannot assign requested address (5x)} {source 15 0 /apis/v1/namespaces/faasprod-dev/pingsources/fnt-21321-8be55ff981d6e3e8 <nil>} {target 15 0 http://router.fission-system.svc.cluster.local/fission-function/faasprod-dev/dev-gen-task <nil>} {id 15 0 1118f752-e468-4fa4-8971-11b634b92413 <nil>}","commit":"cf0f9d5","stacktrace":"knative.dev/eventing/pkg/adapter/mtping.(*cronJobsRunner).cronTick.func1\n\tknative.dev/eventing/pkg/adapter/mtping/runner.go:157\ngithub.com/robfig/cron/v3.FuncJob.Run\n\tgithub.com/robfig/cron/[email protected]/cron.go:136\ngithub.com/robfig/cron/v3.(*Cron).startJob.func1\n\tgithub.com/robfig/cron/[email protected]/cron.go:312"}

2024-09-09 10:03:03	
{"level":"error","ts":"2024-09-09T02:03:03.585Z","logger":"pingsource-mt-adapter","caller":"mtping/runner.go:157","msg":"failed to send cloudevent result: {result 26 0  Post \"http://router.fission-system.svc.cluster.local/fission-function/faasprod-dev/push-shadow-vin-down\": dial tcp 172.16.53.129:80: connect: cannot assign requested address (5x)} {source 15 0 /apis/v1/namespaces/faasprod-dev/pingsources/fnt-21248-8be55ff981d6e3e8 <nil>} {target 15 0 http://router.fission-system.svc.cluster.local/fission-function/faasprod-dev/push-shadow-vin-down <nil>} {id 15 0 5e1acd95-26a3-44f1-a25f-2630609bc3a4 <nil>}","commit":"cf0f9d5","stacktrace":"knative.dev/eventing/pkg/adapter/mtping.(*cronJobsRunner).cronTick.func1\n\tknative.dev/eventing/pkg/adapter/mtping/runner.go:157\ngithub.com/robfig/cron/v3.FuncJob.Run\n\tgithub.com/robfig/cron/[email protected]/cron.go:136\ngithub.com/robfig/cron/v3.(*Cron).startJob.func1\n\tgithub.com/robfig/cron/[email protected]/cron.go:312"}
2024-09-09 10:03:03	
{"level":"error","ts":"2024-09-09T02:03:03.585Z","logger":"pingsource-mt-adapter","caller":"mtping/runner.go:157","msg":"failed to send cloudevent result: {result 26 0  Post \"http://router.fission-system.svc.cluster.local/fission-function/faasprod-dev/slipping-point/\": dial tcp 172.16.53.129:80: connect: cannot assign requested address (5x)} {source 15 0 /apis/v1/namespaces/faasprod-dev/pingsources/fnt-21228-2195be836486b1a6 <nil>} {target 15 0 http://router.fission-system.svc.cluster.local/fission-function/faasprod-dev/slipping-point/ <nil>} {id 15 0 158a4a33-56f4-4929-ad32-144035504ee8 <nil>}","commit":"cf0f9d5","stacktrace":"knative.dev/eventing/pkg/adapter/mtping.(*cronJobsRunner).cronTick.func1\n\tknative.dev/eventing/pkg/adapter/mtping/runner.go:157\ngithub.com/robfig/cron/v3.FuncJob.Run\n\tgithub.com/robfig/cron/[email protected]/cron.go:136\ngithub.com/robfig/cron/v3.(*Cron).startJob.func1\n\tgithub.com/robfig/cron/[email protected]/cron.go:312"}

Monitoring Information:

  • File Descriptor Monitoring: image-20240909112639239
  • CPU Monitoring: image-20240909112739582
  • Memory Monitoring: image-20240909112717317

Environment:

  • Knative release version: 1.11.2
  • k8s version: 1.24.0

Expected Behavior:

The pingsource-mt-adapter should run normally, and there should be no leaks.

liuyuxuan0723 avatar Sep 09 '24 03:09 liuyuxuan0723

I cannot reproduce the problem with "normally" behaving functions.

Trying it with the current Fission 1.20.4 and Eventing 1.11, installed from the Serverless Operator 1.32.2, on OpenShift:

Creating a hello-js function from the Fission examples:

fission env create --name nodejs --image ghcr.io/fission/node-env
curl -LO https://raw.githubusercontent.com/fission/examples/main/nodejs/hello.js
fission function create --name hello-js --env nodejs --code hello.js
fission route create --function hello-js --url /hello-js --name hello-js --method GET --method POST

and having several PingSources like

apiVersion: sources.knative.dev/v1
kind: PingSource
metadata:
  name: test-ping-source-1
  namespace: foobar
spec:
  contentType: application/json
  data: '{"message": "Hello world from 1!"}'
  schedule: '*/1 * * * *'
  sink:
    uri: http://router.fission.svc.cluster.local/hello-js
  timezone: Asia/Shanghai

shows only a stable number of open TCP connections.

I can reproduce the problem, kind of, if I add a very long delay to the function

module.exports = async function(context) {
    // block for 518400000 ms = 6 days before responding
    await new Promise(r => setTimeout(r, 518400000));
    return {
        status: 200,
        body: "hello, world!\n"
    };
}

and setting --fntimeout on the Fission function to override the default 60s (604800s = 7 days)

fission function create --name hello-js --env nodejs --code hello.js --fntimeout 604800

Making the function never return shows that the PingSource has no timeout of its own: the number of open TCP connections rises steadily if the sink keeps connections open indefinitely and never returns an HTTP response.

Can you see the number of open files rising on the Fission side as well? (I'd expect a similar number of open files on the Fission router.)
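One way to check this is to count the entries under /proc/<pid>/fd in the router container. A sketch — the namespace and deployment names below are assumptions based on this thread's setup, so adjust them to your install:

```shell
# Inside the Fission router pod (names assumed from this thread):
#   kubectl exec -n fission-system deploy/router -- sh -c 'ls /proc/1/fd | wc -l'
# The same /proc technique, demonstrated locally against the current shell:
ls /proc/self/fd | wc -l
```

Watching this number over a few trigger intervals shows whether the router is accumulating descriptors in step with the adapter.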

maschmid avatar Sep 13 '24 19:09 maschmid

@maschmid I have checked the file descriptors of the Fission router and see the same issue there. When the trigger interval is shorter than the function execution time, the number of open file descriptors does increase continuously.

Fission router monitoring: image

Currently, I have deployed about 4 or 5 functions, all with a timeout of 60 seconds, and the trigger interval is once per minute. I haven't observed any long-running executions or cases where an HTTP response is never returned; however, quite a few functions returned a 504 due to execution timeouts. Could that be causing the file descriptor count to keep rising?

Here is the pingsource YAML:

apiVersion: sources.knative.dev/v1
kind: PingSource
metadata:
  annotations:
    sources.knative.dev/creator: system:serviceaccount:fission-system:lifaas
    sources.knative.dev/lastModifier: system:serviceaccount:fission-system:lifaas
  creationTimestamp: "2024-09-05T10:02:18Z"
  generation: 1
  labels:
    functionName: dev-gen-task
  name: fnt-21321-8be55ff981d6e3e8
  namespace: faasprod-dev
  resourceVersion: "6207907587"
  uid: 8b60514e-6366-4ab5-89b5-8ca186c097c1
spec:
  contentType: application/json
  schedule: 0/1 * * * *
  sink:
    uri: http://router.fission-system.svc.cluster.local/fission-function/faasprod-dev/dev-gen-task
status:
  ceAttributes:
  - source: /apis/v1/namespaces/faasprod-dev/pingsources/fnt-21321-8be55ff981d6e3e8
    type: dev.knative.sources.ping
  conditions:
  - lastTransitionTime: "2024-09-10T09:48:38Z"
    status: "True"
    type: Deployed
  - lastTransitionTime: "2024-09-10T09:48:38Z"
    status: "True"
    type: Ready
  - lastTransitionTime: "2024-09-05T10:02:18Z"
    status: "True"
    type: SinkProvided
  observedGeneration: 1
  sinkUri: http://router.fission-system.svc.cluster.local/fission-function/faasprod-dev/dev-gen-task

liuyuxuan0723 avatar Sep 18 '24 02:09 liuyuxuan0723

@maschmid Can I control the timeout or retry policy for the pingsource? I suspect that the retries might be causing the leak.

liuyuxuan0723 avatar Sep 18 '24 02:09 liuyuxuan0723

I am not aware of any mechanism there; AFAICT there is no timeout on the context that sends the events. @pierDipi any thoughts? We should probably guard the dispatchers against a sink that keeps connections open indefinitely and never replies.
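The guard described above amounts to bounding every dispatch with a deadline, so a never-replying sink cannot pin a connection (and its file descriptor) forever. A minimal shell sketch of the idea — this is an illustration of the principle, not the actual Go dispatcher code:

```shell
# Bound a potentially never-returning call with a deadline.
# 'sleep 100' stands in for a request to a sink that never replies;
# curl's equivalent flag would be --max-time.
timeout 2 sleep 100
echo "exit=$?"   # 124 means the deadline fired and the call was cut off
```

With such a bound in the dispatcher, a hung sink would produce one timeout error per tick instead of an ever-growing pool of open connections.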

maschmid avatar Sep 30 '24 10:09 maschmid

There is an environment variable with the timeout here: https://github.com/knative/eventing/blob/bff7b031831fd7968d86e99ae89d79cf3f088eed/config/core/deployments/pingsource-mt-adapter.yaml#L79-L80 — maybe setting a timeout through it can help?
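If that variable controls the sink timeout, it could be set on the adapter Deployment. A hedged sketch — the variable name is an assumption based on the linked manifest, and the value's unit should be verified against your eventing release:

```yaml
# Excerpt of the pingsource-mt-adapter Deployment in knative-eventing.
# K_SINK_TIMEOUT is assumed to be the variable at the linked manifest lines;
# the unit (seconds assumed here) should be checked for your release.
env:
  - name: K_SINK_TIMEOUT
    value: "30"
```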

pierDipi avatar Dec 05 '24 16:12 pierDipi

This issue is stale because it has been open for 90 days with no activity. It will automatically close after 30 more days of inactivity. Reopen the issue with /reopen. Mark the issue as fresh by adding the comment /remove-lifecycle stale.

github-actions[bot] avatar Mar 06 '25 01:03 github-actions[bot]