File Descriptor Leak in PingSource
Problem Description:
I followed the documentation at https://knative.dev/docs/eventing/sources/ping-source/ to create several PingSources that trigger my functions. After running for a few days, all triggers stopped working. The logs from the pingsource-mt-adapter are as follows:
{"level":"error","ts":"2024-09-09T02:03:03.737Z","logger":"pingsource-mt-adapter","caller":"mtping/runner.go:157","msg":"failed to send cloudevent result: {result 26 0 Post \"http://router.fission-system.svc.cluster.local/fission-function/faasprod-dev/dev-gen-task\": dial tcp 172.16.53.129:80: connect: cannot assign requested address (5x)} {source 15 0 /apis/v1/namespaces/faasprod-dev/pingsources/fnt-21321-8be55ff981d6e3e8 <nil>} {target 15 0 http://router.fission-system.svc.cluster.local/fission-function/faasprod-dev/dev-gen-task <nil>} {id 15 0 1118f752-e468-4fa4-8971-11b634b92413 <nil>}","commit":"cf0f9d5","stacktrace":"knative.dev/eventing/pkg/adapter/mtping.(*cronJobsRunner).cronTick.func1\n\tknative.dev/eventing/pkg/adapter/mtping/runner.go:157\ngithub.com/robfig/cron/v3.FuncJob.Run\n\tgithub.com/robfig/cron/[email protected]/cron.go:136\ngithub.com/robfig/cron/v3.(*Cron).startJob.func1\n\tgithub.com/robfig/cron/[email protected]/cron.go:312"}
{"level":"error","ts":"2024-09-09T02:03:03.585Z","logger":"pingsource-mt-adapter","caller":"mtping/runner.go:157","msg":"failed to send cloudevent result: {result 26 0 Post \"http://router.fission-system.svc.cluster.local/fission-function/faasprod-dev/push-shadow-vin-down\": dial tcp 172.16.53.129:80: connect: cannot assign requested address (5x)} {source 15 0 /apis/v1/namespaces/faasprod-dev/pingsources/fnt-21248-8be55ff981d6e3e8 <nil>} {target 15 0 http://router.fission-system.svc.cluster.local/fission-function/faasprod-dev/push-shadow-vin-down <nil>} {id 15 0 5e1acd95-26a3-44f1-a25f-2630609bc3a4 <nil>}","commit":"cf0f9d5","stacktrace":"knative.dev/eventing/pkg/adapter/mtping.(*cronJobsRunner).cronTick.func1\n\tknative.dev/eventing/pkg/adapter/mtping/runner.go:157\ngithub.com/robfig/cron/v3.FuncJob.Run\n\tgithub.com/robfig/cron/[email protected]/cron.go:136\ngithub.com/robfig/cron/v3.(*Cron).startJob.func1\n\tgithub.com/robfig/cron/[email protected]/cron.go:312"}
{"level":"error","ts":"2024-09-09T02:03:03.585Z","logger":"pingsource-mt-adapter","caller":"mtping/runner.go:157","msg":"failed to send cloudevent result: {result 26 0 Post \"http://router.fission-system.svc.cluster.local/fission-function/faasprod-dev/slipping-point/\": dial tcp 172.16.53.129:80: connect: cannot assign requested address (5x)} {source 15 0 /apis/v1/namespaces/faasprod-dev/pingsources/fnt-21228-2195be836486b1a6 <nil>} {target 15 0 http://router.fission-system.svc.cluster.local/fission-function/faasprod-dev/slipping-point/ <nil>} {id 15 0 158a4a33-56f4-4929-ad32-144035504ee8 <nil>}","commit":"cf0f9d5","stacktrace":"knative.dev/eventing/pkg/adapter/mtping.(*cronJobsRunner).cronTick.func1\n\tknative.dev/eventing/pkg/adapter/mtping/runner.go:157\ngithub.com/robfig/cron/v3.FuncJob.Run\n\tgithub.com/robfig/cron/[email protected]/cron.go:136\ngithub.com/robfig/cron/v3.(*Cron).startJob.func1\n\tgithub.com/robfig/cron/[email protected]/cron.go:312"}
Monitoring Information (screenshots omitted):
- File Descriptor Monitoring
- CPU Monitoring
- Memory Monitoring
Environment:
- Knative release version: 1.11.2
- k8s version: 1.24.0
Expected Behavior:
The pingsource-mt-adapter should run normally, without leaking file descriptors.
I cannot reproduce the problem with "normally" behaving functions.
I tried it with the current Fission 1.20.4 and Eventing 1.11 installed from the Serverless Operator 1.32.2 on OpenShift.
I created a hello-js function from the Fission examples:
fission env create --name nodejs --image ghcr.io/fission/node-env
curl -LO https://raw.githubusercontent.com/fission/examples/main/nodejs/hello.js
fission function create --name hello-js --env nodejs --code hello.js
fission route create --function hello-js --url /hello-js --name hello-js --method GET --method POST
and created several PingSources like the following:
apiVersion: sources.knative.dev/v1
kind: PingSource
metadata:
  name: test-ping-source-1
  namespace: foobar
spec:
  contentType: application/json
  data: '{"message": "Hello world from 1!"}'
  schedule: '*/1 * * * *'
  sink:
    uri: http://router.fission.svc.cluster.local/hello-js
  timezone: Asia/Shanghai
This setup shows only a stable number of open TCP connections.
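One way to watch the connection count (a sketch, under the same assumptions as above: knative-eventing namespace, image with a shell):

# Count TCP sockets held by the adapter process (PID 1);
# run repeatedly to watch the trend
kubectl -n knative-eventing exec deploy/pingsource-mt-adapter -- \
  sh -c 'cat /proc/1/net/tcp /proc/1/net/tcp6 2>/dev/null | wc -l'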
I can reproduce the problem, kind of, if I add a very long delay to the function:
module.exports = async function (context) {
    // Sleep for 518400000 ms (6 days) so the request never completes
    // within any reasonable client timeout.
    await new Promise(r => setTimeout(r, 518400000));
    return {
        status: 200,
        body: "hello, world!\n"
    };
}
and setting --fntimeout on the Fission function to override the default 60 s:
fission function create --name hello-js --env nodejs --code hello.js --fntimeout 604800
Making the functions never return shows that the PingSource has no timeout of its own: the number of open TCP connections rises steadily as long as the sink keeps the connections open indefinitely and never returns an HTTP response.
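For reference, several such PingSources can be stamped out in a loop (a sketch; the names, namespace, and sink URL follow the example above):

# Create five PingSources pointing at the never-returning hello-js route
for i in 1 2 3 4 5; do
  kubectl apply -f - <<EOF
apiVersion: sources.knative.dev/v1
kind: PingSource
metadata:
  name: test-ping-source-$i
  namespace: foobar
spec:
  contentType: application/json
  data: '{"message": "Hello world from $i!"}'
  schedule: '*/1 * * * *'
  sink:
    uri: http://router.fission.svc.cluster.local/hello-js
EOF
done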
Can you see the number of open files rising on the Fission side as well? (I'd expect a similar number of open files on the Fission router.)
@maschmid I have checked the file descriptors of the Fission router and hit the same issue there: a trigger interval shorter than the function execution time does lead to a continuous increase in the number of open file descriptors.
Fission router monitoring: (screenshot omitted)
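The router-side check can be done the same way (a sketch; it assumes the router Deployment is named router in the fission-system namespace and that the image ships a shell):

# Count open file descriptors of the Fission router process (PID 1)
kubectl -n fission-system exec deploy/router -- \
  sh -c 'ls /proc/1/fd | wc -l'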
Currently, I have deployed about 4 or 5 functions, all with a timeout set to 60 seconds. The trigger interval is set to once per minute. I haven't observed any long-running executions or cases where an HTTP response is not returned. However, quite a few functions returned a 504 due to execution timeouts. Could this situation be causing the file descriptors to continuously increase?
Here is the PingSource YAML:
apiVersion: sources.knative.dev/v1
kind: PingSource
metadata:
  annotations:
    sources.knative.dev/creator: system:serviceaccount:fission-system:lifaas
    sources.knative.dev/lastModifier: system:serviceaccount:fission-system:lifaas
  creationTimestamp: "2024-09-05T10:02:18Z"
  generation: 1
  labels:
    functionName: dev-gen-task
  name: fnt-21321-8be55ff981d6e3e8
  namespace: faasprod-dev
  resourceVersion: "6207907587"
  uid: 8b60514e-6366-4ab5-89b5-8ca186c097c1
spec:
  contentType: application/json
  schedule: 0/1 * * * *
  sink:
    uri: http://router.fission-system.svc.cluster.local/fission-function/faasprod-dev/dev-gen-task
status:
  ceAttributes:
  - source: /apis/v1/namespaces/faasprod-dev/pingsources/fnt-21321-8be55ff981d6e3e8
    type: dev.knative.sources.ping
  conditions:
  - lastTransitionTime: "2024-09-10T09:48:38Z"
    status: "True"
    type: Deployed
  - lastTransitionTime: "2024-09-10T09:48:38Z"
    status: "True"
    type: Ready
  - lastTransitionTime: "2024-09-05T10:02:18Z"
    status: "True"
    type: SinkProvided
  observedGeneration: 1
  sinkUri: http://router.fission-system.svc.cluster.local/fission-function/faasprod-dev/dev-gen-task
@maschmid Can I control the timeout or retry policy for the pingsource? I suspect that the retries might be causing the leak.
I am not aware of any mechanism there; AFAICT there is no timeout on the context that sends the events. @pierDipi any thoughts? We should probably guard the dispatchers against a sink that keeps connections open indefinitely and never replies.
There is an environment variable with the timeout here: https://github.com/knative/eventing/blob/bff7b031831fd7968d86e99ae89d79cf3f088eed/config/core/deployments/pingsource-mt-adapter.yaml#L79-L80. Maybe changing that to set a timeout can help?
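If the variable at those lines is K_SINK_TIMEOUT, it can be set directly on the adapter Deployment, for example (a sketch; the 30-second value is arbitrary, and the variable name and units should be verified against your Eventing version):

# Set a 30-second sink timeout on the pingsource adapter
kubectl -n knative-eventing set env deployment/pingsource-mt-adapter \
  K_SINK_TIMEOUT=30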
This issue is stale because it has been open for 90 days with no activity. It will automatically close after 30 more days of inactivity. Reopen the issue with /reopen. Mark the issue as fresh by adding the comment /remove-lifecycle stale.