[BUG] Stale egress jobs that can't be deleted/removed and no longer exist in the egress pod
Describe the bug We observed that some egress tasks remain active in the list even though they did not work / crashed, and no EGRESS_FAILED was ever sent when that happened. These ghost/stale egress tasks cannot be removed via an API request. For example:
lk egress stop --id EG_Ahi8U374o7qQ
Using url, api-key, api-secret from environment
Error stopping Egress EG_Ahi8U374o7qQ twirp error unavailable: no response from servers
twirp error unavailable: no response from servers
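The same stop call fails identically when issued directly against the server API. Below is a minimal sketch using the Go server SDK; the URL, API key, and secret are placeholders, not values from this deployment.

package main

import (
    "context"
    "fmt"

    "github.com/livekit/protocol/livekit"
    lksdk "github.com/livekit/server-sdk-go/v2"
)

func main() {
    // Placeholder credentials -- not the real deployment values.
    ec := lksdk.NewEgressClient("https://livekit.example.com", "api-key", "api-secret")

    // Same StopEgress API the CLI calls; for the stale job this also comes back
    // with "twirp error unavailable: no response from servers".
    _, err := ec.StopEgress(context.Background(), &livekit.StopEgressRequest{
        EgressId: "EG_Ahi8U374o7qQ",
    })
    if err != nil {
        fmt.Println("stop failed:", err)
    }
}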
When I check the tasks in the pod, there is no such task:

ps auxwww | grep 490963279_63223
egress 18760 0.0 0.0 3528 1668 pts/0 S+ 02:30 0:00 grep 490963279_63223

Here is the pod:
kubernetes.pod_name: prod-xxx-livekit-egress-green-7f87655fff-rsqpc
log: 2025-07-21T06:57:23.680Z INFO egress info/io.go:178 egress_active {"nodeID": "NE_jXdN3YUhQYhR", "clusterID": "", "egressID": "EG_Ahi8U374o7qQ", "requestType": "participant", "outputType": "stream", "error": "", "code": 0, "details": ""}
stream: stderr
time: Jul 21, 2025 @ 13:57:23.680
uuid: a37910c6-8bfd-402d-bf6d-782699c78cc2
Here is the ps output from the pod:
ps auxwww
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
egress 1 0.0 0.0 2684 708 ? Ss Jul21 0:01 /tini -- egress
egress 10 0.0 0.0 241360 9200 ? Sl Jul21 0:00 pulseaudio -D --verbose --exit-idle-time=-1 --disallow-exit
egress 13 0.2 0.0 5278060 76776 ? Sl Jul21 3:19 egress
egress 12181 110 0.2 7973128 938392 ? SLsl Jul21 412:42 egress run-handler --config nodeid: NE_Meg4C2U89DMC redis: db: 1 sentinel_master_name: mymaster sentinel_addresses: - redis-livekit-prod-node-0.redis-livekit-prod-headless.infra-prod:26379 - redis-livekit-prod-node-1.redis-livekit-prod-headless.infra-prod:26379 - redis-livekit-prod-node-2.redis-livekit-prod-headless.infra-prod:26379 dial_timeout: 2000 read_timeout: 200 write_timeout: 200 api_key: key1 api_secret: YmVlbiBoaWRkZW4gZnJvbSB0aGUgVVNBR0Ugbm90ZXMgYW5kIG1heSBiZSByZW1vdmVkIGluIGZ1 ws_url: ws://prod-aff-livekit-server-green:80 logging: level: info template_base: http://localhost:7980/ cluster_id: "" enable_chrome_sandbox: false max_upload_queue: 60 disallow_local_storage: false enable_room_composite_sdk_source: false io_create_timeout: 15s io_update_timeout: 30s session_limits: file_output_max_duration: 24h0m0s stream_output_max_duration: 24h0m0s segment_output_max_duration: 24h0m0s image_output_max_duration: 0s insecure: false debug: enable_profiling: false prefix: "" generate_presigned_url: false s3: null azure: null gcp: null alioss: null handler_id: EGH_PtJWrQXUhCNP tmp_dir: /home/egress/tmp/EG_eyHWTuXuKk2Q --request {"egressId":"EG_eyHWTuXuKk2Q","participant":{"roomName":"prod_488781408_29728","identity":"488781408_29728","streamOutputs":[{"protocol":"RTMP","urls":["rtmp://1.2.3.18:1935/xxx/prod_488781408_29728"]}]},"roomId":"RM_npMGg5z6WEgp"}
egress 15774 102 0.2 7981504 902964 ? SLsl Jul21 227:51 egress run-handler --config nodeid: NE_Meg4C2U89DMC redis: db: 1 sentinel_master_name: mymaster sentinel_addresses: - redis-livekit-prod-node-0.redis-livekit-prod-headless.infra-prod:26379 - redis-livekit-prod-node-1.redis-livekit-prod-headless.infra-prod:26379 - redis-livekit-prod-node-2.redis-livekit-prod-headless.infra-prod:26379 dial_timeout: 2000 read_timeout: 200 write_timeout: 200 api_key: key1 api_secret: YmVlbiBoaWRkZW4gZnJvbSB0aGUgVVNBR0Ugbm90ZXMgYW5kIG1heSBiZSByZW1vdmVkIGluIGZ1 ws_url: ws://prod-aff-livekit-server-green:80 logging: level: info template_base: http://localhost:7980/ cluster_id: "" enable_chrome_sandbox: false max_upload_queue: 60 disallow_local_storage: false enable_room_composite_sdk_source: false io_create_timeout: 15s io_update_timeout: 30s session_limits: file_output_max_duration: 24h0m0s stream_output_max_duration: 24h0m0s segment_output_max_duration: 24h0m0s image_output_max_duration: 0s insecure: false debug: enable_profiling: false prefix: "" generate_presigned_url: false s3: null azure: null gcp: null alioss: null handler_id: EGH_4BVyAhnD4w9E tmp_dir: /home/egress/tmp/EG_F6ruysKG58GL --request {"egressId":"EG_F6ruysKG58GL","participant":{"roomName":"prod_483823052_69752","identity":"483823052_69752","streamOutputs":[{"protocol":"RTMP","urls":["rtmp://1.2.3.25:1935/xxx/prod_483823052_69752"]}]},"roomId":"RM_gbmJnhd7CXY6"}
egress 17069 129 0.2 7689700 856920 ? SLsl 00:09 181:41 egress run-handler --config nodeid: NE_Meg4C2U89DMC redis: db: 1 sentinel_master_name: mymaster sentinel_addresses: - redis-livekit-prod-node-0.redis-livekit-prod-headless.infra-prod:26379 - redis-livekit-prod-node-1.redis-livekit-prod-headless.infra-prod:26379 - redis-livekit-prod-node-2.redis-livekit-prod-headless.infra-prod:26379 dial_timeout: 2000 read_timeout: 200 write_timeout: 200 api_key: key1 api_secret: YmVlbiBoaWRkZW4gZnJvbSB0aGUgVVNBR0Ugbm90ZXMgYW5kIG1heSBiZSByZW1vdmVkIGluIGZ1 ws_url: ws://prod-aff-livekit-server-green:80 logging: level: info template_base: http://localhost:7980/ cluster_id: "" enable_chrome_sandbox: false max_upload_queue: 60 disallow_local_storage: false enable_room_composite_sdk_source: false io_create_timeout: 15s io_update_timeout: 30s session_limits: file_output_max_duration: 24h0m0s stream_output_max_duration: 24h0m0s segment_output_max_duration: 24h0m0s image_output_max_duration: 0s insecure: false debug: enable_profiling: false prefix: "" generate_presigned_url: false s3: null azure: null gcp: null alioss: null handler_id: EGH_w5gy6RQy5ph5 tmp_dir: /home/egress/tmp/EG_DDYPRT85ohRd --request {"egressId":"EG_DDYPRT85ohRd","participant":{"roomName":"prod_306406835_68600","identity":"306406835_68600","streamOutputs":[{"protocol":"RTMP","urls":["rtmp://1.2.3.18:1935/xxx/prod_306406835_68600"]}]},"roomId":"RM_gzcWFHyWJB7M"}
egress 17787 110 0.2 7849664 910984 ? SLsl 01:02 95:57 egress run-handler --config nodeid: NE_Meg4C2U89DMC redis: db: 1 sentinel_master_name: mymaster sentinel_addresses: - redis-livekit-prod-node-0.redis-livekit-prod-headless.infra-prod:26379 - redis-livekit-prod-node-1.redis-livekit-prod-headless.infra-prod:26379 - redis-livekit-prod-node-2.redis-livekit-prod-headless.infra-prod:26379 dial_timeout: 2000 read_timeout: 200 write_timeout: 200 api_key: key1 api_secret: YmVlbiBoaWRkZW4gZnJvbSB0aGUgVVNBR0Ugbm90ZXMgYW5kIG1heSBiZSByZW1vdmVkIGluIGZ1 ws_url: ws://prod-aff-livekit-server-green:80 logging: level: info template_base: http://localhost:7980/ cluster_id: "" enable_chrome_sandbox: false max_upload_queue: 60 disallow_local_storage: false enable_room_composite_sdk_source: false io_create_timeout: 15s io_update_timeout: 30s session_limits: file_output_max_duration: 24h0m0s stream_output_max_duration: 24h0m0s segment_output_max_duration: 24h0m0s image_output_max_duration: 0s insecure: false debug: enable_profiling: false prefix: "" generate_presigned_url: false s3: null azure: null gcp: null alioss: null handler_id: EGH_awFFncmMZbvS tmp_dir: /home/egress/tmp/EG_XRKhAtBRPyTy --request {"egressId":"EG_XRKhAtBRPyTy","participant":{"roomName":"prod_196839063_76581","identity":"196839063_76581","streamOutputs":[{"protocol":"RTMP","urls":["rtmp://1.2.3.27:1935/xxx/prod_196839063_76581"]}]},"roomId":"RM_CjuaVHeRW8uv"}
egress 18566 104 0.1 7530936 728248 ? SLsl 01:54 36:57 egress run-handler --config nodeid: NE_Meg4C2U89DMC redis: db: 1 sentinel_master_name: mymaster sentinel_addresses: - redis-livekit-prod-node-0.redis-livekit-prod-headless.infra-prod:26379 - redis-livekit-prod-node-1.redis-livekit-prod-headless.infra-prod:26379 - redis-livekit-prod-node-2.redis-livekit-prod-headless.infra-prod:26379 dial_timeout: 2000 read_timeout: 200 write_timeout: 200 api_key: key1 api_secret: YmVlbiBoaWRkZW4gZnJvbSB0aGUgVVNBR0Ugbm90ZXMgYW5kIG1heSBiZSByZW1vdmVkIGluIGZ1 ws_url: ws://prod-aff-livekit-server-green:80 logging: level: info template_base: http://localhost:7980/ cluster_id: "" enable_chrome_sandbox: false max_upload_queue: 60 disallow_local_storage: false enable_room_composite_sdk_source: false io_create_timeout: 15s io_update_timeout: 30s session_limits: file_output_max_duration: 24h0m0s stream_output_max_duration: 24h0m0s segment_output_max_duration: 24h0m0s image_output_max_duration: 0s insecure: false debug: enable_profiling: false prefix: "" generate_presigned_url: false s3: null azure: null gcp: null alioss: null handler_id: EGH_QKvCrNJhSmeB tmp_dir: /home/egress/tmp/EG_HxbXWZCPfmWC --request {"egressId":"EG_HxbXWZCPfmWC","participant":{"roomName":"prod_493518475_30962","identity":"493518475_30962","streamOutputs":[{"protocol":"RTMP","urls":["rtmp://1.2.3.8:1935/xxx/prod_493518475_30962"]}]},"roomId":"RM_bkoidCv9zGLz"}
Egress Version 1.9.0
Egress Request
[
{
"egress_id": "EG_Ahi8U374o7qQ",
"room_id": "RM_BUv8L9FgJWPD",
"room_name": "prod_490963279_63223",
"source_type": 1,
"status": 1,
"started_at": 1753081042369616856,
"updated_at": 1753081045078470922,
"Request": {
"Participant": {
"room_name": "prod_490963279_63223",
"identity": "490963279_63223",
"Options": null,
"stream_outputs": [
{
"protocol": 1,
"urls": [
"rtmp://1.2.3.4:1935/xxx/{pro...223}"
]
}
]
}
},
"Result": {
"Stream": {
"info": [
{
"url": "rtmp://1.2.3.4:1935/xxx/{pro...223}",
"started_at": 1753081045078470641
}
]
}
},
"stream_results": [
{
"url": "rtmp://1.2.3.4:1935/xxx/{pro...223}",
"started_at": 1753081045078470641
}
]
}
]
Additional context This happens regularly on some broadcasts, and being unable to stop the egress is a pain. I wish there were a way to remove egress tasks on failure, e.g. with a --force flag or some other payload.
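For reference, this is roughly how we enumerate the ghost entries today, again sketched with the Go server SDK and placeholder credentials. The stale job keeps appearing in the active list long after the handler process is gone.

package main

import (
    "context"
    "fmt"

    "github.com/livekit/protocol/livekit"
    lksdk "github.com/livekit/server-sdk-go/v2"
)

func main() {
    ec := lksdk.NewEgressClient("https://livekit.example.com", "api-key", "api-secret")

    // Active: true asks the server for egresses it still considers running;
    // EG_Ahi8U374o7qQ shows up here even though no handler process exists.
    res, err := ec.ListEgress(context.Background(), &livekit.ListEgressRequest{Active: true})
    if err != nil {
        panic(err)
    }
    for _, info := range res.Items {
        fmt.Println(info.EgressId, info.Status, info.RoomName)
    }
}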
Logs
2025-07-22T02:27:12.323Z WARN livekit.psrpc.EgressHandler.StopEgress rpc/logging.go:66 client error {"topic": ["EG_Ahi8U374o7qQ"], "request": {"egressId": "EG_Ahi8U374o7qQ"}, "response": null, "duration": "3.00178341s", "error": "no response from servers"}
From the egress pod, related to that broadcaster:
2025-07-21T06:57:25.080Z INFO egress info/io.go:178 egress_active {"nodeID": "NE_jXdN3YUhQYhR", "clusterID": "", "egressID": "EG_Ahi8U374o7qQ", "requestType": "participant", "outputType": "stream", "error": "", "code": 0, "details": ""}
2025-07-21T06:57:23.680Z INFO egress info/io.go:178 egress_active {"nodeID": "NE_jXdN3YUhQYhR", "clusterID": "", "egressID": "EG_Ahi8U374o7qQ", "requestType": "participant", "outputType": "stream", "error": "", "code": 0, "details": ""}
2025-07-21T06:57:23.676Z INFO egress pipeline/watch.go:257 TR_AMmDS9dKFMk7Wj playing {"nodeID": "NE_jXdN3YUhQYhR", "handlerID": "EGH_6MzEY2QqqEqL", "clusterID": "", "egressID": "EG_Ahi8U374o7qQ"}
2025-07-21T06:57:23.676Z INFO egress pipeline/watch.go:257 TR_VCwveNbXBMEHdw playing {"nodeID": "NE_jXdN3YUhQYhR", "handlerID": "EGH_6MzEY2QqqEqL", "clusterID": "", "egressID": "EG_Ahi8U374o7qQ"}
2025-07-21T06:57:23.676Z INFO egress pipeline/watch.go:252 pipeline playing {"nodeID": "NE_jXdN3YUhQYhR", "handlerID": "EGH_6MzEY2QqqEqL", "clusterID": "", "egressID": "EG_Ahi8U374o7qQ"}
2025-07-21T06:57:23.582Z INFO [email protected]/remoteparticipant.go:119 track subscribed {"nodeID": "NE_jXdN3YUhQYhR", "handlerID": "EGH_6MzEY2QqqEqL", "clusterID": "", "egressID": "EG_Ahi8U374o7qQ", "participant": "490963279_63223", "track": "TR_VCwveNbXBMEHdw", "kind": "video"}
2025-07-21T06:57:23.493Z INFO [email protected]/remoteparticipant.go:119 track subscribed {"nodeID": "NE_jXdN3YUhQYhR", "handlerID": "EGH_6MzEY2QqqEqL", "clusterID": "", "egressID": "EG_Ahi8U374o7qQ", "participant": "490963279_63223", "track": "TR_AMmDS9dKFMk7Wj", "kind": "audio"}
2025-07-21T06:57:23.466Z INFO egress source/sdk.go:410 subscribing to track {"nodeID": "NE_jXdN3YUhQYhR", "handlerID": "EGH_6MzEY2QqqEqL", "clusterID": "", "egressID": "EG_Ahi8U374o7qQ", "trackID": "TR_VCwveNbXBMEHdw"}
2025-07-21T06:57:23.466Z INFO egress source/sdk.go:410 subscribing to track {"nodeID": "NE_jXdN3YUhQYhR", "handlerID": "EGH_6MzEY2QqqEqL", "clusterID": "", "egressID": "EG_Ahi8U374o7qQ", "trackID": "TR_AMmDS9dKFMk7Wj"}
2025-07-21T06:57:22.370Z INFO egress redis/redis.go:99 connecting to redis {"nodeID": "NE_jXdN3YUhQYhR", "handlerID": "EGH_6MzEY2QqqEqL", "clusterID": "", "egressID": "EG_Ahi8U374o7qQ", "sentinel": true, "addr": ["redis-livekit-prod-node-0.redis-livekit-prod-headless.infra-prod:26379", "redis-livekit-prod-node-1.redis-livekit-prod-headless.infra-prod:26379", "redis-livekit-prod-node-2.redis-livekit-prod-headless.infra-prod:26379"], "masterName": "mymaster"}
2025-07-21T06:57:22.346Z INFO egress server/server_rpc.go:58 request received {"nodeID": "NE_jXdN3YUhQYhR", "clusterID": "", "egressID": "EG_Ahi8U374o7qQ"}
2025-07-21T06:57:22.346Z INFO egress server/server_rpc.go:68 request validated {"nodeID": "NE_jXdN3YUhQYhR", "clusterID": "", "egressID": "EG_Ahi8U374o7qQ", "requestType": "participant", "outputType": "stream", "room": "prod_490963279_63223", "request": {"Participant":{"room_name":"prod_490963279_63223","identity":"490963279_63223","Options":null,"stream_outputs":[{"protocol":1,"urls":["rtmp://1.2.3.4:1935/xxx/{pro...223}"]}]}}}
I masked the IPs (except for the last digit in some of the logs above; please ignore it). The LiveKit server version is 1.8.4. The egress pod uptime is 47d:

prod-xxx-livekit-egress-green-7f87655fff-rsqpc 1/1 Running 1 47d 10.224.112.179

livekit/livekit and livekit/egress are hosted in the same datacenter and the same Kubernetes cluster. I don't want to meddle with the Redis DB and delete tasks manually; LiveKit should be able to do this somehow.
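For completeness, the manual cleanup we would rather avoid would look roughly like the sketch below. It assumes, without verification, that livekit-server keeps active egress state in a Redis hash named "egress" keyed by egress ID; the sentinel address is copied from the config above and the other two nodes are omitted.

package main

import (
    "context"
    "fmt"

    "github.com/redis/go-redis/v9"
)

func main() {
    ctx := context.Background()
    rdb := redis.NewFailoverClient(&redis.FailoverOptions{
        MasterName: "mymaster",
        SentinelAddrs: []string{
            "redis-livekit-prod-node-0.redis-livekit-prod-headless.infra-prod:26379",
        },
        DB: 1,
    })

    // Inspect the stale entry. Assumed (unverified) key layout:
    // HSET egress <egressID> <serialized EgressInfo>.
    val, err := rdb.HGet(ctx, "egress", "EG_Ahi8U374o7qQ").Result()
    fmt.Println(len(val), err)

    // The step we would rather not take by hand:
    // rdb.HDel(ctx, "egress", "EG_Ahi8U374o7qQ")
}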