
[BUG] Stale Egress Jobs - that can't be deleted/removed and they don't exist in egress pod

Open mpisat opened this issue 6 months ago • 1 comments

Describe the bug We observed that some egress tasks remain active in the list even though they stopped working or crashed, and no EGRESS_FAILED was ever sent when that happened. These ghost/stale egress tasks cannot be removed by API request. For example:

lk egress stop --id EG_Ahi8U374o7qQ
Using url, api-key, api-secret from environment
Error stopping Egress EG_Ahi8U374o7qQ twirp error unavailable: no response from servers
twirp error unavailable: no response from servers

When I check the tasks in the pod:

ps auxwww | grep 490963279_63223
egress 18760 0.0 0.0 3528 1668 pts/0 S+ 02:30 0:00 grep 490963279_63223

there is no such task. Here is the pod:

kubernetes.pod_name | prod-xxx-livekit-egress-green-7f87655fff-rsqpc
-- | --
log | 2025-07-21T06:57:23.680Z	INFO	egress	info/io.go:178	egress_active	{"nodeID": "NE_jXdN3YUhQYhR", "clusterID": "", "egressID": "EG_Ahi8U374o7qQ", "requestType": "participant", "outputType": "stream", "error": "", "code": 0, "details": ""}
stream | stderr
time | Jul 21, 2025 @ 13:57:23.680
uuid | a37910c6-8bfd-402d-bf6d-782699c78cc2

Here is the ps output from the pod:

ps auxwww
USER         PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
egress         1  0.0  0.0   2684   708 ?        Ss   Jul21   0:01 /tini -- egress
egress        10  0.0  0.0 241360  9200 ?        Sl   Jul21   0:00 pulseaudio -D --verbose --exit-idle-time=-1 --disallow-exit
egress        13  0.2  0.0 5278060 76776 ?       Sl   Jul21   3:19 egress
egress     12181  110  0.2 7973128 938392 ?      SLsl Jul21 412:42 egress run-handler --config nodeid: NE_Meg4C2U89DMC redis:     db: 1     sentinel_master_name: mymaster     sentinel_addresses:         - redis-livekit-prod-node-0.redis-livekit-prod-headless.infra-prod:26379         - redis-livekit-prod-node-1.redis-livekit-prod-headless.infra-prod:26379         - redis-livekit-prod-node-2.redis-livekit-prod-headless.infra-prod:26379     dial_timeout: 2000     read_timeout: 200     write_timeout: 200 api_key: key1 api_secret: YmVlbiBoaWRkZW4gZnJvbSB0aGUgVVNBR0Ugbm90ZXMgYW5kIG1heSBiZSByZW1vdmVkIGluIGZ1 ws_url: ws://prod-aff-livekit-server-green:80 logging:     level: info template_base: http://localhost:7980/ cluster_id: "" enable_chrome_sandbox: false max_upload_queue: 60 disallow_local_storage: false enable_room_composite_sdk_source: false io_create_timeout: 15s io_update_timeout: 30s session_limits:     file_output_max_duration: 24h0m0s     stream_output_max_duration: 24h0m0s     segment_output_max_duration: 24h0m0s     image_output_max_duration: 0s insecure: false debug:     enable_profiling: false     prefix: ""     generate_presigned_url: false     s3: null     azure: null     gcp: null     alioss: null handler_id: EGH_PtJWrQXUhCNP tmp_dir: /home/egress/tmp/EG_eyHWTuXuKk2Q  --request {"egressId":"EG_eyHWTuXuKk2Q","participant":{"roomName":"prod_488781408_29728","identity":"488781408_29728","streamOutputs":[{"protocol":"RTMP","urls":["rtmp://1.2.3.18:1935/xxx/prod_488781408_29728"]}]},"roomId":"RM_npMGg5z6WEgp"}
egress     15774  102  0.2 7981504 902964 ?      SLsl Jul21 227:51 egress run-handler --config nodeid: NE_Meg4C2U89DMC redis:     db: 1     sentinel_master_name: mymaster     sentinel_addresses:         - redis-livekit-prod-node-0.redis-livekit-prod-headless.infra-prod:26379         - redis-livekit-prod-node-1.redis-livekit-prod-headless.infra-prod:26379         - redis-livekit-prod-node-2.redis-livekit-prod-headless.infra-prod:26379     dial_timeout: 2000     read_timeout: 200     write_timeout: 200 api_key: key1 api_secret: YmVlbiBoaWRkZW4gZnJvbSB0aGUgVVNBR0Ugbm90ZXMgYW5kIG1heSBiZSByZW1vdmVkIGluIGZ1 ws_url: ws://prod-aff-livekit-server-green:80 logging:     level: info template_base: http://localhost:7980/ cluster_id: "" enable_chrome_sandbox: false max_upload_queue: 60 disallow_local_storage: false enable_room_composite_sdk_source: false io_create_timeout: 15s io_update_timeout: 30s session_limits:     file_output_max_duration: 24h0m0s     stream_output_max_duration: 24h0m0s     segment_output_max_duration: 24h0m0s     image_output_max_duration: 0s insecure: false debug:     enable_profiling: false     prefix: ""     generate_presigned_url: false     s3: null     azure: null     gcp: null     alioss: null handler_id: EGH_4BVyAhnD4w9E tmp_dir: /home/egress/tmp/EG_F6ruysKG58GL  --request {"egressId":"EG_F6ruysKG58GL","participant":{"roomName":"prod_483823052_69752","identity":"483823052_69752","streamOutputs":[{"protocol":"RTMP","urls":["rtmp://1.2.3.25:1935/xxx/prod_483823052_69752"]}]},"roomId":"RM_gbmJnhd7CXY6"}
egress     17069  129  0.2 7689700 856920 ?      SLsl 00:09 181:41 egress run-handler --config nodeid: NE_Meg4C2U89DMC redis:     db: 1     sentinel_master_name: mymaster     sentinel_addresses:         - redis-livekit-prod-node-0.redis-livekit-prod-headless.infra-prod:26379         - redis-livekit-prod-node-1.redis-livekit-prod-headless.infra-prod:26379         - redis-livekit-prod-node-2.redis-livekit-prod-headless.infra-prod:26379     dial_timeout: 2000     read_timeout: 200     write_timeout: 200 api_key: key1 api_secret: YmVlbiBoaWRkZW4gZnJvbSB0aGUgVVNBR0Ugbm90ZXMgYW5kIG1heSBiZSByZW1vdmVkIGluIGZ1 ws_url: ws://prod-aff-livekit-server-green:80 logging:     level: info template_base: http://localhost:7980/ cluster_id: "" enable_chrome_sandbox: false max_upload_queue: 60 disallow_local_storage: false enable_room_composite_sdk_source: false io_create_timeout: 15s io_update_timeout: 30s session_limits:     file_output_max_duration: 24h0m0s     stream_output_max_duration: 24h0m0s     segment_output_max_duration: 24h0m0s     image_output_max_duration: 0s insecure: false debug:     enable_profiling: false     prefix: ""     generate_presigned_url: false     s3: null     azure: null     gcp: null     alioss: null handler_id: EGH_w5gy6RQy5ph5 tmp_dir: /home/egress/tmp/EG_DDYPRT85ohRd  --request {"egressId":"EG_DDYPRT85ohRd","participant":{"roomName":"prod_306406835_68600","identity":"306406835_68600","streamOutputs":[{"protocol":"RTMP","urls":["rtmp://1.2.3.18:1935/xxx/prod_306406835_68600"]}]},"roomId":"RM_gzcWFHyWJB7M"}
egress     17787  110  0.2 7849664 910984 ?      SLsl 01:02  95:57 egress run-handler --config nodeid: NE_Meg4C2U89DMC redis:     db: 1     sentinel_master_name: mymaster     sentinel_addresses:         - redis-livekit-prod-node-0.redis-livekit-prod-headless.infra-prod:26379         - redis-livekit-prod-node-1.redis-livekit-prod-headless.infra-prod:26379         - redis-livekit-prod-node-2.redis-livekit-prod-headless.infra-prod:26379     dial_timeout: 2000     read_timeout: 200     write_timeout: 200 api_key: key1 api_secret: YmVlbiBoaWRkZW4gZnJvbSB0aGUgVVNBR0Ugbm90ZXMgYW5kIG1heSBiZSByZW1vdmVkIGluIGZ1 ws_url: ws://prod-aff-livekit-server-green:80 logging:     level: info template_base: http://localhost:7980/ cluster_id: "" enable_chrome_sandbox: false max_upload_queue: 60 disallow_local_storage: false enable_room_composite_sdk_source: false io_create_timeout: 15s io_update_timeout: 30s session_limits:     file_output_max_duration: 24h0m0s     stream_output_max_duration: 24h0m0s     segment_output_max_duration: 24h0m0s     image_output_max_duration: 0s insecure: false debug:     enable_profiling: false     prefix: ""     generate_presigned_url: false     s3: null     azure: null     gcp: null     alioss: null handler_id: EGH_awFFncmMZbvS tmp_dir: /home/egress/tmp/EG_XRKhAtBRPyTy  --request {"egressId":"EG_XRKhAtBRPyTy","participant":{"roomName":"prod_196839063_76581","identity":"196839063_76581","streamOutputs":[{"protocol":"RTMP","urls":["rtmp://1.2.3.27:1935/xxx/prod_196839063_76581"]}]},"roomId":"RM_CjuaVHeRW8uv"}
egress     18566  104  0.1 7530936 728248 ?      SLsl 01:54  36:57 egress run-handler --config nodeid: NE_Meg4C2U89DMC redis:     db: 1     sentinel_master_name: mymaster     sentinel_addresses:         - redis-livekit-prod-node-0.redis-livekit-prod-headless.infra-prod:26379         - redis-livekit-prod-node-1.redis-livekit-prod-headless.infra-prod:26379         - redis-livekit-prod-node-2.redis-livekit-prod-headless.infra-prod:26379     dial_timeout: 2000     read_timeout: 200     write_timeout: 200 api_key: key1 api_secret: YmVlbiBoaWRkZW4gZnJvbSB0aGUgVVNBR0Ugbm90ZXMgYW5kIG1heSBiZSByZW1vdmVkIGluIGZ1 ws_url: ws://prod-aff-livekit-server-green:80 logging:     level: info template_base: http://localhost:7980/ cluster_id: "" enable_chrome_sandbox: false max_upload_queue: 60 disallow_local_storage: false enable_room_composite_sdk_source: false io_create_timeout: 15s io_update_timeout: 30s session_limits:     file_output_max_duration: 24h0m0s     stream_output_max_duration: 24h0m0s     segment_output_max_duration: 24h0m0s     image_output_max_duration: 0s insecure: false debug:     enable_profiling: false     prefix: ""     generate_presigned_url: false     s3: null     azure: null     gcp: null     alioss: null handler_id: EGH_QKvCrNJhSmeB tmp_dir: /home/egress/tmp/EG_HxbXWZCPfmWC  --request {"egressId":"EG_HxbXWZCPfmWC","participant":{"roomName":"prod_493518475_30962","identity":"493518475_30962","streamOutputs":[{"protocol":"RTMP","urls":["rtmp://1.2.3.8:1935/xxx/prod_493518475_30962"]}]},"roomId":"RM_bkoidCv9zGLz"}
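
The manual check above (grepping `ps` for the broadcaster) can be sketched as a small script. This is only an illustration, not part of LiveKit: the function names are hypothetical, the sample text is an abbreviated stand-in for the real `ps auxwww` output, and it assumes each handler process carries its `egressId` in the `--request` JSON, as the handler lines above do.

```python
import re

# Matches the egressId field inside the --request JSON of a handler command line.
EGRESS_ID_RE = re.compile(r'"egressId":"(EG_[A-Za-z0-9]+)"')


def running_egress_ids(ps_output):
    """Collect egress IDs from `egress run-handler` lines of ps output."""
    ids = set()
    for line in ps_output.splitlines():
        if "egress run-handler" in line:
            ids.update(EGRESS_ID_RE.findall(line))
    return ids


def orphaned(listed_active, ps_output):
    """Return listed egress IDs that have no matching handler process."""
    return sorted(set(listed_active) - running_egress_ids(ps_output))


# Abbreviated sample: the real lines also contain the full --config dump.
SAMPLE_PS = (
    'egress 12181 egress run-handler --request {"egressId":"EG_eyHWTuXuKk2Q"}\n'
    'egress 15774 egress run-handler --request {"egressId":"EG_F6ruysKG58GL"}\n'
    "egress 18760 grep 490963279_63223\n"
)

print(orphaned(["EG_Ahi8U374o7qQ", "EG_eyHWTuXuKk2Q"], SAMPLE_PS))
```

An ID reported here only means no handler was found in that particular pod. If several egress pods are running, the same check would have to be repeated against each of them before treating the egress as a ghost.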

Egress Version 1.9.0

Egress Request

[
  {
    "egress_id": "EG_Ahi8U374o7qQ",
    "room_id": "RM_BUv8L9FgJWPD",
    "room_name": "prod_490963279_63223",
    "source_type": 1,
    "status": 1,
    "started_at": 1753081042369616856,
    "updated_at": 1753081045078470922,
    "Request": {
      "Participant": {
        "room_name": "prod_490963279_63223",
        "identity": "490963279_63223",
        "Options": null,
        "stream_outputs": [
          {
            "protocol": 1,
            "urls": [
              "rtmp://1.2.3.4:1935/xxx/{pro...223}"
            ]
          }
        ]
      }
    },
    "Result": {
      "Stream": {
        "info": [
          {
            "url": "rtmp://1.2.3.4:1935/xxx/{pro...223}",
            "started_at": 1753081045078470641
          }
        ]
      }
    },
    "stream_results": [
      {
        "url": "rtmp://1.2.3.4:1935/xxx/{pro...223}",
        "started_at": 1753081045078470641
      }
    ]
  }
]
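
The `started_at` and `updated_at` values in this payload look like Unix timestamps in nanoseconds. That is an assumption, but it agrees with the 06:57:22 "request received" entry under Logs. A minimal sketch for decoding them and measuring how long the egress had been listed by the time of the failed stop request (02:27:12 on Jul 22, taken from the StopEgress warning under Logs); the helper name is hypothetical:

```python
from datetime import datetime, timezone


def ns_to_utc(ns):
    """Convert a nanosecond Unix timestamp to a UTC datetime (sub-second part dropped)."""
    return datetime.fromtimestamp(ns // 1_000_000_000, tz=timezone.utc)


# Values copied from the egress info above.
started_at = ns_to_utc(1753081042369616856)
updated_at = ns_to_utc(1753081045078470922)

# Time of the failed StopEgress call, from the WARN log line.
stop_attempt = datetime(2025, 7, 22, 2, 27, 12, tzinfo=timezone.utc)

print("started:", started_at.isoformat())
print("last update:", updated_at.isoformat())
print("listed as active for:", stop_attempt - started_at)
```

Under that assumption the record was last updated about three seconds after the egress started and had been listed for roughly nineteen and a half hours when the stop request timed out.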

Additional context This happens regularly on some broadcasts, and being unable to stop an egress is a pain. I wish there was a way to remove egress tasks on failure with --force or something (or some other payload).

Logs

2025-07-22T02:27:12.323Z WARN livekit.psrpc.EgressHandler.StopEgress rpc/logging.go:66 client error {"topic": ["EG_Ahi8U374o7qQ"], "request": {"egressId": "EG_Ahi8U374o7qQ"}, "response": null, "duration": "3.00178341s", "error": "no response from servers"}

From the egress pod, related to that broadcaster:

2025-07-21T06:57:25.080Z	INFO	egress	info/io.go:178	egress_active	{"nodeID": "NE_jXdN3YUhQYhR", "clusterID": "", "egressID": "EG_Ahi8U374o7qQ", "requestType": "participant", "outputType": "stream", "error": "", "code": 0, "details": ""}

2025-07-21T06:57:23.680Z	INFO	egress	info/io.go:178	egress_active	{"nodeID": "NE_jXdN3YUhQYhR", "clusterID": "", "egressID": "EG_Ahi8U374o7qQ", "requestType": "participant", "outputType": "stream", "error": "", "code": 0, "details": ""}

2025-07-21T06:57:23.676Z	INFO	egress	pipeline/watch.go:257	TR_AMmDS9dKFMk7Wj playing	{"nodeID": "NE_jXdN3YUhQYhR", "handlerID": "EGH_6MzEY2QqqEqL", "clusterID": "", "egressID": "EG_Ahi8U374o7qQ"}

2025-07-21T06:57:23.676Z	INFO	egress	pipeline/watch.go:257	TR_VCwveNbXBMEHdw playing	{"nodeID": "NE_jXdN3YUhQYhR", "handlerID": "EGH_6MzEY2QqqEqL", "clusterID": "", "egressID": "EG_Ahi8U374o7qQ"}

2025-07-21T06:57:23.676Z	INFO	egress	pipeline/watch.go:252	pipeline playing	{"nodeID": "NE_jXdN3YUhQYhR", "handlerID": "EGH_6MzEY2QqqEqL", "clusterID": "", "egressID": "EG_Ahi8U374o7qQ"}

2025-07-21T06:57:23.582Z	INFO	[email protected]/remoteparticipant.go:119	track subscribed	{"nodeID": "NE_jXdN3YUhQYhR", "handlerID": "EGH_6MzEY2QqqEqL", "clusterID": "", "egressID": "EG_Ahi8U374o7qQ", "participant": "490963279_63223", "track": "TR_VCwveNbXBMEHdw", "kind": "video"}

2025-07-21T06:57:23.493Z	INFO	[email protected]/remoteparticipant.go:119	track subscribed	{"nodeID": "NE_jXdN3YUhQYhR", "handlerID": "EGH_6MzEY2QqqEqL", "clusterID": "", "egressID": "EG_Ahi8U374o7qQ", "participant": "490963279_63223", "track": "TR_AMmDS9dKFMk7Wj", "kind": "audio"}

2025-07-21T06:57:23.466Z	INFO	egress	source/sdk.go:410	subscribing to track	{"nodeID": "NE_jXdN3YUhQYhR", "handlerID": "EGH_6MzEY2QqqEqL", "clusterID": "", "egressID": "EG_Ahi8U374o7qQ", "trackID": "TR_VCwveNbXBMEHdw"}

2025-07-21T06:57:23.466Z	INFO	egress	source/sdk.go:410	subscribing to track	{"nodeID": "NE_jXdN3YUhQYhR", "handlerID": "EGH_6MzEY2QqqEqL", "clusterID": "", "egressID": "EG_Ahi8U374o7qQ", "trackID": "TR_AMmDS9dKFMk7Wj"}

2025-07-21T06:57:22.370Z	INFO	egress	redis/redis.go:99	connecting to redis	{"nodeID": "NE_jXdN3YUhQYhR", "handlerID": "EGH_6MzEY2QqqEqL", "clusterID": "", "egressID": "EG_Ahi8U374o7qQ", "sentinel": true, "addr": ["redis-livekit-prod-node-0.redis-livekit-prod-headless.infra-prod:26379", "redis-livekit-prod-node-1.redis-livekit-prod-headless.infra-prod:26379", "redis-livekit-prod-node-2.redis-livekit-prod-headless.infra-prod:26379"], "masterName": "mymaster"}

2025-07-21T06:57:22.346Z	INFO	egress	server/server_rpc.go:58	request received	{"nodeID": "NE_jXdN3YUhQYhR", "clusterID": "", "egressID": "EG_Ahi8U374o7qQ"}

2025-07-21T06:57:22.346Z	INFO	egress	server/server_rpc.go:68	request validated	{"nodeID": "NE_jXdN3YUhQYhR", "clusterID": "", "egressID": "EG_Ahi8U374o7qQ", "requestType": "participant", "outputType": "stream", "room": "prod_490963279_63223", "request": {"Participant":{"room_name":"prod_490963279_63223","identity":"490963279_63223","Options":null,"stream_outputs":[{"protocol":1,"urls":["rtmp://1.2.3.4:1935/xxx/{pro...223}"]}]}}}

mpisat • Jul 22 '25 02:07

I masked IPs (except for the last digit in some logs above; please ignore it). LiveKit server version is 1.8.4. Egress pod uptime is 47d:

prod-xxx-livekit-egress-green-7f87655fff-rsqpc 1/1 Running 1 47d 10.224.112.179

livekit/livekit and livekit/egress are hosted in the same datacenter, on the same Kubernetes cluster. I don't want to meddle with the Redis DB and delete tasks manually; LiveKit should be able to do it somehow.

mpisat • Jul 22 '25 02:07