Runs are lost if step:complete fails

Open josephjclark opened this issue 11 months ago • 13 comments

This is a cross-repo issue, but I'm starting it in Lightning, closer to the user's side.

We caught a Lost run today (on GCP), which happened because:

  • step:complete timed out
  • Probably because the step output was too big at 6mb
  • The subsequent worker events were all rejected because the step hadn't been completed in Lightning
  • So the next step was rejected
  • And any subsequent events have invalid step ids

The result is that all logs and events are rejected and the run is eventually lost. The user is given no feedback.

The worker should recover its capacity though - the engine WILL finish and report back; it's just the Lightning eventing that failed. This is actually a bit bad because the run has probably done everything it meant to do - the workflow completed, it just didn't tell anyone at openfn.

This isn't actually new - we've seen this before and it's a known issue. I wonder if there's something open. We may have thought we'd seen the last of it, having dealt with a bunch of error cases directly, but alas.

Some suggested fixes (we may even want to implement ALL of these)

  • In Lightning, instead of erroring on step:start when the previous step didn't complete, we should write off the previous step as an error and continue accepting logs. I don't quite know what "write off a step as an error but finish the workflow" means. Tell you what, let's Crash the step as "Lost" (will we know why it was lost? Can we log "timeout" instead?)
  • Actually: does Lightning know the step:complete event timed-out? If it did, it could mark the step as Complete with status Crashed: Timeout there and then. Now the rest of the run should be processed happily, and the user gets pretty good feedback.
  • In the Worker, we should probably Kill the run if a key workflow event times out. Timed-out logs are fine. Note that by the time the event times out and we kill the process, the workflow might have actually finished 🤷 Nothing we can do about that
  • Maybe we should split step:complete up into two events: step:complete carries a dataclip id (a uuid generated by the worker) and returns instantly. Then the worker starts some kind of upload-dataclip event, probably with a long timeout, which can go off and upload the dataclip for that id. Ideally this is non-blocking, so maybe we just POST it rather than use the websocket. Lightning will need to understand that dataclips can be created in some kind of "pending" state while data is uploaded. (A rough sketch of this follows the list.)
  • I don't think the websocket timeout length is configurable for deployments. It should be. Maybe I want a local deployment with a 1 hour timeout on events. That's my business right? Do we want this per event? Or do we have like regular-timeout and extended-timeout, with two values, where regular lifecycle events must use the regular timeout, but anything with a data payload (logs, complete) uses the extended timeout.

josephjclark avatar Feb 17 '25 17:02 josephjclark

@theroinaochieng there will be some discussion around this one and it'll probably spin out into more issues. I think it's very very important but probably not super urgent

josephjclark avatar Feb 17 '25 17:02 josephjclark

Issue raised by @mtuchi . He has a workaround which should be OK.

josephjclark avatar Feb 17 '25 17:02 josephjclark

From @stuartc : separate sockets per run (rather than a channel within a shared socket) would reduce the congestion effect of data-upload steps. One big dataclip won't delay messages from concurrent runs (which is likely happening now).

josephjclark avatar Feb 18 '25 08:02 josephjclark

I believe we should do all the things above (maybe not urgently but soon)

But the immediate error appears to be a timeout in postgres:

[error] Postgrex.Protocol (#PID<0.756.0>) disconnected: ** (DBConnection.ConnectionError) client #PID<0.2025.0> timed out because it queued and checked out the connection for longer than 15000ms

#PID<0.2025.0> was at location:

    (erts 14.2.5) :prim_inet.recv0/3
    (postgrex 0.19.3) lib/postgrex/protocol.ex:3302: Postgrex.Protocol.msg_recv/4
    (postgrex 0.19.3) lib/postgrex/protocol.ex:2304: Postgrex.Protocol.recv_bind/3
    (postgrex 0.19.3) lib/postgrex/protocol.ex:2196: Postgrex.Protocol.bind_execute/4
    (db_connection 2.7.0) lib/db_connection/holder.ex:354: DBConnection.Holder.holder_apply/4
    (db_connection 2.7.0) lib/db_connection.ex:1558: DBConnection.run_execute/5
    (db_connection 2.7.0) lib/db_connection.ex:772: DBConnection.parsed_prepare_execute/5
    (db_connection 2.7.0) lib/db_connection.ex:764: DBConnection.prepare_execute/4
    (postgrex 0.19.3) lib/postgrex.ex:306: Postgrex.query/4
    (ecto_sql 3.12.1) lib/ecto/adapters/sql.ex:1150: Ecto.Adapters.SQL.struct/10
    (ecto 3.12.5) lib/ecto/repo/schema.ex:837: Ecto.Repo.Schema.apply/4
    (ecto 3.12.5) lib/ecto/repo/schema.ex:416: anonymous fn/15 in Ecto.Repo.Schema.do_insert/4
    (lightning 2.10.15) lib/lightning/runs/handlers.ex:326: anonymous fn/2 in Lightning.Runs.Handlers.CompleteStep.update_step/2
    (lightning 2.10.15) lib/lightning/repo.ex:31: anonymous fn/1 in Lightning.Repo.transact/2
    (ecto_sql 3.12.1) lib/ecto/adapters/sql.ex:1400: anonymous fn/3 in Ecto.Adapters.SQL.checkout_or_transaction/4
    (db_connection 2.7.0) lib/db_connection.ex:1756: DBConnection.run_transaction/4
    (lightning 2.10.15) lib/lightning/runs/handlers.ex:314: Lightning.Runs.Handlers.CompleteStep.call/2
    (lightning 2.10.15) lib/lightning_web/channels/run_channel.ex:185: LightningWeb.RunChannel.handle_in/3
    (phoenix 1.7.18) lib/phoenix/channel/server.ex:322: Phoenix.Channel.Server.handle_info/2
    (stdlib 5.2.3) gen_server.erl:1095: :gen_server.try_handle_info/3

Postgres times out on step:complete for large JSON payloads, and that's the error that throws everything else into chaos.
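
For the record, one narrow mitigation (a sketch only, and assuming the slow call is the Repo write inside CompleteStep.update_step): Ecto accepts a per-call :timeout option, so the large-payload write could be given more than the default 15s without raising the global DB timeout. The changeset variable here just stands in for whatever CompleteStep actually persists:

    # Sketch: override the default 15_000ms DB timeout for this one write.
    Repo.insert(changeset, timeout: 60_000)

    # or, if the whole transaction needs the headroom:
    Repo.transaction(fn repo -> repo.insert!(changeset) end, timeout: 60_000)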

josephjclark avatar Feb 18 '25 14:02 josephjclark

this may be fixed by #2682

taylordowns2000 avatar Feb 19 '25 10:02 taylordowns2000

I'm glad the underlying issue is fixed, but I think that everything in the bullet list in the OP should still be done, like, in the next month. Should we split out issues?

josephjclark avatar Feb 20 '25 12:02 josephjclark

@theroinaochieng, @stuartc cc @taylordowns2000 @josephjclark as requested I have investigated what else we could do for this issue on Lightning.

I have done an experiment setting this on the RuntimeManager: [screenshot of the setting]

and posted a webhook request with a dataclip bigger than 1MB:

    curl -H "content-type: application/json" -d @./dataclip.json http://localhost:4000/i/cae544ab-03dc-4ccc-a09c-fb4edb255d7a

The result was that the dataclip was received by Lightning and saved: [screenshot]

A few additional considerations for the action points:

  • The dataclip was originally bigger than the limit before the Worker fetched the Run, but on staging and prod the MAX_DATACLIP_SIZE_MB for the Worker and Lightning are the same, so we would be fine at the entry door (the webhook, before the Run is claimed).
  • If the dataclip grows during the Run execution, I believe the Worker should not send the dataclip that exceeds the limit to Lightning.

On step:complete, Lightning received an output_dataclip of 1000291 bytes with MAX_DATACLIP_SIZE_MB="1":

step_complete_payload: %{
  "duration" => 1162,
  "error_message" => nil,
  "error_type" => nil,
  "job_id" => "c2d15b05-6e42-4f2f-bd95-eb8e8b7d680e",
  "mem" => %{"job" => 29252024, "system" => 164675584},
  # truncated
  "output_dataclip" => "{\"references\":[],\"data\":{\"a\":\"aaaaaaaaaaaaaaaaaa" <> ...,
  "output_dataclip_id" => "5f6c421a-5bd3-4ec2-8e17-802c01dc69ed",
  "reason" => "success",
  "step_id" => "510f8de3-bfe6-4905-85ba-984b02fd6151",
  "thread_id" => 2,
  "timestamp" => "1740055089438224"
}
step_complete_output_dataclip_size: 1000291
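
(A hedged aside, pure speculation on my part: 1000291 bytes sits between the two common readings of "1 MB", so if the Worker and Lightning don't agree on 10^6 bytes vs 1024 * 1024 bytes, this is exactly the kind of payload that slips through.)

    # Illustrative arithmetic only:
    iex> 1_000_291 > 1_000_000      # over the limit if 1 MB means 10^6 bytes
    true
    iex> 1_000_291 > 1024 * 1024    # under the limit if "1 MB" is read as 1 MiB
    false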

jyeshe avatar Feb 20 '25 12:02 jyeshe

@jyeshe payload sizes in the worker are approximated (and probably not well). I wouldn't be at all surprised if marginal differences produce unexpected behaviour - and I'm not too worried about this either.

I suppose we could mitigate this by enforcing MAX_DATACLIP_SIZE_MB * 0.95 or something to give a margin of error. I forget if we calculate in bytes or bits, but I think it's bytes 🤔 But again I'm not too worried about even a 10% tolerance.

The worker won't do anything about incoming dataclips. It's only when returning dataclips in workflow:complete or log that it'll run the check.

josephjclark avatar Feb 20 '25 14:02 josephjclark

Sure, that's why I was referring to when the 'dataclip grows' (only on the Runtime). A clear concern for the output dataclip.

One detail that cannot be seen from what I shared is that there were no non-Latin characters, which I believe covers the most common string usage by far, at least until we start with the Nordics (:

This means that the dataclip size becomes much more predictable and MAX_DATACLIP_SIZE_MB reliable without workarounds, especially because the measurement is about size and not length (length <= size).

jyeshe avatar Feb 20 '25 15:02 jyeshe

A last bit on how I think Lightning could help here: whether or not we apply MAX_DATACLIP_SIZE_MB * 0.95 on the Worker side, if we want to enforce MAX_DATACLIP_SIZE_MB with 100% precision, in Elixir we can check the byte size in constant time (beautiful, huh?). Some of the dataclips would already be filtered by the Worker, and those in the ]0.95, inf[ interval (due to chars with accents or symbols) could be discarded or truncated in Lightning.

Instead of enforcing MAX_DATACLIP_SIZE_MB * 0.95, I would only check the length on the Worker and let Lightning filter the exceeding ones, to avoid false positives.
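
To make the byte-size point concrete (iex, purely illustrative):

    iex> dataclip = ~s({"data":{"name":"Zoë"}})
    iex> byte_size(dataclip)        # constant time, counts UTF-8 bytes
    24
    iex> String.length(dataclip)    # counts graphemes, hence length <= size
    23
    iex> byte_size(dataclip) <= 1 * 1_000_000   # the kind of check Lightning could apply
    true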

jyeshe avatar Feb 21 '25 07:02 jyeshe

@jyeshe Not really sure what to do about this. The issue here isn't really that the worker emitted a payload that was too large (the payloads in question are around 4-6mb with a limit of 10mb).

Would you like me to raise an issue because a 1000291 byte payload is sent out of a worker with a 10mb limit? Happy to do so. It would be useful to have the payload in question so that we can try to reproduce. I agree that it shouldn't happen but it's gonna be very low priority for me.

This issue needs breaking up into smaller ones that can be addressed and prioritised. I'm much more concerned about how easily a run is lost because of any error on a key event, and I'd be keen to focus on building out robustness and comms on both ends.

josephjclark avatar Feb 21 '25 18:02 josephjclark

hey @josephjclark , i'm getting ready for sprint 4 (april 21) and wanted to know if this is actually still "in progress" now or if we should schedule it for sprint 4

taylordowns2000 avatar Apr 10 '25 13:04 taylordowns2000

In practice this is not In Progress. We did some stuff around it but so far as I'm concerned the core issue is still alive and no dev is in progress. I'm not even entirely sure what this issue is right now - it probably needs closing and breaking down into pieces. The discussion got a bit off track.

If you want to plan Lost-run work for sprint 4 (and you know I'm not working in engineering sprints, right?) it's probably a good idea for me to review the outstanding Lost issues and prioritise the best things to be working on (it ain't this)

I was planning for my "sprint 4" time to finally get started on the sync & replicate project

josephjclark avatar Apr 10 '25 16:04 josephjclark