[TRI-967] Some JobRunExecutions are getting stuck in "STARTED" state even though the graphile job has an error

Open • ericallam opened this issue 2 years ago • 1 comment

The graphile job shows an error of Response: 404, but the JobRunExecution is stuck in the STARTED state. It should have failed and caused the run to fail as well. This needs to be investigated.

_TRI-967

Aug 09 '23 11:08 ericallam

@ericallam When the server fails to connect to the endpoint, we throw an error to the graphile worker job so that graphile worker handles the retry according to the maxAttempts parameter. If the server still can't establish a connection after all retry attempts, graphile worker marks the job as failed, but we never update the status of the job run, so it stays at its initial value of STARTED.

https://github.com/triggerdotdev/trigger.dev/blob/d1ecd6b99cb491640a2d8f24579a436737104cb0/apps/webapp/app/services/runs/performRunExecutionV2.server.ts#L260-L262

https://github.com/triggerdotdev/trigger.dev/blob/d1ecd6b99cb491640a2d8f24579a436737104cb0/apps/webapp/app/services/runs/performRunExecutionV2.server.ts#L772-L774

When #337 is done, we can handle the retry as part of the run execution instead of letting the graphile worker do it for us.
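
A minimal sketch of one way the stuck state could be avoided in the meantime: record the failure on the final attempt before rethrowing to the worker. This is not the code in performRunExecutionV2.server.ts. The `JobRunExecution` shape, the status names, `performAttempt`, and `callEndpoint` are all hypothetical stand-ins, and the sketch assumes the handler knows the current attempt number and maxAttempts.

```typescript
// Hypothetical status values and record shape; the real models may differ.
type ExecutionStatus = "PENDING" | "STARTED" | "SUCCESS" | "FAILURE";

interface JobRunExecution {
  id: string;
  status: ExecutionStatus;
  error?: string;
}

// Runs a single attempt of an execution.
// `callEndpoint` stands in for the HTTP request to the user's endpoint.
async function performAttempt(
  execution: JobRunExecution,
  attempt: number,
  maxAttempts: number,
  callEndpoint: () => Promise<void>
): Promise<void> {
  execution.status = "STARTED";
  try {
    await callEndpoint();
    execution.status = "SUCCESS";
  } catch (err) {
    if (attempt >= maxAttempts) {
      // Final attempt: persist the failure so the execution does not stay in STARTED.
      execution.status = "FAILURE";
      execution.error = err instanceof Error ? err.message : String(err);
    }
    // Rethrow so the job queue still records the failed attempt and retries if any remain.
    throw err;
  }
}
```

On earlier attempts the status is left at STARTED, since a retry is still pending. Only the last attempt writes the terminal status, so the worker's retry handling is unchanged.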

Oct 18 '23 10:10 hmacr