Run predictions off main thread to avoid blocking health check

Open ggilder opened this issue 1 year ago • 1 comment

Fixes https://github.com/replicate/cog/issues/1719

Defining the prediction endpoints with `async def` runs them on the main thread (the event loop), per the FastAPI docs. This is problematic because a long-running prediction blocks the server from responding to the health check endpoint. Converting them to plain `def` lets FastAPI run them in a worker thread pool, so health checks keep being served. This fixes the problem I described in the above issue.
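To illustrate the mechanism, here is a minimal standard-library sketch, not cog's actual code. It assumes a hypothetical `blocking_predict` function standing in for a prediction, and uses a short `asyncio.sleep` as a stand-in for a health check request. It compares how long the "health check" waits when the blocking call runs directly on the event loop versus in a worker thread, which is roughly what FastAPI does for plain `def` endpoints.

```python
import asyncio
import time
from typing import Awaitable, Callable, Tuple


def blocking_predict(seconds: float) -> str:
    """Hypothetical stand-in for a long-running, blocking prediction."""
    time.sleep(seconds)
    return "done"


async def on_event_loop() -> str:
    # Like an `async def` endpoint that makes a blocking call:
    # nothing else on the loop can run until it returns.
    return blocking_predict(0.3)


async def in_worker_thread() -> str:
    # Roughly what happens for a plain `def` endpoint:
    # the blocking call is handed to a worker thread.
    return await asyncio.to_thread(blocking_predict, 0.3)


async def health_check_latency(
    run_prediction: Callable[[], Awaitable[str]],
) -> float:
    """Start a prediction and time a tiny 'health check' issued alongside it."""
    started = time.perf_counter()
    prediction = asyncio.ensure_future(run_prediction())
    await asyncio.sleep(0.01)  # stand-in for serving a health check
    latency = time.perf_counter() - started
    await prediction
    return latency


def measure() -> Tuple[float, float]:
    """Return (latency when blocked, latency when the loop stays free)."""
    blocked = asyncio.run(health_check_latency(on_event_loop))
    free = asyncio.run(health_check_latency(in_worker_thread))
    return blocked, free


if __name__ == "__main__":
    blocked, free = measure()
    print(f"blocking call on event loop: health check waited {blocked:.2f}s")
    print(f"blocking call in a thread:   health check waited {free:.2f}s")
```

In the first case the health check cannot complete until the prediction finishes. In the second it completes almost immediately while the prediction continues in the background.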

Jun 06 '24 21:06 ggilder

One side effect of this, which may or may not be desirable depending on your perspective, is that a prediction request sent to an instance that is already running a prediction now fails with status code 409 and a “currently running a prediction” message, rather than effectively being queued by uvicorn. I think this is generally desirable, since a retried request could succeed (e.g. when multiple instances are available behind a load balancer).
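The reject-instead-of-queue behavior can be sketched as follows. This is an illustrative assumption about the general pattern, not cog's implementation: `handle_prediction` and the `_busy` lock are hypothetical names, and the function returns a `(status, body)` tuple instead of a real HTTP response.

```python
import threading
from typing import Callable, Tuple

# Hypothetical guard: held for as long as a prediction is running.
_busy = threading.Lock()


def handle_prediction(predict: Callable[[], str]) -> Tuple[int, str]:
    """Run `predict` unless another prediction is in flight.

    Returns a (status code, body) pair; 409 means the caller should retry,
    possibly against a different instance.
    """
    # Non-blocking acquire: fail fast rather than wait in line.
    if not _busy.acquire(blocking=False):
        return 409, "currently running a prediction"
    try:
        return 200, predict()
    finally:
        _busy.release()
```

A client or load balancer that receives the 409 can retry immediately against another instance, whereas a queued request would wait behind the running prediction on the same instance.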

Jun 07 '24 19:06 ggilder