starlette icon indicating copy to clipboard operation
starlette copied to clipboard

Replace task cancellation in `BaseHTTPMiddleware` with `http.disconnect`+`recv_stream.close`

Open jhominal opened this issue 3 years ago • 14 comments

  • [x] This is a followup on the issues that are the root cause for #1438
  • [x] This pull request is also a followup on the discussion surrounding task cancellation and handling of unfinished/deadlocked coroutines in the comments on #1697

There have been various issues related to the way that BaseHTTPMiddleware works, and in particular, related to the way that it uses task cancellation in order to force the downstream ASGI app to shut down.

In particular, I would like to highlight that:

  • In a series of issues related to the way that BaseHTTPMiddleware, cancellation is used in order to get out of a deadlock between a call to send_stream.send that never finishes and blocks a downstream app from completing, and a call to recv_stream.receive that is never made because the body_stream iterator has been orphaned by the dispatch method implementation;
  • In #1697, @florimondmanca expresses his opinion that "AFAIK there’s no way in the ASGI spec to just discard a streaming body even though more_body is True. In fact I believe the spec says servers MUST pull until that becomes False." - in other words, using task cancellation on a downstream app may not be permitted by ASGI. I would note that, as ASGI is agnostic to the async framework in use, and as e.g. native asyncio and trio have very different cancellation semantics, that may be the only practical position that ASGI can take;

I believe that this pull request outlines a solution that avoids both using task cancellation on the downstream ASGI application, and also avoids deadlocking the downstream ASGI application. It does that by replacing task cancellation with the following features:

  • Instead of triggering task cancellation on the task group that runs the downstream ASGI application, an anyio.Event app_disconnected is set, that triggers the following consequences:
    • It hooks the receive function that is passed to the downstream ASGI application so that, if app_disconnected is set, a message with type http.disconnect will be returned (instead of waiting for the next message from request.receive);
    • It closes the recv_stream, which has the effect of removing the root cause of the known deadlock issue. However, as send_stream.send raises an anyio.BrokenResourceError in that case, we need to wrap send_stream.send before passing it to the downstream ASGI application;

What are the consequences of that proposed change?

  1. We cannot rely on task cancellation to force the downstream ASGI application to stop its processing. However, as task cancellation is a behavior that only happens when a BaseHTTPMiddleware is found in an ASGI middleware chain, and that is completely unexpected by ASGI applications (even Starlette's own response classes do not take that possibility in account), that may not be an actual loss of functionality;
  2. By sending a {"type":"http.disconnect"} message, we allow ASGI applications that listen to that event (such as Starlette's StreamingResponse) to potentially wrap up their processing once it becomes clear that their work will be discarded and start cleanup and finalization (e.g. something may be added to FileResponse);
  3. By closing the recv_stream once the response has been completely sent, we remove the root cause of the deadlock, and ensure that it cannot happen even e.g. in the face of a fully cancellation-shielded application;

There are two larger points that concern conformance to the ASGI specification / expected behavior (however, I do not know even with whom we could raise these points):

  • Should we confirm whether task cancellation can be used (or not) to interrupt a running ASGI application?
  • Should we confirm whether the usage of {"type":"http.disconnect"}, in order to signal to the application "Your link with the HTTP client has been severed, there is no need to continue producing http.response.body messages", is semantically compatible with what ASGI specifies?

~~This PR is a draft because, as I did not have validation from the maintainers to propose such a change, I did not add the tests that would prove that various issues (some of them the same as those fixed by the many other PRs on the same subject) are actually fixed by my proposal. In case any maintainer expresses interest in actually trying to integrate this proposal, I will gladly work to complete this PR with the necessary tests.~~

This PR is not a draft anymore, and:

  • fixes #919
  • fixes #1438
  • closes #1699
  • closes #1700
  • closes #1710

jhominal avatar Jun 29 '22 23:06 jhominal

@jhominal Hey, thanks a lot for this. I think I would be very interested in seeing what the tests look like. Might help me better grok what practical situations this PR would help address.

About:

@florimondmanca expresses his opinion that "AFAIK there’s no way in the ASGI spec to just discard a streaming body even though more_body is True. In fact I believe the spec says servers MUST pull until that becomes False." - in other words, using task cancellation on a downstream app may not be permitted by ASGI.

This came from what I would call an interpretation of the spec, which reads (emphasis mine):

https://asgi.readthedocs.io/en/latest/specs/www.html#request-receive-event

more_body (bool) – Signifies if there is additional content to come (as part of a Request message). If True, the consuming application should wait until it gets a chunk with this set to False. If False, the request is complete and should be processed. Optional; if missing defaults to False.

I interpret "you should consume as long as more_body is True" as implying that "you should not interrupt the application until more_body becomes False", with the sense that "cancellation" is a form of "interruption".

So, as for...

  • Should we confirm whether task cancellation can be used (or not) to interrupt a running ASGI application?

My interpretation (although the spec doesn't mention any notion of "cancellation") would be, no.

As for http.disconnect, the spec reads:

https://asgi.readthedocs.io/en/latest/specs/www.html#disconnect-receive-event

Sent to the application when a HTTP connection is closed or if receive is called after a response has been sent. This is mainly useful for long-polling, where you may want to trigger cleanup code if the connection closes early.

Clearly, we are in the second situation ("if receive is called after a response has been set"), because we now have this:

https://github.com/encode/starlette/blob/c98203edcb4e210f630f5a39e2ed453561b537a0/starlette/middleware/base.py#L120-L121

So the answer to...

  • Should we confirm whether the usage of {"type":"http.disconnect"}, in order to signal to the application "Your link with the HTTP client has been severed, there is no need to continue producing http.response.body messages", is semantically compatible with what ASGI specifies?

Seems to be: yes, the http.disconnect usage here seems very appropriate. Promising!

We might want to make this even clearer by renaming the app_disconnected event to response_sent.

florimondmanca avatar Jul 01 '22 17:07 florimondmanca

@jhominal Am I correct saying this would be an alternative to #1710?

florimondmanca avatar Jul 01 '22 17:07 florimondmanca

@florimondmanca Thank you for your reply! I am going to work on the test cases - at least a few of them will be similar to existing issues/pull requests.

But yes, it is my belief that this proposal has the potential to be an alternative to #1710, #1700 (which fixes the "background tasks are canceled" issue by pushing them outside of the response processing), #1699, #1441 (which was closed yesterday).

I believe it would fix at least #1438, which is the issue I raised a little time ago.

jhominal avatar Jul 01 '22 18:07 jhominal

This does look very nice, I do think removing the cancellation is a step in the right direction.

this proposal has the potential to be an alternative to https://github.com/encode/starlette/pull/1700 (which fixes the "background tasks are canceled" issue by pushing them outside of the response processing)

Did you identify any issues with that proposal so that we can have them in a comment for posteriority / we can close that PR?

adriangb avatar Jul 02 '22 05:07 adriangb

I will just comment on the two issues that I believe are still open and would be directly affected by this PR:

  • #1438 would be fixed by this PR, as shown by the added test (thanks to @adriangb for his own version of that test in #1700);
  • For #919, I have not been able to add a substantial test case, as I have not been able to reproduce the headline issue as described by the issue reporter, either on current starlette versions, or by installing starlette==0.13.2+uvicorn==0.11.4 - I will make a more detailed comment on that issue;

jhominal avatar Jul 02 '22 05:07 jhominal

I tried to add a test case for this in https://github.com/encode/starlette/blob/f5cb1ee53d569bba5ecaf8ca449fa0b59ea21e0f/tests/middleware/test_base.py#L265-L293

The original issue says "subsequent ones are not processed until the 10 second sleep has finished (the first request returns before then though)". I was not able to reproduce that specific outcome, but I was able to reproduce the BackgroundTask not running when there was a BaseHTTPMiddleware present. I think it is the same bug, the behavior has just changed slightly over time.

adriangb avatar Jul 02 '22 06:07 adriangb

@adriangb: I have looked at your test cases while implementing mine. However, after porting and reviewing a copy of the #919 test, I ~~saw~~ thought that the only substantial difference between my test cases for #1438 and #919 was that one test was expressed in pure ASGI terms, while the other was expressed using the TestClient (and relies on the fact that TestClient sends http.disconnect immediately after receiving the response - but I am of the opinion that TestClient should not be relied on for ASGI-level details). Because of that, I took out the test for #919.

As some of the issues reported on #919 were at a time when Starlette was implemented directly on top of asyncio, and given the difference that using anyio makes to the usable primitives, it is extremely likely that these issues cannot be reproduced anymore.

jhominal avatar Jul 02 '22 06:07 jhominal

Hmm good point, looking at my test now I think that may be the case. I'll dig deeper tomorrow, thank you for combing through them.

adriangb avatar Jul 02 '22 06:07 adriangb

Hmm good point, looking at my test now I think that may be the case. I'll dig deeper tomorrow, thank you for combing through them.

I just reread your test for #1438 and I realize that actually, it has a difference (which is that the background task waits on the disconnected event). I will admit that, as I had already implemented the test for #1438 before reading yours, I did not read that test case that closely - I mostly compared my version of #1438 and my version of #919 (based on yours).

jhominal avatar Jul 02 '22 06:07 jhominal

I have just added a test to check for the issue reported by @kissgyorgy on https://github.com/encode/starlette/issues/1678#issuecomment-1172916042 - about the way that BaseHTTPMiddleware cancellation prevents the context manager defined in the middleware from running its exit function.

Fixing this test did not require any modification to the PR's modifications of base.py.

jhominal avatar Jul 06 '22 20:07 jhominal

@florimondmanca As you previously expressed interest in this PR, I chose to request you as a reviewer. Please feel free to suggest anyone else who you think should also take a look at this.

jhominal avatar Jul 07 '22 05:07 jhominal

I agree that this PR is languishing a bit, so I have decided to request review from all encode maintainers, in the hope that more people ~~can~~ will take a look at this.

jhominal avatar Aug 16 '22 09:08 jhominal

I'll check tomorrow. I've added this to the checklist for the next release.

Kludex avatar Sep 06 '22 11:09 Kludex

@adriangb @kludex Thank you very much for taking the time to look at it! (I have not been in Gitter much recently so I do not know if you have talked about this PR over there)

jhominal avatar Sep 06 '22 11:09 jhominal

I'll let you merge @jhominal :+1:

Thanks for this, and sorry for taking long to review.

Beautiful code. Well thought. :+1:

Kludex avatar Sep 21 '22 20:09 Kludex

I'll merge this, since it's the only PR missing for the release.

Kludex avatar Sep 24 '22 05:09 Kludex