trace.id is not available in fastapi exception handler
Describe the bug: We have fastapi framework, and we add apm_client for starlette to the application, we want to generate trace.id to correlate logs when an exception happens. I have structlog package for the logging and import elasticapm.handlers.structlog.structlog_processor. We set exception handlers in fastapi application for our excpected handlers and also one for Exception class to capture the ones that we do not expect. I can get trace.id in child exceptions but not in the exception handler that is comparing to root.
My goal was to have log Exception types and have log correlation in them.
I am not sure if it is bug or feature request🤷♂️ To Reproduce
- create apm agent and fastapi app, add the apm agent to application
from fastapi import FastAPI
from elasticapm.contrib.starlette import make_apm_client
from elasticapm.contrib.starlette import ElasticAPM
apm = make_apm_client(config=config)
app = FastAPI()
app.add_middleware(ElasticAPM, client=apm)
- configure structlog
import structlog
from elasticapm.handlers.structlog import structlog_processor
def processor(logger, level, log_dict):
return json.dumps(log_dict)
structlog.configure(processors=[structlog_processor, processor])
- create an child exceptions and root and child exception
class AuthException(Exception):
def __ini__(self):
super().__init__()
@app.exception_handler(Exception)
async def fallback_handler(request , exc: Exception):
getLogger().info("Uncaught exception during request handling happend")
return JSONResponse({"detail": "ROOT EXCEPTION"}, status_code=500)
@app.exception_handler(AuthException)
async def validation_exception_handler(request, exc: AuthException):
getLogger().info(f"Request Validation exception happend{sys.exc_info()}")
return JSONResponse({"detail": "CHILD EXCEPTION"}, status_code=404)
- create an endpoint to send request to see if logging formatter/processor works or not
@app.post("/")
async def post(body: Any = Body(...)):
if body == 1: raise AuthException("child exception")
if body == 2: raise ZeroDivisionError("root exception")
return JSONResponse({"Response": "test SUCCESS"}, status_code=200)
Environment (please complete the following information)
- OS: Win and Linux
- Python version: 3.7
- Framework and version [e.g. Django 2.1]: fastapi 0.63
- APM Server version: _
- Agent version: 6.*
Generally, the problem is ordering the middlewares in Starlette. https://github.com/encode/starlette/blob/master/starlette/applications.py#L83
ElasticAPM middleware is a "user middleware". The middlewares in Starlette are like wrappers over wrappers.
the ordering by response/exception processing is:
- Endpoint method controlled by decorators
@app.post(..),@app.get(..),.. -
ExceptionMiddlewarecontrolled byexception_handlerdecorator where the argument is notException - User middlewares (e.g. ElasticAPM middleware)
-
ServerErrorMiddlewarecontrolled byexception_handler(Exception)decorator (argument must beExceptionor500)
In the ordering above, you can see that ElasticAPM is under the ExceptionMiddleware. The ExceptionMiddleware overwrites the raised exceptions to JSONResponse (in your case) and the tracking information of the exception under is lost.
It might not be easy to fix the problem in this project, it may be easier to add a constructor argument to overwrite ExceptionMiddleware and ServerErrorMiddleware or define own the middleware ordering in Starlette.
👋 Just wanted to raise that I did attempt a re-ordering of middleware of my Starlette app (in this case FastAPI that extends Starlette) and unfortunately still got the same result as @realabja, where trace.id and transaction.id are missing from any logs for the handling of the error. (Other extras are still present, like event.dataset and service.name).
rough sketch (copy/paste their build_middleware_stack but put in ElasticAPM) - also to be clear, regardless of where I put the ElasticAPM middleware in this list of middleware, I get the same result.
class MyFastAPI(FastAPI):
def build_middleware_stack(self) -> ASGIApp:
debug = self.debug
error_handler = None
exception_handlers = {}
for key, value in self.exception_handlers.items():
if key in (500, Exception):
error_handler = value
else:
exception_handlers[key] = value
middleware = (
[
Middleware(ServerErrorMiddleware, handler=error_handler, debug=debug)
]
+ self.user_middleware
+ [
Middleware(
ExceptionMiddleware, handlers=exception_handlers, debug=debug
)
]
+ [Middleware(ElasticAPM, client=make_apm_client(elastic_config))]
)
app = self.router
for cls, options in reversed(middleware):
app = cls(app=app, **options)
return app
@magichair I wrote a wrapper on the router from fastapi to capture exception and do all apm stuff get trace id and transaction manuelly then reraise exception. Not the cleanest way but it works.
I'm not certain that this is a middleware-ordering issue. Because when the request starts, our middleware will get called, and start the transaction. Later, yes, the exception handling middleware will happen before ours, but the exception should still be active in the execution context. So when we fetch the transaction here it should be there, but apparently is not for some reason.
Some more investigation is in order. Sorry for the slow response here. Glad you have a workaround in the meantime.
@basepi I'm wondering if it has something to do with how the .end_transaction() occurs. Perhaps in FastAPI / event loop async there is some race condition where the transaction is ended before we get to these access logs and other logs.
It seems strange to me that the finally could occur before, but I also don't know enough about event loop and how FastAPI might be short circuiting or prioritizing things on the loop if it's an exception (just a theoretical, nothing to back up that claim).
https://github.com/elastic/apm-agent-python/blob/master/elasticapm/contrib/starlette/init.py#L185
I'll point out that the capture_exception that is sent to APM's Error page does have the trace.id and transaction id - https://github.com/elastic/apm-agent-python/blob/master/elasticapm/contrib/starlette/init.py#L176-L178 maybe the final raise ( https://github.com/elastic/apm-agent-python/blob/master/elasticapm/contrib/starlette/init.py#L183 ) here goes off back to the event loop and the finally to end the transaction occurs immediately before FastAPI and it's logging can take over and log the exception and pass it through the middlewares.
I'm not sure if it's helpful, but maybe there is some historical context around these comments in the Flask module - https://github.com/elastic/apm-agent-python/blob/master/elasticapm/contrib/flask/init.py#L199-L201 perhaps they too ran into some race condition problems and had to defer the end_transaction until later. I'm not sure how that maps in a FastAPI world yet, but I figure it might help.
Sorry for the slow response here.
No worries whatsoever. We're all doing the best we can. I'm happy to help in any way I can.
Hey team! Do we have any way to handle this? This occurs with Django as well!
@marcoaleixo I haven't dug deep into this issue but I'm expecting the fix to be fairly FastAPI-specific. Would you mind opening a new issue for Django with some details?
@basepi any chance this could get picked up for next release? We're really struggling with this, as yes, there is trace.id in our ECS logs for almost everything except the really important logs that happen during the Starlette exception handling. So when we query in our logging for a particular trace.id it all looks ok, when there are clearly other errors occurring that are missing the trace. My team members are looking at adding, yet another distributed trace header into the mix - https://github.com/tomwojcik/starlette-context to remediate this.
Or any ideas on where I could jump in to help investigate? It's been awhile since I first hopped in and had my hands deep in the integration code here.
edit: I spent some time getting back in to this. I actually think that my problem is not in the FastAPI exception handler, but actually in the uvicorn.error logger.
https://github.com/encode/uvicorn/blob/c3c2db3f9ed8e331c6757c9c302ef21cf7338848/uvicorn/protocols/http/httptools_impl.py#L403-L413
In this block, the entire FastAPI app and all its middlewares all execute in the result = await app(...) call. So by the time we hit the unhandled "Exception in ASGI Application" much is lost, and specifically in this case the apm-agent-python will have already end_transaction() and therefore we wont have any trace information at this stage.
I think what I may end up needing to do is just add an exception logging middleware to print this same sort 5xx exception fallback, but within FastAPI
@magichair I'm not sure how we missed this in our 8.6 planning, but I'm going to take a look. :)
Edit: Oh, wait, just saw your edit. That sounds plausible. I'm still going to look into this, though, we should support handling the base Exception.
(I think) I found the root cause of our troubles. It mostly stems from: https://github.com/tiangolo/fastapi/blob/0.85.1/fastapi/applications.py#L156-L160
for key, value in self.exception_handlers.items():
if key in (500, Exception):
error_handler = value
else:
exception_handlers[key] = value
Here, FastAPI is building the middleware. It actually harkens back to my previous attempt at manipulating this (I just did it in the wrong order), https://github.com/elastic/apm-agent-python/issues/1110#issuecomment-938125906
The catch-all FastAPI exception handler that we register
app.exception_handler(Exception)(uncaught_exception_handler)
is getting "special cased" and pulled out of the list of exception_handlers and instead becomes the error_handler for the overarching ServerErrorMiddleware.
https://github.com/tiangolo/fastapi/blob/0.85.1/fastapi/applications.py#L162-L168
middleware = (
[Middleware(ServerErrorMiddleware, handler=error_handler, debug=debug)]
+ self.user_middleware
+ [
Middleware(
ExceptionMiddleware, handlers=exception_handlers, debug=debug
),
The self.user_middleware is where ElasticAPM ends up and as such, finishes and does all of its finally work like end_transaction() before the ServerErrorMiddleware kicks in and does its work.
@magichair Great research, thank you.
Perhaps we need to instrument ServerErrorMiddleware to make sure the transaction is available at that point. I'll have to think on this and dive the code a bit.
It appears that both Starlette and FastAPI automatically use ServerErrorMiddleware as the outer-most middleware. However, only FastAPI monkeys with the Exception handling. One potential solution is to monkeypatch that middleware class to do our end_transaction() there, instead of in the user middleware. But that's a bit more brittle than what we're doing now since we'd be reaching into their code instead of sticking to our middleware.
It still may be the best option. It would be a really simple wrapper and I don't know how to send the transaction context forward otherwise. 🤔
Just kidding, Starlette does it too so this will be good to get fixed: https://github.com/encode/starlette/blob/3453fd6b6259393f3744815b49af297c934f32d0/starlette/applications.py#L85-L99
I have a POC in #1674 but it's not working yet. My code isn't being hit, still trying to figure out why.
Now that's the APM monkeypatching magic I was looking for! Here I am mitigating over in user land 😛 Thank you for jumping in on this, I'm happy to help test in any way I can.