apm-agent-python icon indicating copy to clipboard operation
apm-agent-python copied to clipboard

trace.id is not available in fastapi exception handler

Open abi-jey opened this issue 4 years ago • 7 comments

Describe the bug: We have fastapi framework, and we add apm_client for starlette to the application, we want to generate trace.id to correlate logs when an exception happens. I have structlog package for the logging and import elasticapm.handlers.structlog.structlog_processor. We set exception handlers in fastapi application for our excpected handlers and also one for Exception class to capture the ones that we do not expect. I can get trace.id in child exceptions but not in the exception handler that is comparing to root.

My goal was to have log Exception types and have log correlation in them.

I am not sure if it is bug or feature request🤷‍♂️ To Reproduce

  1. create apm agent and fastapi app, add the apm agent to application
from fastapi import FastAPI
from elasticapm.contrib.starlette import make_apm_client
from elasticapm.contrib.starlette import ElasticAPM
apm = make_apm_client(config=config)
app = FastAPI()
app.add_middleware(ElasticAPM, client=apm)
  1. configure structlog
import structlog
from elasticapm.handlers.structlog import structlog_processor
def processor(logger, level, log_dict):
    return json.dumps(log_dict)

structlog.configure(processors=[structlog_processor, processor])

  1. create an child exceptions and root and child exception
class AuthException(Exception):
    def __ini__(self):
        super().__init__()


@app.exception_handler(Exception)
async def fallback_handler(request , exc: Exception):
    getLogger().info("Uncaught exception during request handling happend")
    return JSONResponse({"detail": "ROOT EXCEPTION"}, status_code=500)

@app.exception_handler(AuthException)
async def validation_exception_handler(request, exc: AuthException):
    getLogger().info(f"Request Validation exception happend{sys.exc_info()}")
    return JSONResponse({"detail": "CHILD EXCEPTION"}, status_code=404)
  1. create an endpoint to send request to see if logging formatter/processor works or not
@app.post("/")
async def post(body: Any = Body(...)):
    if body == 1: raise AuthException("child exception")
    if body == 2: raise ZeroDivisionError("root exception")
    return JSONResponse({"Response": "test SUCCESS"}, status_code=200)

Environment (please complete the following information)

  • OS: Win and Linux
  • Python version: 3.7
  • Framework and version [e.g. Django 2.1]: fastapi 0.63
  • APM Server version: _
  • Agent version: 6.*

abi-jey avatar Apr 22 '21 16:04 abi-jey

Generally, the problem is ordering the middlewares in Starlette. https://github.com/encode/starlette/blob/master/starlette/applications.py#L83

ElasticAPM middleware is a "user middleware". The middlewares in Starlette are like wrappers over wrappers.

the ordering by response/exception processing is:

  1. Endpoint method controlled by decorators @app.post(..), @app.get(..),..
  2. ExceptionMiddleware controlled by exception_handler decorator where the argument is not Exception
  3. User middlewares (e.g. ElasticAPM middleware)
  4. ServerErrorMiddleware controlled by exception_handler(Exception) decorator (argument must be Exception or 500)

In the ordering above, you can see that ElasticAPM is under the ExceptionMiddleware. The ExceptionMiddleware overwrites the raised exceptions to JSONResponse (in your case) and the tracking information of the exception under is lost.

It might not be easy to fix the problem in this project, it may be easier to add a constructor argument to overwrite ExceptionMiddleware and ServerErrorMiddleware or define own the middleware ordering in Starlette.

Sparkycz avatar May 07 '21 09:05 Sparkycz

👋 Just wanted to raise that I did attempt a re-ordering of middleware of my Starlette app (in this case FastAPI that extends Starlette) and unfortunately still got the same result as @realabja, where trace.id and transaction.id are missing from any logs for the handling of the error. (Other extras are still present, like event.dataset and service.name).

rough sketch (copy/paste their build_middleware_stack but put in ElasticAPM) - also to be clear, regardless of where I put the ElasticAPM middleware in this list of middleware, I get the same result.

class MyFastAPI(FastAPI):
    def build_middleware_stack(self) -> ASGIApp:
        debug = self.debug
        error_handler = None
        exception_handlers = {}

        for key, value in self.exception_handlers.items():
            if key in (500, Exception):
                error_handler = value
            else:
                exception_handlers[key] = value

        middleware = (
            [
                Middleware(ServerErrorMiddleware, handler=error_handler, debug=debug)
            ]
            + self.user_middleware
            + [
                Middleware(
                    ExceptionMiddleware, handlers=exception_handlers, debug=debug
                )
            ]
            + [Middleware(ElasticAPM, client=make_apm_client(elastic_config))]
        )

        app = self.router
        for cls, options in reversed(middleware):
            app = cls(app=app, **options)
        return app

magichair avatar Oct 07 '21 20:10 magichair

@magichair I wrote a wrapper on the router from fastapi to capture exception and do all apm stuff get trace id and transaction manuelly then reraise exception. Not the cleanest way but it works.

abi-jey avatar Oct 07 '21 21:10 abi-jey

I'm not certain that this is a middleware-ordering issue. Because when the request starts, our middleware will get called, and start the transaction. Later, yes, the exception handling middleware will happen before ours, but the exception should still be active in the execution context. So when we fetch the transaction here it should be there, but apparently is not for some reason.

Some more investigation is in order. Sorry for the slow response here. Glad you have a workaround in the meantime.

basepi avatar Oct 13 '21 15:10 basepi

@basepi I'm wondering if it has something to do with how the .end_transaction() occurs. Perhaps in FastAPI / event loop async there is some race condition where the transaction is ended before we get to these access logs and other logs. It seems strange to me that the finally could occur before, but I also don't know enough about event loop and how FastAPI might be short circuiting or prioritizing things on the loop if it's an exception (just a theoretical, nothing to back up that claim). https://github.com/elastic/apm-agent-python/blob/master/elasticapm/contrib/starlette/init.py#L185

I'll point out that the capture_exception that is sent to APM's Error page does have the trace.id and transaction id - https://github.com/elastic/apm-agent-python/blob/master/elasticapm/contrib/starlette/init.py#L176-L178 maybe the final raise ( https://github.com/elastic/apm-agent-python/blob/master/elasticapm/contrib/starlette/init.py#L183 ) here goes off back to the event loop and the finally to end the transaction occurs immediately before FastAPI and it's logging can take over and log the exception and pass it through the middlewares.

I'm not sure if it's helpful, but maybe there is some historical context around these comments in the Flask module - https://github.com/elastic/apm-agent-python/blob/master/elasticapm/contrib/flask/init.py#L199-L201 perhaps they too ran into some race condition problems and had to defer the end_transaction until later. I'm not sure how that maps in a FastAPI world yet, but I figure it might help.

Sorry for the slow response here.

No worries whatsoever. We're all doing the best we can. I'm happy to help in any way I can.

magichair avatar Oct 14 '21 13:10 magichair

Hey team! Do we have any way to handle this? This occurs with Django as well!

marcoaleixo avatar Aug 26 '22 14:08 marcoaleixo

@marcoaleixo I haven't dug deep into this issue but I'm expecting the fix to be fairly FastAPI-specific. Would you mind opening a new issue for Django with some details?

basepi avatar Aug 26 '22 16:08 basepi

@basepi any chance this could get picked up for next release? We're really struggling with this, as yes, there is trace.id in our ECS logs for almost everything except the really important logs that happen during the Starlette exception handling. So when we query in our logging for a particular trace.id it all looks ok, when there are clearly other errors occurring that are missing the trace. My team members are looking at adding, yet another distributed trace header into the mix - https://github.com/tomwojcik/starlette-context to remediate this.

Or any ideas on where I could jump in to help investigate? It's been awhile since I first hopped in and had my hands deep in the integration code here.

edit: I spent some time getting back in to this. I actually think that my problem is not in the FastAPI exception handler, but actually in the uvicorn.error logger. https://github.com/encode/uvicorn/blob/c3c2db3f9ed8e331c6757c9c302ef21cf7338848/uvicorn/protocols/http/httptools_impl.py#L403-L413 In this block, the entire FastAPI app and all its middlewares all execute in the result = await app(...) call. So by the time we hit the unhandled "Exception in ASGI Application" much is lost, and specifically in this case the apm-agent-python will have already end_transaction() and therefore we wont have any trace information at this stage. I think what I may end up needing to do is just add an exception logging middleware to print this same sort 5xx exception fallback, but within FastAPI

magichair avatar Oct 17 '22 23:10 magichair

@magichair I'm not sure how we missed this in our 8.6 planning, but I'm going to take a look. :)

Edit: Oh, wait, just saw your edit. That sounds plausible. I'm still going to look into this, though, we should support handling the base Exception.

basepi avatar Oct 18 '22 19:10 basepi

(I think) I found the root cause of our troubles. It mostly stems from: https://github.com/tiangolo/fastapi/blob/0.85.1/fastapi/applications.py#L156-L160

        for key, value in self.exception_handlers.items():
            if key in (500, Exception):
                error_handler = value
            else:
                exception_handlers[key] = value

Here, FastAPI is building the middleware. It actually harkens back to my previous attempt at manipulating this (I just did it in the wrong order), https://github.com/elastic/apm-agent-python/issues/1110#issuecomment-938125906

The catch-all FastAPI exception handler that we register

app.exception_handler(Exception)(uncaught_exception_handler)

is getting "special cased" and pulled out of the list of exception_handlers and instead becomes the error_handler for the overarching ServerErrorMiddleware. https://github.com/tiangolo/fastapi/blob/0.85.1/fastapi/applications.py#L162-L168

        middleware = (
            [Middleware(ServerErrorMiddleware, handler=error_handler, debug=debug)]
            + self.user_middleware
            + [
                Middleware(
                    ExceptionMiddleware, handlers=exception_handlers, debug=debug
                ),

The self.user_middleware is where ElasticAPM ends up and as such, finishes and does all of its finally work like end_transaction() before the ServerErrorMiddleware kicks in and does its work.

magichair avatar Oct 18 '22 23:10 magichair

@magichair Great research, thank you.

Perhaps we need to instrument ServerErrorMiddleware to make sure the transaction is available at that point. I'll have to think on this and dive the code a bit.

basepi avatar Oct 19 '22 18:10 basepi

It appears that both Starlette and FastAPI automatically use ServerErrorMiddleware as the outer-most middleware. However, only FastAPI monkeys with the Exception handling. One potential solution is to monkeypatch that middleware class to do our end_transaction() there, instead of in the user middleware. But that's a bit more brittle than what we're doing now since we'd be reaching into their code instead of sticking to our middleware.

It still may be the best option. It would be a really simple wrapper and I don't know how to send the transaction context forward otherwise. 🤔

basepi avatar Oct 19 '22 19:10 basepi

Just kidding, Starlette does it too so this will be good to get fixed: https://github.com/encode/starlette/blob/3453fd6b6259393f3744815b49af297c934f32d0/starlette/applications.py#L85-L99

basepi avatar Oct 19 '22 19:10 basepi

I have a POC in #1674 but it's not working yet. My code isn't being hit, still trying to figure out why.

basepi avatar Oct 19 '22 20:10 basepi

Now that's the APM monkeypatching magic I was looking for! Here I am mitigating over in user land 😛 Thank you for jumping in on this, I'm happy to help test in any way I can.

magichair avatar Oct 19 '22 20:10 magichair