opentelemetry-python

Running spans aren't ended when process terminates

Open Helveg opened this issue 2 months ago • 3 comments

Describe your environment

OS: Ubuntu
Python version: 3.12.0
SDK version: 1.38.0
API version: 1.38.0

What happened?

When an exception interrupts a running stack of with tracer.start_as_current_span(...): blocks, every current span is finalized and ended correctly. When a signal terminates the process, this is not the case. For example:

    with tracer.start_as_current_span("normal_run_example") as span:
        span.set_attribute("example.attribute", "normal_execution")
        raise Exception("Normal Exception")

This records the exception, ends the span, shuts down the tracer provider, and makes sure everything is exported before shutdown. This does not:

    with tracer.start_as_current_span("normal_run_example") as span:
        span.set_attribute("example.attribute", "normal_execution")
        time.sleep(0.2)
        with tracer.start_as_current_span("nested_run_example") as span2:
            time.sleep(0.5)
            os.kill(os.getpid(), signal.SIGTERM)
            time.sleep(10)

Steps to Reproduce

See the examples above.

Expected Result

Even when the process receives a signal, the spans should be ended and the provider shut down, similar to the precautions already taken with atexit handlers.
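
To illustrate why atexit-based cleanup alone does not cover this case, here is a minimal stdlib-only sketch (no OpenTelemetry involved). It assumes a POSIX system; the names CHILD and run_child are hypothetical helpers made up for this illustration. A child process registers an atexit handler and then either raises an exception or sends itself SIGTERM with the default disposition.

```python
import signal
import subprocess
import sys
import textwrap

# Source of the child process. It registers an atexit handler, then either
# raises an ordinary exception or kills itself with SIGTERM.
CHILD = textwrap.dedent("""
    import atexit, os, signal, sys
    atexit.register(lambda: print("atexit ran", flush=True))
    if sys.argv[1] == "signal":
        # Default SIGTERM disposition: the OS terminates the process,
        # so the interpreter never gets a chance to run atexit handlers.
        os.kill(os.getpid(), signal.SIGTERM)
    else:
        # An uncaught exception still goes through normal interpreter exit.
        raise Exception("Normal Exception")
""")


def run_child(mode):
    """Run the child in the given mode ("exception" or "signal")."""
    return subprocess.run(
        [sys.executable, "-c", CHILD, mode],
        capture_output=True,
        text=True,
    )


if __name__ == "__main__":
    for mode in ("exception", "signal"):
        result = run_child(mode)
        print(mode, "-> atexit ran:", "atexit ran" in result.stdout)
```

In the "exception" run the atexit handler fires during normal interpreter shutdown; in the "signal" run the process is terminated by the OS before any Python-level cleanup, which matches the missing spans described above.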

Actual Result

The running trace does not arrive at the collector.

Additional context

Currently, the token needed to restore the context is only accessible inside opentelemetry.trace.use_span. If that token were attached to the created spans as a private attribute, a signal handler could do the following:

import signal

from opentelemetry import trace
from opentelemetry.context import detach


def shutdown_otel(signum=None):
    # Gracefully end the current span hierarchy if it is still running
    curr = trace.get_current_span()
    while curr and curr.is_recording():
        curr.end()
        token = getattr(curr, "_ctx_token", None) # <-- attribute set in `use_span`
        if token:
            detach(token)
        curr = trace.get_current_span()

    # Gracefully shut down the tracer provider
    # (`provider` is assumed to be the SDK TracerProvider configured at startup)
    try:
        provider.shutdown()
    except Exception:
        pass

    # Restore the previous signal handler and re-raise the signal
    if signum is not None:
        signal.signal(signum, prev_handlers[signum])
        signal.raise_signal(signum)


# Run shutdown on interceptable termination signals
prev_handlers = {}
for s in ("SIGINT", "SIGTERM", "SIGHUP"):
    sig = getattr(signal, s, None)
    if sig is None:
        continue
    prev_handlers[sig] = signal.signal(sig, lambda signum, frame, _s=sig: shutdown_otel(_s))

Would you like to implement a fix?

Yes

Helveg, Nov 20 '25 14:11

Does this not work for you?

import os
import signal

class TerminationError(Exception):
    pass

def handler(sig, frame):
    raise TerminationError()

signal.signal(signal.SIGTERM, handler)

with tracer.start_as_current_span("normal_run_example") as span:
    span.set_attribute("example.attribute", "normal_execution")
    os.kill(os.getpid(), signal.SIGTERM)
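
For anyone who wants to check the mechanism without a collector, the following is a rough stdlib-only sketch of the same idea. fake_span and run are hypothetical stand-ins invented for this illustration (fake_span plays the role of tracer.start_as_current_span); it only shows that an exception raised from a Python signal handler unwinds nested with blocks, not how the SDK behaves internally.

```python
import contextlib
import signal

ended = []  # names of stand-in spans whose exit code has run


@contextlib.contextmanager
def fake_span(name):
    # Hypothetical stand-in for tracer.start_as_current_span(name):
    # the finally clause plays the role of span.end().
    try:
        yield name
    finally:
        ended.append(name)


class TerminationError(Exception):
    pass


def handler(signum, frame):
    # Raised in the main thread at the point where the signal is handled.
    raise TerminationError()


def run():
    """Send SIGTERM inside nested spans and report which spans were ended."""
    previous = signal.signal(signal.SIGTERM, handler)
    try:
        with fake_span("outer"):
            with fake_span("inner"):
                signal.raise_signal(signal.SIGTERM)
    except TerminationError:
        return list(ended)
    finally:
        # Restore whatever handler was installed before.
        signal.signal(signal.SIGTERM, previous)


if __name__ == "__main__":
    print(run())
```

Both stand-in spans are closed, innermost first, because the handler's exception propagates through the with blocks like any other exception. Note that signal.signal can only be called from the main thread.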

herin049, Nov 21 '25 20:11

Oh, that does work, yes. Is that expected behavior in Python?

Helveg, Nov 22 '25 09:11

@Helveg See: https://docs.python.org/3/library/signal.html#execution-of-python-signal-handlers

I would stick with this approach and see how it works in practice. It's possible there might be some hidden bugs due to the fact that signal-based exceptions like KeyboardInterrupt extend BaseException and not Exception, but I am fairly confident these scenarios are already handled properly.
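
As a small illustration of the Exception vs. BaseException distinction, here is a stdlib-only sketch. TerminationSignal and run_with are hypothetical names made up for this example. It shows one practical difference: a broad except Exception in application code swallows a handler exception derived from Exception, but not one derived from BaseException.

```python
import signal


class TerminationError(Exception):
    """Handler exception derived from Exception, as in the snippet above."""


class TerminationSignal(BaseException):
    """Hypothetical alternative derived from BaseException, like KeyboardInterrupt."""


def run_with(exc_type):
    """Raise exc_type from a SIGTERM handler and report where it was caught."""

    def handler(signum, frame):
        raise exc_type()

    previous = signal.signal(signal.SIGTERM, handler)
    try:
        try:
            signal.raise_signal(signal.SIGTERM)
        except Exception:
            # Stands in for application code with a broad handler:
            # the program would keep running instead of terminating.
            return "swallowed"
    except BaseException:
        return "propagated"
    finally:
        signal.signal(signal.SIGTERM, previous)
    return "no exception"


if __name__ == "__main__":
    print(run_with(TerminationError))
    print(run_with(TerminationSignal))
```

Which base class is preferable depends on whether intermediate except Exception blocks should be able to intercept the termination request; this sketch makes no claim about how the SDK itself treats either case.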

herin049, Nov 26 '25 20:11