
threading: Exception at shutdown on musllinux_1_2_i686

bdarnell opened this issue 11 months ago

Bug report

Bug description:

When cibuildwheel runs the musllinux_1_2_i686 configuration, I often see exceptions logged when the end-of-process GC runs:

Exception ignored in: <function _DeleteDummyThreadOnDel.__del__ at 0xf2be5578>
  Traceback (most recent call last):
    File "/opt/_internal/cpython-3.13.0/lib/python3.13/threading.py", line 1383, in __del__
  TypeError: 'NoneType' object does not support the context manager protocol
  .Traceback (most recent call last):
    File "<string>", line 1, in <module>
      from tornado.ioloop import IOLoop; classname = lambda x: x.__class__.__name__; from tornado.platform.asyncio import AsyncIOMainLoop; AsyncIOMainLoop().install(); print(classname(IOLoop.current()))

It looks like something about musl libc is causing _active_limbo_lock to get GC'd before _thread_local_info._track_dummy_thread_ref. This happens on musllinux (1.2) i686 builds; 64-bit builds (x86 and arm) and manylinux builds are not affected. Older versions of Python are not affected. I have not tested with the 3.14 alphas, but I think they are likely affected, since the relevant code doesn't appear to have changed.
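
For reference, the failing code is _DeleteDummyThreadOnDel.__del__ in Lib/threading.py, which in 3.13 looks roughly like this (a sketch reconstructed from the patch quoted later in this thread):

class _DeleteDummyThreadOnDel:
    def __del__(self):
        with _active_limbo_lock:  # a module global; can be None at shutdown
            if _active.get(self._tident) is self._dummy_thread:
                _active.pop(self._tident, None)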

A complete log with this failure can be found at https://github.com/bdarnell/tornado/actions/runs/13459578244/job/37733273132

This issue is not a priority for me. I'm just going to skip this configuration in my CI and I'm filing an issue so I have something to point to.

CPython versions tested on:

3.13

Operating systems tested on:

Linux

Linked PRs

  • gh-131537

bdarnell avatar Feb 24 '25 20:02 bdarnell

Thanks for the heads-up. Do you have a reproducer without Tornado?

ZeroIntensity avatar Feb 25 '25 12:02 ZeroIntensity

I am seeing this issue in Python 3.13.0 on Windows 10 under VSCode. I don't have a simple reproduction, as it has shown up after running a pytest suite of an app.

I stuck a traceback.print_stack(file=sys.stdout) call as the first line of the __init__ of class _DeleteDummyThreadOnDel.
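
That is, roughly this temporary instrumentation (a sketch; the original body is unchanged below the print):

class _DeleteDummyThreadOnDel:
    def __init__(self, dummy_thread):
        import sys, traceback
        traceback.print_stack(file=sys.stdout)  # temporary debugging aid
        ...

which printed this call stack: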

  File ".vscode\extensions\ms-python.debugpy-2025.1.2025022401-win32-x64\bundled\libs\debugpy\_vendored\pydevd\_pydevd_sys_monitoring\_pydevd_sys_monitoring.py", line 1682, in _start_method_event
    thread_info = _get_thread_info(True, 1)
  File ".vscode\extensions\ms-python.debugpy-2025.1.2025022401-win32-x64\bundled\libs\debugpy\_vendored\pydevd\_pydevd_sys_monitoring\_pydevd_sys_monitoring.py", line 447, in _get_thread_info
    thread_info = _create_thread_info(depth + 1)
  File ".vscode\extensions\ms-python.debugpy-2025.1.2025022401-win32-x64\bundled\libs\debugpy\_vendored\pydevd\_pydevd_sys_monitoring\_pydevd_sys_monitoring.py", line 329, in _create_thread_info
    t = threading.current_thread()
  File ".pyenv\pyenv-win\versions\3.13.0\Lib\threading.py", line 1441, in current_thread
    return _DummyThread()
  File ".pyenv\pyenv-win\versions\3.13.0\Lib\threading.py", line 1411, in __init__
    _DeleteDummyThreadOnDel(self)

and by the time the __del__ is called, _active_limbo_lock seems to be None.

If the __del__ can get called after _active_limbo_lock is "cleaned up" (assuming that's what is happening), could the fix be as simple as first checking _active_limbo_lock before the with _active_limbo_lock: call?
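
Something like this, i.e. a rough sketch of the idea (not a tested patch):

def __del__(self):
    if _active_limbo_lock is None:  # module globals already torn down
        return
    with _active_limbo_lock:
        if _active.get(self._tident) is self._dummy_thread:
            _active.pop(self._tident, None)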

jrobbins-LiveData avatar Mar 02 '25 15:03 jrobbins-LiveData

Maybe, but it would be good to figure out why it's None. Are there daemon threads active?

ZeroIntensity avatar Mar 02 '25 15:03 ZeroIntensity

I'm testing an async FastAPI route via pytest inside VSCode. My app has no explicit threads. The code involved seems to be VSCode's instrumentation of the test run, about which I know next to nothing. Which is a long-winded way of saying I don't know whether there are daemon threads. Is there some code I could stick another traceback print in, to see who might be creating daemon threads?

jrobbins-LiveData avatar Mar 02 '25 16:03 jrobbins-LiveData

I'm not even certain that daemon threads are the culprit; they're just typically mischief-makers when it comes to threading. Given that this has occurred only in ASGI test suites, I'd be inclined to look for some common test dependency, not necessarily something on our end.
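
As an aside, a quick way to see which threads are alive at any point (and whether they're daemons) is something like:

import threading

for t in threading.enumerate():
    print(t.name, t.daemon, type(t).__name__)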

Would you mind bisecting to which test of yours is causing the error? That should give us a clearer idea of who to report this to.

ZeroIntensity avatar Mar 02 '25 17:03 ZeroIntensity

After my pytest code has run (but its pytest fixtures have not yet been torn down), this code (anyio's start_blocking_portal) is running:

455: @contextmanager
456: def start_blocking_portal(
457:     backend: str = "asyncio", backend_options: dict[str, Any] | None = None
458: ) -> Generator[BlockingPortal, Any, None]:
459:     """
460:     Start a new event loop in a new thread and run a blocking portal in its main task.
461: 
462:     The parameters are the same as for :func:`~anyio.run`.
463: 
464:     :param backend: name of the backend
465:     :param backend_options: backend options
466:     :return: a context manager that yields a blocking portal
467: 
468:     .. versionchanged:: 3.0
469:         Usage as a context manager is now required.
470: 
471:     """
472: 
473:     async def run_portal() -> None:
474:         async with BlockingPortal() as portal_:
475:             future.set_result(portal_)
476:             await portal_.sleep_until_stopped()
477: 
478:     def run_blocking_portal() -> None:
479:         if future.set_running_or_notify_cancel():
480:             try:
481:                 _eventloop.run(
482:                     run_portal, backend=backend, backend_options=backend_options
483:                 )
484:             except BaseException as exc:
485:                 if not future.done():
486:                     future.set_exception(exc)
487: 
488:     future: Future[BlockingPortal] = Future()
489:     thread = Thread(target=run_blocking_portal, daemon=True)
490:     thread.start()
491:     try:
492:         cancel_remaining_tasks = False
493:         portal = future.result()
494:         try:
495:             yield portal  ### cleanup activities seem to run off of here
496:         except BaseException:
497:             cancel_remaining_tasks = True
498:             raise
499:         finally:
500:             try:
501:                 portal.call(portal.stop, cancel_remaining_tasks)
502:             except RuntimeError:
503:                 pass
504:     finally:
505:         thread.join()  ### _DeleteDummyThreadOnDel.__init__ runs off of this call

The thread.join() at line 505 somehow leads to the __init__ of class _DeleteDummyThreadOnDel being called. Then many pytest "finalizers" and other things run. I assume that VSCode is in the process of winding down the instance of python it launched to run pytest in. On this run, with more tracebacks and me pausing and single stepping in the debugger, the __del__ of class _DeleteDummyThreadOnDel doesn't run until very late in the shutdown process. Here's the output

======================== 1 passed in 584.78s (0:09:44) ========================
Exception ignored in: <function _DeleteDummyThreadOnDel.__del__ at 0x0000024707EE7100>
Traceback (most recent call last):
  File "C:\Users\jeff1\.pyenv\pyenv-win\versions\3.13.0\Lib\threading.py", line 1385, in __del__
ImportError: sys.meta_path is None, Python is likely shutting down

According to the data model docs (datamodel.html#object):

> __del__() can be executed during interpreter shutdown. As a consequence, the global variables it needs to access (including other modules) may already have been deleted or set to None.

and this apparently is happening (inferred from _active_limbo_lock being None in the error I captured earlier and now suggested by the "...Python is likely shutting down" message captured here). Given that nothing prevents anyone from calling threading.current_thread() "late in the game", it seems that there's some risk in the __del__ method assuming it can still access that lock.

jrobbins-LiveData avatar Mar 03 '25 00:03 jrobbins-LiveData

Ugh, daemon threads strike again. This looks like an actual regression; that code was added in 3.13: GH-114424

It's hard to tell without seeing the entire test case, but I'm fairly sure this isn't anyio's fault. My interpretation of what's happening is that the daemon thread grabs the GIL just before finalization, does something that screws up the globals (e.g., a garbage collection), then releases the GIL at a bytecode boundary, and then the main thread is left with a dummy thread that can't handle whatever happened.

Checking if the _active_limbo_lock is None probably isn't a great idea. Either:

  • Check sys.is_finalizing() (but that's probably prone to races; e.g., finalization could start while acquiring the lock).
  • Use a more robust solution for removing from threading._active, presumably something like a weakref finalizer instead of a __del__ (rough sketch below).
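
A rough sketch of the weakref-finalizer idea (hypothetical; the names and placement are illustrative, not a worked-out patch):

import threading
import weakref

def _track_dummy(dummy):
    # Bind the dict and lock as default arguments so the callback does
    # not depend on threading's module globals surviving shutdown;
    # weakref.finalize callbacks also run via an atexit hook, before
    # module teardown.
    def _cleanup(tident, _active=threading._active,
                 _lock=threading._active_limbo_lock):
        with _lock:
            _active.pop(tident, None)
    weakref.finalize(dummy, _cleanup, dummy.ident)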

cc @graingert as an anyio expert.

ZeroIntensity avatar Mar 03 '25 01:03 ZeroIntensity

Sounds like a starlette TestClient missing a with, and then a GeneratorExit closing a BlockingPortal contextmanager?

graingert avatar Mar 03 '25 08:03 graingert

re "...TestClient missing a with": our conftest.py has this fixture, and that's the only instance of TestClient I can see.

@pytest.fixture
def client():
    with TestClient(app) as client:
        yield client

jrobbins-LiveData avatar Mar 03 '25 11:03 jrobbins-LiveData

re "...TestClient missing a with": our conftest.py has this fixture, and that's the only instance of TestClient I can see.

@pytest.fixture
def client():
    with TestClient(app) as client:
        yield client

This smells like it might be https://discuss.python.org/t/preventing-yield-inside-certain-context-managers/1091, try replacing it with

@pytest.fixture
def client():
    return TestClient(app)

def my_test(client):
    with client:
        ...

jakkdl avatar Mar 03 '25 12:03 jakkdl

I only skimmed that lengthy link, and I couldn't understand why the Starlette TestClient would qualify as the sort of context manager one shouldn't yield from. I'll definitely try it for the sake of seeing what happens. The problem I'm seeing is intermittent, so I'll need some time to try it thoroughly.

That said, I feel like I came in during the middle of the movie.

It seems that relying on a __del__ that refers to an external lock can't be fundamentally sound, since the "rules of __del__" seem to allow for that lock to be gone when the __del__ finally runs.

But I don't understand the specific case that _DeleteDummyThreadOnDel is covering.

And what do we do about the seemingly standard use of this TestClient as a pytest fixture? (See https://sqlmodel.tiangolo.com/tutorial/fastapi/tests/#add-more-tests for examples).

The notion that some context managers oughtn't be yielded from is surprising, and I'd like to better understand the characteristics described in the linked discussion:

> the only context managers that want to use forbid_yield are (1) cancel scopes, (2) nurseries, (3) context managers that are composed out of cancel scopes or nurseries.
>
> Here's some examples of context managers where I was not planning to use forbid_yield:
>
> with socket_object:
> async with stream_object:
> with contextlib.suppress(...):
> with log_entry_and_exit():

How does one determine that a third party library context manager (e.g. TestClient) is a cancel scope or a nursery or composed out of cancel scopes or nurseries? I maybe vaguely understand what a "cancel scope" is, but have no idea, in this context, as to what a "nursery" is. What specifically in Starlette's TestClient is the thing that makes it not ok to yield from?

jrobbins-LiveData avatar Mar 03 '25 13:03 jrobbins-LiveData

Trying to run my test without TestClient as a context manager fails, which reminded me why I made it a context manager to begin with. As described in https://fastapi.tiangolo.com/advanced/testing-events/, in order to get an ASGI lifespan activated by TestClient, you have to use it that way.

I will try to refactor my tests to not need client as a fixture at all, which hopefully will let me try the experiment successfully.

jrobbins-LiveData avatar Mar 03 '25 13:03 jrobbins-LiveData

Sorry, yes, you still need to use it as a context manager, but instead of entering the context manager in the fixture, you enter it in the test case itself, to avoid yielding through it.

I'm not personally familiar with starlette.TestClient, so I might be pattern-matching incorrectly here, but context managers + yields are often problematic in async contexts. ASYNC101 and ASYNC119 have more discussion/links.

jakkdl avatar Mar 03 '25 14:03 jakkdl

Thank you for those links! I made a public GitHub repo to try to share a reproducible test case.

https://github.com/LiveData-Inc/del-issue

Use poetry 2.1.1 to poetry install the project (tested on a Windows 10 machine).

Set a breakpoint in routes.py create_user.

[screenshot: breakpoint set in routes.py create_user]

Then launch the pytest case as illustrated

[screenshot: launching the pytest case]

When the problem reproduces (not always, sorry) you can see the output as shown

[screenshot: the error output]

jrobbins-LiveData avatar Mar 03 '25 14:03 jrobbins-LiveData

> It seems that relying on a __del__ that refers to an external lock can't be fundamentally sound, since the "rules of __del__" seem to allow for that lock to be gone when the __del__ finally runs.

The reference to the lock in the global namespace is gone, but if you save a reference to the lock on the _DeleteDummyThreadOnDel object you can still have a reference to it during __del__. I've used this before as a workaround for destructor-time issues. (The main thing to watch out for is the possibility that some thread was abruptly shut down while holding the lock).

> When the problem reproduces (not always, sorry)

My one-liner repro appears to reproduce every time, FWIW. It brings in Tornado but just uses a small part of it. I haven't tried to dig into this because I don't have a good way to test in a musllinux environment.

python -c "from tornado.ioloop import IOLoop; classname = lambda x: x.__class__.__name__; from tornado.platform.asyncio import AsyncIOMainLoop; AsyncIOMainLoop().install(); print(classname(IOLoop.current()))"

bdarnell avatar Mar 04 '25 16:03 bdarnell

Actually, where ought I be running that one-liner to see the __del__ issue? If I run it on my Windows 10 machine, I just see it output

AsyncIOMainLoop

jrobbins-LiveData avatar Mar 04 '25 18:03 jrobbins-LiveData

Ah, I didn't realize you were working on windows. My example runs in a musllinux environment (as it says in the subject line) which I know nothing about except that cibuildwheel manages it for me.

bdarnell avatar Mar 04 '25 18:03 bdarnell

Sorry for missing that!

jrobbins-LiveData avatar Mar 04 '25 18:03 jrobbins-LiveData

I've been encountering this issue reliably 100% of the time with the PyCharm pydevd debugger. Applying the following patch to threading.py to keep a reference to the active dict and the global lock is less than ideal but lets the debugger continue.

     def __init__(self, dummy_thread):
+        self._active_limbo_lock = _active_limbo_lock
+        self._active = _active
         self._dummy_thread = dummy_thread
         self._tident = dummy_thread.ident
         # Put the thread on a thread local variable so that when
@@ -1380,7 +1381,7 @@
         _thread_local_info._track_dummy_thread_ref = self
 
     def __del__(self):
-        with _active_limbo_lock:
+        with self._active_limbo_lock:
             if _active.get(self._tident) is self._dummy_thread:
                 _active.pop(self._tident, None)

TkTech avatar Mar 09 '25 06:03 TkTech

Oh it should probably just be:

     def __del__(self, _active_limbo_lock=_active_limbo_lock, _active=_active): ...
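
(Default parameter values are evaluated once, at function definition time, and stored on the function object itself, so they keep the lock and the dict reachable even after threading's module globals have been cleared at shutdown.)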

graingert avatar Mar 09 '25 09:03 graingert

> I've been encountering this issue reliably 100% of the time with the PyCharm pydevd debugger. Applying the following patch to threading.py to keep a reference to the active dict and the global lock is less than ideal but lets the debugger continue.
>
> [patch quoted in full above]

Same for me on macOS 15.4 with the latest PyCharm (2025.1).

liorwavebl avatar May 13 '25 07:05 liorwavebl

After a lot of investigation, I've finally found the cause of this bug. Here's a small repro:

import sys
import threading
import time

class Evil(threading.Thread):
    def __init__(self):
        # Hold a reference to this module so it stays alive until
        # threading itself is torn down (running Thread objects are
        # tracked in threading._active).
        self.ref = sys.modules[__name__]
        super().__init__(target=self.run, daemon=True)

    def run(self):
        while True:
            time.sleep(1)

Evil().start()
x = threading._DummyThread()  # kept alive by __main__, deleted only after threading's globals are cleared

Or an even smaller repro:

import sys
import threading

# Make threading hold a reference to __main__, so __main__ outlives the
# clearing of threading's globals (including _active_limbo_lock).
threading.x = sys.modules[__name__]
x = threading._DummyThread()  # dies with __main__, after the lock is gone

This is an incredible heisenbug that happens under very specific circumstances in practice. For background info:

  1. During finalization, sys.modules is cleared and then garbage collection is explicitly run. This cleans up most modules and objects, but threading is one of the few that survives past this phase.
  2. For any modules that live past the clearing, they have their globals set to None.
  3. After globals are cleared, the sysdict and builtins are cleared, preventing __del__ from running past that point, so the bug is also avoided there.

In order to trigger the bug, you need to create and delete a _DummyThread object after sys.modules has been cleared, but before builtins are destroyed, because otherwise the __del__ won't run.

So, with that in mind, here's how the repro works:

  1. Create a custom daemon thread object that holds a reference to the current module, keeping it alive until threading is destroyed. This is because all Thread objects are stored in a global threading._active variable. Note that this part of the reproducer is not specific to threading; any module that survives past the initial garbage collection can be used. For example, I'm pretty sure that in Tornado's case, asyncio is storing a reference to the module through a task that's stored in a global. You can simplify this part of the repro down to just threading.x = sys.modules[__name__], but I think the daemon thread is better for practicality's sake.
  2. threading has its globals cleared, destroying the _active_limbo_lock and also the _thread_local_info that keeps a reference to the _DeleteDummyThreadOnDel object. Normally, this would trigger the __del__ here and drop the thread. But since __main__ still has a reference to the _DummyThread, it's not yet destroyed.
  3. __main__ is finally cleared, and since the _DummyThread doesn't have reference cycles, its reference count can hit zero without the GC, thus triggering the deallocator and causing the bug.

This is so rare for two reasons. First, modules generally don't hold references to one another for this long; it only happens in specific cases, like something in asyncio taking longer than usual or in a daemon thread. Second, _DummyThread objects just aren't that common. They only appear when calling threading.current_thread in a thread that wasn't created by the current interpreter's threading module, i.e., in threads created using extensions, or in threads spawned by a different interpreter. In @TkTech's case, the latter is what's happening: the main interpreter is creating a thread, and then a subinterpreter is creating a _DummyThread.
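
For illustration, one way to end up with a _DummyThread is a thread created by the low-level _thread module, which bypasses threading's bookkeeping (a minimal sketch):

import _thread
import threading
import time

def work():
    # threading didn't create this thread, so it has no Thread object
    # for it; current_thread() falls back to a _DummyThread.
    print(type(threading.current_thread()).__name__)  # _DummyThread

_thread.start_new_thread(work, ())
time.sleep(0.5)  # crude: give the worker a moment to run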

ZeroIntensity avatar Jun 27 '25 23:06 ZeroIntensity

The fix from https://github.com/python/cpython/pull/131537 looks fine to me; the pattern of keeping a ref to module globals is very common and is used in many places, such as asyncio.

kumaraditya303 avatar Jul 21 '25 09:07 kumaraditya303

This will be fixed in 3.13.6 and hopefully 3.14.0. The 3.14 backport hasn't yet landed because the branch is locked until October, but I don't see any reason why it wouldn't make it in. If not, it'll make it in 3.14.1.

Thanks for the report @bdarnell and @TkTech for the fix!

ZeroIntensity avatar Jul 25 '25 15:07 ZeroIntensity