threading: Exception at shutdown on musllinux_1_2_i686
Bug report
Bug description:
When cibuildwheel runs the musllinux_1_2_i686 configuration, I often see exceptions logged when the end-of-process GC runs:
Exception ignored in: <function _DeleteDummyThreadOnDel.__del__ at 0xf2be5578>
Traceback (most recent call last):
File "/opt/_internal/cpython-3.13.0/lib/python3.13/threading.py", line 1383, in __del__
TypeError: 'NoneType' object does not support the context manager protocol
Traceback (most recent call last):
File "<string>", line 1, in <module>
from tornado.ioloop import IOLoop; classname = lambda x: x.__class__.__name__; from tornado.platform.asyncio import AsyncIOMainLoop; AsyncIOMainLoop().install(); print(classname(IOLoop.current()))
It looks like something about musl libc is causing _active_limbo_lock to get GC'd before _thread_local_info._track_dummy_thread_ref. This happens on musllinux (1.2) i686 builds; 64-bit builds (x86 and arm) and manylinux builds are not affected. Older versions of Python are not affected. I have not tested with the 3.14 alphas, but they are likely affected since the relevant code doesn't appear to have changed.
A complete log with this failure can be found at https://github.com/bdarnell/tornado/actions/runs/13459578244/job/37733273132
This issue is not a priority for me. I'm just going to skip this configuration in my CI and I'm filing an issue so I have something to point to.
CPython versions tested on:
3.13
Operating systems tested on:
Linux
Linked PRs
- gh-131537
Thanks for the heads-up. Do you have a reproducer without Tornado?
I am seeing this issue in Python 3.13.0 on Windows 10 under VSCode. I don't have a simple reproduction, as it has shown up after running a pytest suite of an app.
I stuck a traceback.print_stack(file=sys.stdout) as the first line of the __init__ function of class _DeleteDummyThreadOnDel to see this call stack:
File ".vscode\extensions\ms-python.debugpy-2025.1.2025022401-win32-x64\bundled\libs\debugpy\_vendored\pydevd\_pydevd_sys_monitoring\_pydevd_sys_monitoring.py", line 1682, in _start_method_event
thread_info = _get_thread_info(True, 1)
File ".vscode\extensions\ms-python.debugpy-2025.1.2025022401-win32-x64\bundled\libs\debugpy\_vendored\pydevd\_pydevd_sys_monitoring\_pydevd_sys_monitoring.py", line 447, in _get_thread_info
thread_info = _create_thread_info(depth + 1)
File ".vscode\extensions\ms-python.debugpy-2025.1.2025022401-win32-x64\bundled\libs\debugpy\_vendored\pydevd\_pydevd_sys_monitoring\_pydevd_sys_monitoring.py", line 329, in _create_thread_info
t = threading.current_thread()
File ".pyenv\pyenv-win\versions\3.13.0\Lib\threading.py", line 1441, in current_thread
return _DummyThread()
File ".pyenv\pyenv-win\versions\3.13.0\Lib\threading.py", line 1411, in __init__
_DeleteDummyThreadOnDel(self)
and by the time the __del__ is called, _active_limbo_lock seems to be None.
If the __del__ can get called after _active_limbo_lock is "cleaned up" (assuming that's what is happening), could the fix be as simple as first checking _active_limbo_lock before the with _active_limbo_lock: call?
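That is, roughly (a sketch against the names in threading.py, not a tested patch):
def __del__(self):
    if _active_limbo_lock is None:  # module globals already torn down
        return
    with _active_limbo_lock:
        if _active.get(self._tident) is self._dummy_thread:
            _active.pop(self._tident, None)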
Maybe, but it would be good to figure out why it's None. Are there daemon threads active?
I'm testing an async FastAPI route via pytest inside VSCode. My app has no explicit threads. The code involved seems to be VSCode's instrumentation of the test run, about which I know next to nothing. Which is a long-winded way of saying I don't know whether there are daemon threads. Is there some code I could stick another traceback print in, to see who might be creating daemon threads?
I'm not even certain that daemon threads are the culprit; they're just typically mischief-makers when it comes to threading. Given that this has occurred in only ASGI test suites, I'd be inclined to look for some common test dependency, not necessarily something on our end.
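For what it's worth, you can list the live daemon threads at any point (e.g. from a conftest.py hook) with something like:
import threading

# Print every thread the threading module currently knows about,
# flagging which ones are daemons.
for t in threading.enumerate():
    print(t.name, "(daemon)" if t.daemon else "", t.ident)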
Would you mind bisecting to which test of yours is causing the error? That should give us a clearer idea of who to report this to.
After my pytest code has run (but its pytest fixtures have not yet been torn down), this code is running:
455: @contextmanager
456: def start_blocking_portal(
457:     backend: str = "asyncio", backend_options: dict[str, Any] | None = None
458: ) -> Generator[BlockingPortal, Any, None]:
459:     """
460:     Start a new event loop in a new thread and run a blocking portal in its main task.
461:
462:     The parameters are the same as for :func:`~anyio.run`.
463:
464:     :param backend: name of the backend
465:     :param backend_options: backend options
466:     :return: a context manager that yields a blocking portal
467:
468:     .. versionchanged:: 3.0
469:         Usage as a context manager is now required.
470:
471:     """
472:
473:     async def run_portal() -> None:
474:         async with BlockingPortal() as portal_:
475:             future.set_result(portal_)
476:             await portal_.sleep_until_stopped()
477:
478:     def run_blocking_portal() -> None:
479:         if future.set_running_or_notify_cancel():
480:             try:
481:                 _eventloop.run(
482:                     run_portal, backend=backend, backend_options=backend_options
483:                 )
484:             except BaseException as exc:
485:                 if not future.done():
486:                     future.set_exception(exc)
487:
488:     future: Future[BlockingPortal] = Future()
489:     thread = Thread(target=run_blocking_portal, daemon=True)
490:     thread.start()
491:     try:
492:         cancel_remaining_tasks = False
493:         portal = future.result()
494:         try:
495:             yield portal  ### cleanup activities seem to run off of here
496:         except BaseException:
497:             cancel_remaining_tasks = True
498:             raise
499:         finally:
500:             try:
501:                 portal.call(portal.stop, cancel_remaining_tasks)
502:             except RuntimeError:
503:                 pass
504:     finally:
505:         thread.join()  ### _DeleteDummyThreadOnDel.__init__ run off of this call
The thread.join() at line 505 somehow leads to the __init__ of class _DeleteDummyThreadOnDel being called. Then many pytest "finalizers" and other things run. I assume that VSCode is in the process of winding down the instance of Python it launched to run pytest in. On this run, with more tracebacks and with me pausing and single-stepping in the debugger, the __del__ of class _DeleteDummyThreadOnDel doesn't run until very late in the shutdown process. Here's the output:
======================== 1 passed in 584.78s (0:09:44) ========================
Exception ignored in: <function _DeleteDummyThreadOnDel.__del__ at 0x0000024707EE7100>
Traceback (most recent call last):
File "C:\Users\jeff1\.pyenv\pyenv-win\versions\3.13.0\Lib\threading.py", line 1385, in __del__
ImportError: sys.meta_path is None, Python is likely shutting down
According to datamodel.html#object.__del__:
__del__() can be executed during interpreter shutdown. As a consequence, the global variables it needs to access (including other modules) may already have been deleted or set to None.
and this apparently is happening (inferred from _active_limbo_lock being None in the error I captured earlier and now suggested by the "...Python is likely shutting down" message captured here). Given that nothing prevents anyone from calling threading.current_thread() "late in the game", it seems that there's some risk in the __del__ method assuming it can still access that lock.
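To illustrate the general hazard outside of threading, here's a minimal, hypothetical script (not from my test suite) in which a __del__ running late in shutdown sees a module global that has already been set to None:
LOCK = object()

class Late:
    def __del__(self):
        # Runs while this module's globals are being torn down at interpreter
        # shutdown; by then LOCK has typically already been set to None.
        print("LOCK is", LOCK)

late = Late()  # only dropped when the module's globals are cleared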
Ugh, daemon threads strike again. This looks like an actual regression, that code was added in 3.13: GH-114424
It's hard to tell without seeing the entire test case, but I'm roughly sure that this isn't anyio's fault. My interpretation of what's happening is that the daemon thread grabs the GIL just before finalization, then does something which screws up the globals e.g., a garbage collection, then releases the GIL in a bytecode boundary, and then the main thread has a dummy thread that can't handle whatever happened.
Checking if the _active_limbo_lock is None probably isn't a great idea. Either:
- Check sys.is_finalizing() (but that's probably prone to races, e.g. it could start finalizing while acquiring the lock).
- Use a more robust solution for removing from threading._active, presumably something like a weakref finalizer instead of a __del__.
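For instance, a rough sketch of the weakref route (using threading's private names purely for illustration, not a tested patch):
import threading
import weakref

def track_dummy_thread(dummy_thread,
                       lock=threading._active_limbo_lock,
                       active=threading._active):
    # The lock and dict are bound as defaults at definition time, so the
    # cleanup keeps working even after threading's globals are set to None.
    tident = dummy_thread.ident

    def untrack():
        # Deliberately holds no strong reference to dummy_thread, so the
        # finalizer doesn't keep the thread object alive. Unlike the real
        # __del__, it can't do the "is this still the same thread?" identity
        # check, so it pops unconditionally.
        with lock:
            active.pop(tident, None)

    # Runs when dummy_thread is garbage collected, without relying on __del__.
    return weakref.finalize(dummy_thread, untrack)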
cc @graingert as an anyio expert.
Sounds like a starlette TestClient missing a with, and then a GeneratorExit closing a BlockingPortal contextmanager?
re "...TestClient missing a with": our conftest.py has this fixture, and that's the only instance of TestClient I can see.
@pytest.fixture
def client():
    with TestClient(app) as client:
        yield client
re "...TestClient missing a
with": ourconftest.pyhas this fixture, and that's the only instance ofTestClientI can see.@pytest.fixture def client(): with TestClient(app) as client: yield client
This smells like it might be https://discuss.python.org/t/preventing-yield-inside-certain-context-managers/1091; try replacing it with:
@pytest.fixture()
def client():
    return TestClient(app)

def my_test(client):
    with client:
        ...
I only skimmed that lengthy link, and I couldn't understand why the Starlette TestClient would qualify as the sort of context manager one shouldn't yield from. I'll definitely try it for the sake of seeing what happens. The problem I'm seeing is intermittent, so I'll need some time to try it thoroughly.
That said, I feel like I came in during the middle of the movie.
It seems that relying on a __del__ that refers to an external lock can't be fundamentally sound, since the "rules of __del__" seem to allow for that lock to be gone when the __del__ finally runs.
But I don't understand the specific case that _DeleteDummyThreadOnDel is covering.
And what do we do about the seemingly standard use of this TestClient as a pytest fixture? (See https://sqlmodel.tiangolo.com/tutorial/fastapi/tests/#add-more-tests for examples).
The notion that some context managers oughtn't be yielded from is surprising, and I'd like to understand better the characteristics described in the linked discussion:
the only context managers that want to use forbid_yield are (1) cancel scopes, (2) nurseries, (3) context managers that are composed out of cancel scopes or nurseries.
Here’s some examples of context managers where I was not planning to use forbid_yield:
with socket_object:
async with stream_object:
with contextlib.suppress(...):
with log_entry_and_exit():
How does one determine that a third party library context manager (e.g. TestClient) is a cancel scope or a nursery or composed out of cancel scopes or nurseries? I maybe vaguely understand what a "cancel scope" is, but have no idea, in this context, as to what a "nursery" is. What specifically in Starlette's TestClient is the thing that makes it not ok to yield from?
Trying to run my test without TestClient as a context manager fails, and reminded me why I made it a context manager to begin with. As described in https://fastapi.tiangolo.com/advanced/testing-events/, in order to get an ASGI lifespan activated by TestClient, you have to use it that way.
I will try to refactor my tests to not need client as a fixture at all, which hopefully will let me try the experiment successfully.
Sorry, yes: you still need to use it as a context manager, but instead of entering the context manager in the fixture, you enter it in the test case itself, to avoid yielding through it.
I'm not personally familiar with starlette.TestClient, so I might be pattern-matching incorrectly here, but context managers + yields are often problematic in async contexts. ASYNC101 and ASYNC119 have more discussion/links.
Thank you for those links! I made a public GitHub repo to try to share a reproducible test case.
https://github.com/LiveData-Inc/del-issue
Use Poetry 2.1.1 to poetry install the project (tested on a Windows 10 machine).
Set a breakpoint in routes.py create_user.
Then launch the pytest case as illustrated
When the problem reproduces (not always, sorry) you can see the output as shown
It seems that relying on a __del__ that refers to an external lock can't be fundamentally sound, since the "rules of __del__" seem to allow for that lock to be gone when the __del__ finally runs.
The reference to the lock in the global namespace is gone, but if you save a reference to the lock on the _DeleteDummyThreadOnDel object you can still have a reference to it during __del__. I've used this before as a workaround for destructor-time issues. (The main thing to watch out for is the possibility that some thread was abruptly shut down while holding the lock).
When the problem reproduces (not always, sorry)
My one-liner repro appears to reproduce every time, FWIW. It brings in Tornado but just uses a small part of it. I haven't tried to dig into this because I don't have a good way to test in a musllinux environment.
python -c "from tornado.ioloop import IOLoop; classname = lambda x: x.__class__.__name__; from tornado.platform.asyncio import AsyncIOMainLoop; AsyncIOMainLoop().install(); print(classname(IOLoop.current()))"
Actually, where ought I be running that one-liner to see the __del__ issue? If I run it on my Windows 10 machine, I just see it output
AsyncIOMainLoop
Ah, I didn't realize you were working on windows. My example runs in a musllinux environment (as it says in the subject line) which I know nothing about except that cibuildwheel manages it for me.
Sorry for missing that!
I've been encountering this issue reliably 100% of the time with the PyCharm pydevd debugger. Applying the following patch to threading.py to keep a reference to the active dict and the global lock is less than ideal but lets the debugger continue.
     def __init__(self, dummy_thread):
+        self._active_limbo_lock = _active_limbo_lock
+        self._active = _active
         self._dummy_thread = dummy_thread
         self._tident = dummy_thread.ident
         # Put the thread on a thread local variable so that when
@@ -1380,7 +1381,7 @@
         _thread_local_info._track_dummy_thread_ref = self

     def __del__(self):
-        with _active_limbo_lock:
+        with self._active_limbo_lock:
             if _active.get(self._tident) is self._dummy_thread:
                 _active.pop(self._tident, None)
Oh it should probably just be:
def __del__(self, _active_limbo_lock=_active_limbo_lock, _active=_active): ...
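Expanded, that suggestion reads (a sketch; the default arguments bind threading's globals at function definition time, so they stay reachable after module teardown):
def __del__(self, _active_limbo_lock=_active_limbo_lock, _active=_active):
    # The lock and dict were captured as defaults when the method was defined,
    # so they remain usable even after threading's module globals have been
    # set to None during interpreter shutdown.
    with _active_limbo_lock:
        if _active.get(self._tident) is self._dummy_thread:
            _active.pop(self._tident, None)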
Same for me on macOS 15.4 with the latest PyCharm (2025.1).
After a lot of investigation, I've finally found the cause of this bug. Here's a small repro:
import sys
import threading
import time

class Evil(threading.Thread):
    def __init__(self):
        self.ref = sys.modules[__name__]
        super().__init__(target=self.run, daemon=True)

    def run(self):
        while True:
            time.sleep(1)

Evil().start()
x = threading._DummyThread()
Or an even smaller repro:
import sys
import threading
threading.x = sys.modules[__name__]
x = threading._DummyThread()
This is an incredible heisenbug that happens under very specific circumstances in practice. For background info:
- During finalization, sys.modules is cleared and then garbage collection is explicitly run. This cleans up most modules and objects, but threading is one of the few that survives past this phase.
- Any modules that live past the clearing have their globals set to None.
- After globals are cleared, the sysdict and builtins are cleared, preventing __del__ from running past that point, so the bug is also avoided there.
In order to trigger the bug, you need to create and delete a _DummyThread object after sys.modules has been cleared, but before builtins are destroyed, because otherwise the __del__ won't run.
So, with that in mind, here's how the repro works:
- Create a custom daemon thread object that holds a reference to the current module, keeping it alive until threading is destroyed. This works because all Thread objects are stored in the global threading._active variable. Note that this part of the reproducer is not specific to threading; any module that survives past the initial garbage collection can be used. For example, I'm pretty sure that in Tornado's case, asyncio is storing a reference to the module through a task that's stored in a global. You can simplify this part of the repro down to just threading.x = sys.modules[__name__], but I think the daemon thread is better for practicality's sake.
- threading has its globals cleared, destroying the _active_limbo_lock, and also the _thread_local_info that keeps a reference to the _DeleteDummyThreadOnDel object. Normally, this would trigger the __del__ here and drop the thread. But since __main__ still has a reference to the _DummyThread, it's not yet destroyed.
- __main__ is finally cleared, and since the _DummyThread doesn't have reference cycles, its reference count can hit zero without the GC, thus triggering the deallocator and causing the bug.
This is so rare for two reasons. First, modules generally don't hold references to one another for this long; it only happens in specific cases, like something in asyncio taking longer than usual, or in a daemon thread. Second, _DummyThread objects just aren't that common. They only appear when calling threading.current_thread in a thread that wasn't created by the current interpreter's threading module, i.e. in threads created using extensions, or in threads spawned by a different interpreter. In @TkTech's case, the latter is what's happening: the main interpreter is creating a thread, and then a subinterpreter is creating a _DummyThread.
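As an illustration of that last point, calling threading.current_thread() from a thread the threading module didn't create, e.g. one started with the low-level _thread module, is what produces a _DummyThread (a minimal sketch):
import _thread
import threading
import time

def report():
    # threading didn't create this thread, so current_thread() has to
    # synthesize a placeholder object for it on demand.
    print(type(threading.current_thread()).__name__)  # "_DummyThread"

_thread.start_new_thread(report, ())
time.sleep(0.5)  # let the thread run before the interpreter exits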
The fix from https://github.com/python/cpython/pull/131537 looks fine to me; the pattern of keeping a ref to module globals is very common and is used in many places, such as asyncio.
This will be fixed in 3.13.6 and hopefully 3.14.0. The 3.14 backport hasn't yet landed because the branch is locked until October, but I don't see any reason why it wouldn't make it in. If not, it'll make it in 3.14.1.
Thanks for the report @bdarnell and @TkTech for the fix!