
Segmentation fault on GC runs in multithreaded context

progval opened this issue 1 year ago

Description

I am getting the following crash under very specific conditions, while the Python GC is running.

I can reliably reproduce it in the pytest test suite of an (unfortunately) non-trivial code base. Each test in the suite starts a producer, writes a few messages, then starts a multiprocessing.Manager, then a consumer, and consumes the topics to the end.

The third test to run segfaults when I manually trigger the GC in a specific place (or the fourth does, when the GC eventually runs on its own).
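For illustration, the mechanism the backtrace below suggests can be sketched with the stdlib alone: an object left in an uncollected reference cycle before a fork() gets its destructor run inside the forked child once the GC collects there, even though any native threads it owns only ever existed in the parent. `FakeProducer` here is a hypothetical stand-in, not the real `confluent_kafka.Producer`; in the real object the deallocator calls `rd_kafka_destroy_app()`, which joins librdkafka's background threads.

```python
import gc
import multiprocessing
import os

class FakeProducer:
    """Hypothetical stand-in for confluent_kafka.Producer: the real object's
    C deallocator joins librdkafka's background threads, which do not
    survive fork()."""

    def __init__(self, queue):
        self.queue = queue
        self.self_ref = self  # reference cycle: only the cyclic GC frees it

    def __del__(self):
        try:
            self.queue.put(os.getpid())  # report who ran the destructor
        except Exception:
            pass

def _child_worker():
    # The forked child inherited the parent's unreachable cycle. Collecting
    # here runs __del__ in the child -- in the real Producer this is where
    # stale native thread handles would be joined and the process crashes.
    gc.collect()

def dealloc_pid_after_fork():
    ctx = multiprocessing.get_context("fork")
    queue = ctx.Queue()
    gc.disable()  # keep the parent from collecting the cycle first
    producer = FakeProducer(queue)
    del producer  # now unreachable garbage, waiting for a GC run
    proc = ctx.Process(target=_child_worker)
    proc.start()
    proc.join()
    dealloc_pid = queue.get(timeout=5)
    gc.enable()
    return dealloc_pid, proc.pid

if __name__ == "__main__":
    dealloc_pid, child_pid = dealloc_pid_after_fork()
    print(dealloc_pid == child_pid)  # True: the destructor ran in the child
```

The sketch only shows the Python-level hazard; the crash itself happens below Python, in `pthread_join` on a thread handle that was never duplicated by fork().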

==3552160== Invalid read of size 4
==3552160==    at 0x4A28B17: __pthread_clockjoin_ex (pthread_join_common.c:43)
==3552160==    by 0x4A2F96F: thrd_join@@GLIBC_2.34 (thrd_join.c:26)
==3552160==    by 0x774B3DB: rd_kafka_destroy_app (in /home/dev/.local/lib/python3.11/site-packages/confluent_kafka.libs/librdkafka-55260171.so.1)
==3552160==    by 0x75A86C3: Producer_dealloc (Producer.c:103)
==3552160==    by 0x5CCB14: UnknownInlinedFun (object.c:2390)
==3552160==    by 0x5CCB14: UnknownInlinedFun (object.h:538)
==3552160==    by 0x5CCB14: UnknownInlinedFun (dictobject.c:5566)
==3552160==    by 0x5CCB14: subtype_clear.lto_priv.0 (typeobject.c:1288)
==3552160==    by 0x50ABCD: UnknownInlinedFun (gcmodule.c:1013)
==3552160==    by 0x50ABCD: gc_collect_main.lto_priv.0 (gcmodule.c:1287)
==3552160==    by 0x6303E6: gc_collect_with_callback.lto_priv.0 (gcmodule.c:1400)
==3552160==    by 0x6756CF: UnknownInlinedFun (gcmodule.c:1538)
==3552160==    by 0x6756CF: gc_collect.lto_priv.0 (gcmodule.c.h:110)
==3552160==    by 0x53ACF6: cfunction_vectorcall_FASTCALL_KEYWORDS.lto_priv.0 (methodobject.c:443)
==3552160==    by 0x53AC2B: UnknownInlinedFun (pycore_call.h:92)
==3552160==    by 0x53AC2B: PyObject_Vectorcall (call.c:299)
==3552160==    by 0x52B93F: _PyEval_EvalFrameDefault (ceval.c:4772)
==3552160==    by 0x5855A3: UnknownInlinedFun (pycore_ceval.h:73)
==3552160==    by 0x5855A3: UnknownInlinedFun (ceval.c:6435)
==3552160==    by 0x5855A3: UnknownInlinedFun (call.c:393)
==3552160==    by 0x5855A3: _PyObject_VectorcallTstate.lto_priv.0 (pycore_call.h:92)
==3552160==  Address 0x1201f990 is not stack'd, malloc'd or (recently) free'd
==3552160== 
Fatal Python error: Segmentation fault

Current thread 0x0000000004b80880 (most recent call first):
  Garbage-collecting
  File "/home/dev/swh-environment/swh-dataset/build/__editable__.swh.dataset-1.5.2.dev7+g4fccea4.d20240523-py3-none-any/swh/dataset/journalprocessor.py", line 348 in progress_worker
  File "/usr/lib/python3.11/concurrent/futures/process.py", line 257 in _process_worker
  File "/usr/lib/python3.11/multiprocessing/process.py", line 108 in run
  File "/usr/lib/python3.11/multiprocessing/process.py", line 314 in _bootstrap
  File "/usr/lib/python3.11/multiprocessing/popen_fork.py", line 71 in _launch
  File "/usr/lib/python3.11/multiprocessing/popen_fork.py", line 19 in __init__
  File "/usr/lib/python3.11/multiprocessing/context.py", line 281 in _Popen
  File "/usr/lib/python3.11/multiprocessing/process.py", line 121 in start
  File "/usr/lib/python3.11/concurrent/futures/process.py", line 761 in _spawn_process
  File "/usr/lib/python3.11/concurrent/futures/process.py", line 750 in _launch_processes
  File "/usr/lib/python3.11/concurrent/futures/process.py", line 723 in _start_executor_manager_thread
  File "/usr/lib/python3.11/concurrent/futures/process.py", line 785 in submit
  File "/home/dev/swh-environment/swh-dataset/build/__editable__.swh.dataset-1.5.2.dev7+g4fccea4.d20240523-py3-none-any/swh/dataset/journalprocessor.py", line 316 in run
  File "/home/dev/swh-environment/swh-dataset/swh/dataset/test/test_journal_processor.py", line 170 in test_parallel_journal_processor_origin_visit_status
  File "/home/dev/swh-environment/swh-dataset/swh/dataset/test/test_journal_processor.py", line 50 in newf
  File "/home/dev/.local/lib/python3.11/site-packages/_pytest/python.py", line 195 in pytest_pyfunc_call
  File "/home/dev/.local/lib/python3.11/site-packages/pluggy/_callers.py", line 102 in _multicall
  File "/home/dev/.local/lib/python3.11/site-packages/pluggy/_manager.py", line 119 in _hookexec
  File "/home/dev/.local/lib/python3.11/site-packages/pluggy/_hooks.py", line 501 in __call__
  File "/home/dev/.local/lib/python3.11/site-packages/_pytest/python.py", line 1772 in runtest
  File "/home/dev/.local/lib/python3.11/site-packages/_pytest/runner.py", line 172 in pytest_runtest_call
  File "/home/dev/.local/lib/python3.11/site-packages/pluggy/_callers.py", line 102 in _multicall
  File "/home/dev/.local/lib/python3.11/site-packages/pluggy/_manager.py", line 119 in _hookexec
  File "/home/dev/.local/lib/python3.11/site-packages/pluggy/_hooks.py", line 501 in __call__
  File "/home/dev/.local/lib/python3.11/site-packages/_pytest/runner.py", line 240 in <lambda>
  File "/home/dev/.local/lib/python3.11/site-packages/_pytest/runner.py", line 340 in from_call
  File "/home/dev/.local/lib/python3.11/site-packages/_pytest/runner.py", line 239 in call_and_report
  File "/home/dev/.local/lib/python3.11/site-packages/_pytest/runner.py", line 134 in runtestprotocol
  File "/home/dev/.local/lib/python3.11/site-packages/_pytest/runner.py", line 115 in pytest_runtest_protocol
  File "/home/dev/.local/lib/python3.11/site-packages/pluggy/_callers.py", line 102 in _multicall
  File "/home/dev/.local/lib/python3.11/site-packages/pluggy/_manager.py", line 119 in _hookexec
  File "/home/dev/.local/lib/python3.11/site-packages/pluggy/_hooks.py", line 501 in __call__
  File "/home/dev/.local/lib/python3.11/site-packages/_pytest/main.py", line 364 in pytest_runtestloop
  File "/home/dev/.local/lib/python3.11/site-packages/pluggy/_callers.py", line 102 in _multicall
  File "/home/dev/.local/lib/python3.11/site-packages/pluggy/_manager.py", line 119 in _hookexec
  File "/home/dev/.local/lib/python3.11/site-packages/pluggy/_hooks.py", line 501 in __call__
  File "/home/dev/.local/lib/python3.11/site-packages/_pytest/main.py", line 339 in _main
  File "/home/dev/.local/lib/python3.11/site-packages/_pytest/main.py", line 285 in wrap_session
  File "/home/dev/.local/lib/python3.11/site-packages/_pytest/main.py", line 332 in pytest_cmdline_main
  File "/home/dev/.local/lib/python3.11/site-packages/pluggy/_callers.py", line 102 in _multicall
  File "/home/dev/.local/lib/python3.11/site-packages/pluggy/_manager.py", line 119 in _hookexec
  File "/home/dev/.local/lib/python3.11/site-packages/pluggy/_hooks.py", line 501 in __call__
  File "/home/dev/.local/lib/python3.11/site-packages/_pytest/config/__init__.py", line 174 in main
  File "/home/dev/.local/lib/python3.11/site-packages/_pytest/config/__init__.py", line 197 in console_main
  File "/home/dev/.local/bin/pytest", line 8 in <module>

Extension modules: confluent_kafka.cimpl, msgpack._cmsgpack, psutil._psutil_linux, psutil._psutil_posix, psycopg_c.pq, psycopg_c._psycopg, markupsafe._speedups, _brotli, simplejson._speedups, zmq.backend.cython.context, zmq.backend.cython.message, zmq.backend.cython.socket, zmq.backend.cython._device, zmq.backend.cython._poll, zmq.backend.cython._proxy_steerable, zmq.backend.cython._version, zmq.backend.cython.error, zmq.backend.cython.utils, tornado.speedups, plyvel._plyvel (total: 20)
==3552160== 
==3552160== Process terminating with default action of signal 11 (SIGSEGV)
==3552160==    at 0x4A28E2C: __pthread_kill_implementation (pthread_kill.c:44)
==3552160==    by 0x49D9FB1: raise (raise.c:26)
==3552160==    by 0x49DA04F: ??? (in /usr/lib/x86_64-linux-gnu/libc.so.6)
==3552160==    by 0x4A28B16: __pthread_clockjoin_ex (pthread_join_common.c:43)
==3552160== 
==3552160== HEAP SUMMARY:
==3552160==     in use at exit: 29,294,834 bytes in 30,069 blocks
==3552160==   total heap usage: 219,797 allocs, 189,728 frees, 522,042,917 bytes allocated
==3552160== 
==3552160== LEAK SUMMARY:
==3552160==    definitely lost: 4,547 bytes in 25 blocks
==3552160==    indirectly lost: 12,610 bytes in 30 blocks
==3552160==      possibly lost: 423,923 bytes in 214 blocks
==3552160==    still reachable: 28,853,754 bytes in 29,800 blocks
==3552160==         suppressed: 0 bytes in 0 blocks
==3552160== Rerun with --leak-check=full to see details of leaked memory
==3552160== 
==3552160== For lists of detected and suppressed errors, rerun with: -s
==3552160== ERROR SUMMARY: 1 errors from 1 contexts (suppressed: 0 from 0)
==3552159== 
==3552159== Process terminating with default action of signal 15 (SIGTERM)
==3552159==    at 0x4A23E96: __futex_abstimed_wait_common64 (futex-internal.c:57)
==3552159==    by 0x4A23E96: __futex_abstimed_wait_common (futex-internal.c:87)
==3552159==    by 0x4A2683B: __pthread_cond_wait_common (pthread_cond_wait.c:503)
==3552159==    by 0x4A2683B: pthread_cond_timedwait@@GLIBC_2.3.2 (pthread_cond_wait.c:643)
==3552159==    by 0x4A2F5D8: cnd_timedwait@@GLIBC_2.34 (cnd_timedwait.c:27)
==3552159==    by 0x7782B10: rd_kafka_q_serve_rkmessages (in /home/dev/.local/lib/python3.11/site-packages/confluent_kafka.libs/librdkafka-55260171.so.1)
==3552159==    by 0x7743C3B: rd_kafka_consume_batch_queue (in /home/dev/.local/lib/python3.11/site-packages/confluent_kafka.libs/librdkafka-55260171.so.1)
==3552159==    by 0x75A5BB6: Consumer_consume (Consumer.c:1056)
==3552159==    by 0x550BBD: method_vectorcall_VARARGS_KEYWORDS.lto_priv.0 (descrobject.c:364)
==3552159==    by 0x53AC2B: UnknownInlinedFun (pycore_call.h:92)
==3552159==    by 0x53AC2B: PyObject_Vectorcall (call.c:299)
==3552159==    by 0x52B93F: _PyEval_EvalFrameDefault (ceval.c:4772)
==3552159==    by 0x5855A3: UnknownInlinedFun (pycore_ceval.h:73)
==3552159==    by 0x5855A3: UnknownInlinedFun (ceval.c:6435)
==3552159==    by 0x5855A3: UnknownInlinedFun (call.c:393)
==3552159==    by 0x5855A3: _PyObject_VectorcallTstate.lto_priv.0 (pycore_call.h:92)
==3552159==    by 0x585173: method_vectorcall.lto_priv.0 (classobject.c:59)
==3552159==    by 0x56A380: UnknownInlinedFun (call.c:257)
==3552159==    by 0x56A380: UnknownInlinedFun (call.c:328)
==3552159==    by 0x56A380: PyObject_Call (call.c:355)
==3552159== 
==3552159== HEAP SUMMARY:
==3552159==     in use at exit: 29,630,515 bytes in 30,288 blocks
==3552159==   total heap usage: 220,309 allocs, 190,021 frees, 522,280,354 bytes allocated
==3552159== 
==3552143== 
==3552143== HEAP SUMMARY:
==3552143==     in use at exit: 29,243,757 bytes in 29,955 blocks
==3552143==   total heap usage: 218,912 allocs, 188,957 frees, 522,173,350 bytes allocated
==3552143== 
==3552143== LEAK SUMMARY:
==3552143==    definitely lost: 576 bytes in 18 blocks
==3552143==    indirectly lost: 0 bytes in 0 blocks
==3552143==      possibly lost: 371,032 bytes in 200 blocks
==3552143==    still reachable: 28,872,149 bytes in 29,737 blocks
==3552143==         suppressed: 0 bytes in 0 blocks
==3552143== Rerun with --leak-check=full to see details of leaked memory
==3552143== 
==3552143== For lists of detected and suppressed errors, rerun with: -s
==3552143== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 0 from 0)
==3552159== LEAK SUMMARY:
==3552159==    definitely lost: 608 bytes in 19 blocks
==3552159==    indirectly lost: 0 bytes in 0 blocks
==3552159==      possibly lost: 676,795 bytes in 245 blocks
==3552159==    still reachable: 28,953,112 bytes in 30,024 blocks
==3552159==         suppressed: 0 bytes in 0 blocks
==3552159== Rerun with --leak-check=full to see details of leaked memory
==3552159== 
==3552159== For lists of detected and suppressed errors, rerun with: -s
==3552159== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 0 from 0)

How to reproduce

I'm afraid I was not able to reproduce it with a simple script.

Clone https://gitlab.softwareheritage.org/swh/devel/swh-dataset.git, check out f8b5c8665ac802f8b9dd15feb11dad2c2eadaee3, then apply this patch (segfaults also happen without it, but it makes them more deterministic):

diff --git a/swh/dataset/journalprocessor.py b/swh/dataset/journalprocessor.py
index ab62cb1..87bdb32 100644
--- a/swh/dataset/journalprocessor.py
+++ b/swh/dataset/journalprocessor.py
@@ -24,6 +24,7 @@
     Set,
     Tuple,
 )
+import gc

 from confluent_kafka import Message, TopicPartition
 import tqdm
@@ -344,6 +345,7 @@ def progress_worker(self, queue=None) -> None:
         active_workers = self.processes
         offset_diff = sum((hi - lo) for lo, hi in self.offsets.values())
         desc = "  - Export"
+        gc.collect()
         with tqdm.tqdm(total=offset_diff, desc=desc, unit_scale=True) as pbar:
             while active_workers:
                 item = queue.get()
Installed Python packages
androguard==3.4.0a1
apksigcopier==1.1.1
appdirs==1.4.4
argcomplete==2.0.0
asn1crypto==1.5.1
asttokens==2.2.1
attrs==22.2.0
backcall==0.2.0
beautifulsoup4==4.11.2
beniget==0.4.1
binwalk==2.3.3
black==22.10.0
Brotli==1.0.9
cassandra-pylib==0.0.0
certifi==2022.9.24
cfgv==3.3.1
chardet==5.1.0
charset-normalizer==3.0.1
click==8.1.3
colorama==0.4.6
contourpy==1.0.7
cryptography==38.0.4
cssselect==1.2.0
cupshelpers==1.0
cycler==0.11.0
dbus-python==1.3.2
decorator==5.1.1
defusedxml==0.7.1
diffoscope==240
distlib==0.3.6
distro==1.8.0
dulwich==0.21.2
EditorConfig==0.12.3
elementpath==2.5.3
enjarify==1.0.3
executing==1.2.0
fastimport==0.9.14
filelock==3.9.0
fonttools==4.38.0
fs==2.4.16
gast==0.5.2
GDAL==3.6.2
gunicorn==20.1.0
html5lib==1.1
httpie==3.2.1
httplib2==0.20.4
identify==2.5.17
idna==3.3
importlib-metadata==4.12.0
iotop==0.6
ipython==8.5.0
jedi==0.18.2
jsbeautifier==1.14.4
jsondiff==2.0.0
kiwisolver==0.0.0
libarchive-c==2.9
libevdev==0.5
lxml==4.9.2
lz4==4.0.2+dfsg
markdown-it-py==2.1.0
matplotlib==3.6.3
matplotlib-inline==0.1.6
mdurl==0.1.2
mechanize==0.4.8
more-itertools==8.10.0
mpmath==0.0.0
msgpack==1.0.3
multidict==6.0.4
mypy-extensions==1.0.0
netifaces==0.11.0
networkx==2.8.8
nodeenv==0.13.4
numpy==1.24.2
olefile==0.46
oscrypto==1.3.0
packaging==23.0
parso==0.8.3
pathspec==0.11.1
pbr==5.10.0
pdfminer.six==20221105
pexpect==4.8.0
pickleshare==0.7.5
Pillow==9.4.0
platformdirs==2.6.0
ply==3.11
powerline-status==2.8.3
pre-commit==3.0.4
progressbar==2.5
prompt-toolkit==3.0.36
psutil==5.9.4
ptyprocess==0.7.0
pure-eval==0.0.0
pycairo==1.20.1
pycryptodomex==3.11.0
pycups==2.0.1
pycurl==7.45.2
pydot==1.4.2
Pygments==2.14.0
PyGObject==3.42.2
pygraphviz==1.7
PyOpenGL==3.1.6
pyparsing==3.0.9
pypdf==3.4.1
pyperclip==1.8.2
pyprof2calltree==1.4.5
PyQt5==5.15.9
PyQt5-sip==12.11.1
PyQt6==6.4.2
PyQt6-sip==13.4.1
pyqtgraph==0.13.1
PySimpleSOAP==1.16.2
pysmbc==1.0.23
PySocks==1.7.1
pyte==0.8.0
python-apt==2.6.0
python-dateutil==2.8.2
python-debian==0.1.49
python-debianbts==4.0.1
python-magic==0.4.26
pythran==0.11.0
pytz==2022.7.1
pyudev==0.24.0
pyxattr==0.8.1
PyYAML==6.0
reportbug==12.0.0
requests==2.28.1
requests-toolbelt==0.10.1
rich==13.3.1
rpm==4.18.0
rubber==1.6.0
scipy==1.10.1
scour==0.38.2
six==1.16.0
soupsieve==2.3.2
stack-data==0.6.2
stevedore==4.0.2
sympy==1.11.1
tlsh==0.2.0
tqdm==4.64.1
traitlets==5.5.0
ufoLib2==0.14.0
unicodedata2==15.0.0
urllib3==1.26.12
virtualenv==20.17.1+ds
virtualenv-clone==0.3.0
virtualenvwrapper==4.8.4
wcwidth==0.2.5
webencodings==0.5.1
xkcdpass==1.19.3
xmlschema==1.10.0
xmltodict==0.13.0
zipp==1.0.0
zstd==1.5.2.5

Checklist

Please provide the following information:

  • [x] confluent-kafka-python and librdkafka version (confluent_kafka.version() and confluent_kafka.libversion()): ('2.4.0', 33816576) and ('2.4.0', 33816831)

  • [ ] Apache Kafka broker version: none

  • [x] Client configuration: {'auto.offset.reset': 'earliest', 'bootstrap.servers': '127.0.0.1:45305', 'debug': 'all', 'enable.auto.commit': False, 'enable.partition.eof': True, 'error_cb': <function _error_cb at 0x7efc3959aa20>, 'group.id': 'test-consumer-gqizofzwretest_parallel_journal_processor', 'logger': <Logger swh.journal.client.rdkafka (DEBUG)>, 'message.max.bytes': '524288000', 'on_commit': <function _on_commit at 0x7efc3959bce0>}

  • [x] Operating system: Debian 12

  • [x] Provide client logs (with 'debug': '..' as necessary): crash_log.txt

  • [ ] Provide broker log excerpts

  • [ ] Critical issue

progval avatar Jun 04 '24 12:06 progval

Interested 🗿

KudoKhang avatar Jun 17 '24 03:06 KudoKhang

It appears something in librdkafka is using already-freed memory when trying to clean up a producer that is being deallocated. Likely there's a race, or a GIL lock is missing somewhere it's needed. I know this issue is a year old, but @progval, does this still occur on the most recent release version? Also, what Python version did you use?

MSeal avatar Jul 23 '25 22:07 MSeal

It still happens on confluent-kafka 2.11.0 with Python 3.11.

progval avatar Jul 24 '25 12:07 progval

Thanks, that helps. I've started some internal discussions on this topic. We've queued it for investigation, though I'm not sure exactly when we'll get it resolved.

MSeal avatar Jul 25 '25 22:07 MSeal