stat-core-merger stuck communicating with gdb
Platform: OLCF Summit
Versions: STAT from spack: spack install stat%gcc@10.2.0 cxxflags=--std=c++14
==> 1 installed package
-- linux-rhel8-power9le / gcc@10.2.0 ----------------------------
wn2frxd stat@4.1.0%gcc cxxflags="--std=c++14" ~dysect~examples~fgfs~gui
3ra646m [email protected]%gcc cxxflags="--std=c++14" +atomic+chrono~clanglibcpp~container~context~coroutine+date_time~debug+exception~fiber+filesystem+graph~icu+iostreams+locale+log+math~mpi+multithreaded~numpy~pic+program_options~python+random+regex+serialization+shared+signals~singlethreaded+system~taggedlayout+test+thread+timer~versionedlayout+wave cxxstd=98 patches=93f4aad8f88d1437e50d95a2d066390ef3753b99ef5de24f7a46bc083bd6df06 visibility=hidden
4ucshfz dyninst@10.1.0%gcc cxxflags="--std=c++14" ~ipo+openmp~stat_dysect~static build_type=RelWithDebInfo
bp7lk52 [email protected]%gcc cxxflags="--std=c++14" ~bzip2~debuginfod+nls~xz
5kqtqyt [email protected]%gcc cxxflags="--std=c++14" ~ipo+shared+tm build_type=RelWithDebInfo cxxstd=default patches=62ba015ebd1819c45bef47411540b789b493e31ca668c4ff4cb2afcbc306b476,ce1fb16fb932ce86a82ca87cf0431d1a8c83652af9f552b264213b2ff2945d73,d62cb666de4010998c339cde6f41c7623a07e9fc69e498f2e149821c0c2c6dd0
qizwje7 [email protected]%gcc cxxflags="--std=c++14" +pic
7lrjx2k graphlib@3.0.0%gcc cxxflags="--std=c++14" ~ipo build_type=RelWithDebInfo
j56c46j graphviz@2.49.0%gcc cxxflags="--std=c++14" ~doc~expat~ghostscript~gtkplus~gts~java~libgd~pangocairo~poppler~qt~quartz~x
7zttv3a [email protected]%gcc cxxflags="--std=c++14" +optimize+pic+shared
42awyk6 launchmon@master%gcc cxxflags="--std=c++14"
ehifwhj libgcrypt@1.9.3%gcc cxxflags="--std=c++14"
nfkm5sn libgpg-error@1.42%gcc cxxflags="--std=c++14"
xkkejlv mrnet@5.0.1-3%gcc cxxflags="--std=c++14" ~lwthreads
cc2ohrr [email protected]%gcc cxxflags="--std=c++14" +bz2+ctypes+dbm~debug+libxml2+lzma+nis~optimizations+pic+pyexpat+pythoncmd+readline+shared+sqlite3+ssl~tix~tkinter~ucs4+uuid+zlib
p4xaimr swig@4.0.2%gcc cxxflags="--std=c++14"
kjydyg7 pcre@8.44%gcc cxxflags="--std=c++14" ~jit+multibyte+utf
I was trying to collect/compare backtraces for ten core files with a command like this:
stat-core-merger -x =bedrock -F stdout -c /gpfs/alpine/csc332/scratch/${USER}/quintain-cores/
After fixing up Python's string/bytes challenges (maybe I goofed that!), the command hangs. Running with -L debug shows me
115 core_file_merger:589 VERBOSE (MainThread) Processing started at 2022-02-17 09:43:54.919282
merging 10 trace files 000%115 core_file_merger:352 INFO (MainThread) Connecting gdb to the core file (/gpfs/alpine/csc332/scratch/robl/quintain-cores//core.2)
1226 core_file_merger:379 DEBUG (MainThread) Checking for gdb errors
1601 core_file_merger:427 DEBUG (MainThread) Find a value for the current rank
When I check with ps I see STAT is trying to do this:
gdb -ex set pagination 0 -ex cd /autofs/nccs-svm1_home1/robl/src/mochi-quintain/tests -ex path /autofs/nccs-svm1_home1/robl/src/mochi-quintain/tests -ex directory /autofs/nccs-svm1_home1/robl/src/mochi-quintain/tests -ex set filename-display absolute --core=/gpfs/alpine/csc332/scratch/robl/quintain-cores//core.2 /autofs/nccs-svm1_home1/robl/src/spack/opt/spack/linux-rhel8-power9le/gcc-11.1.0/mochi-bedrock-main-ibxscgvcko74xoyb6sv4lphuiv3deryo/bin/bedrock
and when I run that command myself, gdb suggests it did not process the command line arguments as expected:
% gdb -ex set pagination 0 -ex cd /autofs/nccs-svm1_home1/robl/src/mochi-quintain/tests -ex path /autofs/nccs-svm1_home1/robl/src/mochi-quintain/tests -ex directory /autofs/nccs-svm1_home1/robl/src/mochi-quintain/tests -ex set filename-display absolute --core=/gpfs/alpine/csc332/scratch/robl/quintain-cores//core.2 /autofs/nccs-svm1_home1/robl/src/spack/opt/spack/linux-rhel8-power9le/gcc-11.1.0/mochi-bedrock-main-ibxscgvcko74xoyb6sv4lphuiv3deryo/bin/bedrock
Excess command line arguments ignored. (0 ...)
GNU gdb (GDB) 10.2
Copyright (C) 2021 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "powerpc64le-unknown-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<https://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.
For help, type "help".
Type "apropos word" to search for commands related to "word"...
pagination: No such file or directory.
[New LWP 150624]
Core was generated by `bedrock '.
Program terminated with signal SIGINT, Interrupt.
#0 0x0000200000b76118 in ?? ()
Argument required (expression to compute).
Working directory /ccs/home/robl
(canonically /autofs/nccs-svm1_home1/robl).
Executable and object file path: /sw/summit/xalt/1.2.1/bin:/sw/sources/lsf-tools/2.0/summit/bin:/sw/summit/spack-envs/base/opt/linux-rhel8-ppc64le/gcc-8.3.1/gdb-10.2-zl2qphcj4naoqsp6thilh4w5kkcf7n2u/bin:/autofs/nccs-svm1_home1/robl/src/spack/opt/spack/linux-rhel8-power9le/gcc-10.2.0/stat-4.1.0-wn2frxd57sysvqvapa65yd5sqflvi3sr/bin:/autofs/nccs-svm1_home1/robl/src/spack/opt/spack/linux-rhel8-power9le/gcc-10.2.0/swig-4.0.2-p4xaimrohrzqshwsefj7heh6f3df7bya/bin:/autofs/nccs-svm1_home1/robl/src/spack/opt/spack/linux-rhel8-power9le/gcc-10.2.0/pcre-8.44-kjydyg7oxoimrh47ooejkj2jtv3uke3f/bin:/autofs/nccs-svm1_home1/robl/src/spack/opt/spack/linux-rhel8-power9le/gcc-10.2.0/mrnet-5.0.1-3-xkkejlv2lt7xcsb65ga4thqntzrmoz3b/bin:/autofs/nccs-svm1_home1/robl/src/spack/opt/spack/linux-rhel8-power9le/gcc-10.2.0/launchmon-master-42awyk6qtdhwgsen7k3bqldrdzc2es2o/bin:/autofs/nccs-svm1_home1/robl/src/spack/opt/spack/linux-rhel8-power9le/gcc-10.2.0/libgcrypt-1.9.3-ehifwhjdwrb7tmapmkylstbqvp47gu62/bin:/autofs/nccs-svm1_home1/robl/src/spack/opt/spack/linux-rhel8-power9le/gcc-10.2.0/libgpg-error-1.42-nfkm5snffx46qwffiwfngffnwsql2y6u/bin:/autofs/nccs-svm1_home1/robl/src/spack/opt/spack/linux-rhel8-power9le/gcc-10.2.0/graphviz-2.49.0-j56c46j34im324olozfvvcmoslfphibq/bin:/autofs/nccs-svm1_home1/robl/src/spack/opt/spack/linux-rhel8-power9le/gcc-10.2.0/graphlib-3.0.0-7lrjx2kdz5rg4e5g6t33gkzko7wfbm7n/bin:/autofs/nccs-svm1_home1/robl/src/spack/opt/spack/linux-rhel8-power9le/gcc-10.2.0/dyninst-10.1.0-4ucshfzv5b574jurzctlbt7w3qxmgf2i/bin:/sw/summit/gcc/10.2.0-2/bin:/autofs/nccs-svm1_home1/robl/src/spack/opt/spack/linux-rhel8-power9le/gcc-11.1.0/mochi-quintain-main-nkuuhxcrvm3irrqrxctkfysukzyb2xue/bin:/autofs/nccs-svm1_home1/robl/src/spack/opt/spack/linux-rhel8-power9le/gcc-11.1.0/mochi-bedrock-main-ibxscgvcko74xoyb6sv4lphuiv3deryo/bin:/sw/summit/spack-envs/base/opt/linux-rhel8-ppc64le/gcc-11.1.0/spectrum-mpi-10.4.0.3-20210112-6kg6anupjriji6pnvijebfn7ha5vsqp2/bin:/autofs/nccs-svm1_home1/robl/src/spack/op
t/spack/linux-rhel8-power9le/gcc-11.1.0/mochi-margo-main-bt67pbipf3q56ijgm2ij7nzjnlbvhruo/bin:/autofs/nccs-svm1_home1/robl/src/spack/opt/spack/linux-rhel8-power9le/gcc-11.1.0/libfabric-1.13.2-hsk4mn4hjtnv7bnfptpzwhno4kjsqhvw/bin:/autofs/nccs-svm1_home1/robl/src/spack/opt/spack/linux-rhel8-power9le/gcc-11.1.0/mochi-abt-io-0.5.1-ir7rmxlx4ebamktb7xtwo5iqyyzuum4d/bin:/sw/sources/hpss/bin:/autofs/nccs-svm1_home1/robl/src/spack/bin:/opt/ibm/csm/bin:/opt/ibm/spectrumcomputing/lsf/10.1.0.11/linux3.10-glibc2.17-ppc64le-csm/etc:/opt/ibm/spectrumcomputing/lsf/10.1.0.11/linux3.10-glibc2.17-ppc64le-csm/bin:/usr/local/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/opt/ibm/flightlog/bin:/opt/ibm/jsm/bin:/sw/sources/cgroup_tool/bin:/opt/puppetlabs/bin:/usr/lpp/mmfs/bin
Reinitialize source path to empty? (y or n)
In particular, note "pagination: No such file or directory" and "Excess command line arguments ignored".
If I re-run that command with all the -ex arguments quoted, gdb will give me the (gdb) prompt that the python script expects
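For reference, a minimal sketch of keeping each -ex command as a single argument. It assumes gdb is launched from Python with an argument list; build_gdb_argv and the /path/to/... values are hypothetical placeholders, not STAT's actual code. Note that ps prints arguments separated by spaces no matter how they were passed, so a line copied from ps always needs quoting before it is pasted into a shell.

```python
import shlex


def build_gdb_argv(exe, core, source_dir):
    """Return the gdb command as a list with one element per argument.

    Hypothetical helper: the commands mirror the ones visible in the ps
    output, but this is not the code from core_file_merger.py.
    """
    commands = [
        "set pagination 0",
        "cd " + source_dir,
        "path " + source_dir,
        "directory " + source_dir,
        "set filename-display absolute",
    ]
    argv = ["gdb"]
    for command in commands:
        # "-ex" and its command are two separate list elements, so the
        # spaces inside the command never get split into extra arguments.
        argv += ["-ex", command]
    argv += ["--core=" + core, exe]
    return argv


# Placeholder paths, for illustration only.
argv = build_gdb_argv("/path/to/bedrock", "/path/to/core.2", "/path/to/tests")

# Shell-safe rendering of the same command, suitable for copy and paste.
shell_line = " ".join(shlex.quote(arg) for arg in argv)
print(shell_line)
```

Passing argv as a list to subprocess.Popen (without shell=True) avoids the quoting question entirely; the quoted string is only needed when a person retypes the command in a shell.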
Hacking up scripts/core_file_merger.py to add those quotes gave me the command line I expected; however, it still hangs at "Find a value for the current rank".
When I ctrl-c the process, the python backtrace tells me it's stuck in info threads:
Traceback (most recent call last):
File "/autofs/nccs-svm1_home1/robl/src/spack/opt/spack/linux-rhel8-power9le/gcc-10.2.0/stat-4.1.0-wn2frxd57sysvqvapa65yd5sqflvi3sr/lib/python3.6/site-packages/STATmain.py", line 134, in <module>
STATmerge_main(sys.argv[1:])
File "/autofs/nccs-svm1_home1/robl/src/spack/opt/spack/linux-rhel8-power9le/gcc-10.2.0/stat-4.1.0-wn2frxd57sysvqvapa65yd5sqflvi3sr/lib/python3.6/site-packages/core_file_merger.py", line 655, in STATmerge_main
ret = merger.run()
File "/autofs/nccs-svm1_home1/robl/src/spack/opt/spack/linux-rhel8-power9le/gcc-10.2.0/stat-4.1.0-wn2frxd57sysvqvapa65yd5sqflvi3sr/lib/python3.6/site-packages/stat_merge_base.py", line 314, in run
trace_object = self.trace_type(filename, self.options)
File "/autofs/nccs-svm1_home1/robl/src/spack/opt/spack/linux-rhel8-power9le/gcc-10.2.0/stat-4.1.0-wn2frxd57sysvqvapa65yd5sqflvi3sr/lib/python3.6/site-packages/stat_merge_base.py", line 49, in __init__
self.traces = self.get_traces()
File "/autofs/nccs-svm1_home1/robl/src/spack/opt/spack/linux-rhel8-power9le/gcc-10.2.0/stat-4.1.0-wn2frxd57sysvqvapa65yd5sqflvi3sr/lib/python3.6/site-packages/core_file_merger.py", line 535, in get_traces
core_file.process_core()
File "/autofs/nccs-svm1_home1/robl/src/spack/opt/spack/linux-rhel8-power9le/gcc-10.2.0/stat-4.1.0-wn2frxd57sysvqvapa65yd5sqflvi3sr/lib/python3.6/site-packages/core_file_merger.py", line 428, in process_core
rank_value = self.get_function_value(gdb, 'MPI_Comm_rank', 1)
File "/autofs/nccs-svm1_home1/robl/src/spack/opt/spack/linux-rhel8-power9le/gcc-10.2.0/stat-4.1.0-wn2frxd57sysvqvapa65yd5sqflvi3sr/lib/python3.6/site-packages/core_file_merger.py", line 216, in get_function_value
lines = gdb.communicate("info threads")
File "/autofs/nccs-svm1_home1/robl/src/spack/opt/spack/linux-rhel8-power9le/gcc-10.2.0/stat-4.1.0-wn2frxd57sysvqvapa65yd5sqflvi3sr/lib/python3.6/site-packages/core_file_merger.py", line 147, in communicate
return self.readlines()
File "/autofs/nccs-svm1_home1/robl/src/spack/opt/spack/linux-rhel8-power9le/gcc-10.2.0/stat-4.1.0-wn2frxd57sysvqvapa65yd5sqflvi3sr/lib/python3.6/site-packages/core_file_merger.py", line 128, in readlines
ch = self.subprocess.stdout.read(1).decode('utf-8')
Any suggestions for next steps? Thanks
I added additional logging to see what GDB is telling us. It is stuck here, at a line that may or may not be needed on ppc64le (which, as it happens, is the platform I'm on):
https://github.com/LLNL/STAT/blob/develop/scripts/core_file_merger.py#L409
I deleted that extra read but still had hangs with python3.
In the end I fell back to python-2.7 and now it's working (with that extra ppc64 readline deleted)
For the gdb hang, you may need to comment out these 3 lines:
if CoreFile.__options['cuda'] != 1:
lines2 = gdb.readlines()
lines += lines2
I don't recall the exact history: at some point we found these lines were necessary, but that no longer appears to be the case.
I just committed changes to the develop branch to comment out those lines.
A note for me to look one day at doing the gdb communication the other way around: instead of python reading gdb, have gdb execute a python script (https://sourceware.org/gdb/onlinedocs/gdb/Python-API.html)
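A rough sketch of what that inverted arrangement might look like, assuming the script is run by gdb itself. The function name backtraces_by_thread and the idea of passing the gdb module in as a parameter are illustrative choices, not anything STAT does today; the gdb calls used (selected_inferior, threads, switch, execute with to_string=True) are from the GDB Python API.

```python
def backtraces_by_thread(gdb_module):
    """Collect the backtrace of every thread as a list of text lines.

    gdb_module is the gdb module that gdb provides to embedded scripts;
    it is a parameter here so the function can be exercised outside gdb.
    """
    traces = {}
    for thread in gdb_module.selected_inferior().threads():
        thread.switch()  # make this thread the selected one
        text = gdb_module.execute("bt", to_string=True)
        traces[thread.num] = text.splitlines()
    return traces


try:
    import gdb  # only importable inside gdb's embedded Python interpreter
except ImportError:
    gdb = None

if gdb is not None:
    # Running under gdb: print one block per thread.
    for num, frames in sorted(backtraces_by_thread(gdb).items()):
        print("thread %d" % num)
        for frame in frames:
            print("  " + frame)
```

Under these assumptions the script could be saved under any name (dump_traces.py, say) and run with something like gdb -batch -x dump_traces.py --core=core.2 /path/to/bedrock. There is then no prompt to wait for and no pipe to keep in sync, since gdb hands back the command output as a string.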
@roblatham00 Good news, I think I figured out the source of the hang. I was able to reproduce stat-core-merger hangs on one of our CORAL systems and managed to fix it by flushing the input buffer to the gdb process during communicate(). The change is in develop in this commit https://github.com/LLNL/STAT/commit/19858dc096a9eae8461fa5811283e1554dc2cc58. Also, I was able to install the develop branch with this commit on our CORAL system using the gcc 8.3.1 compiler. Can you try this out and let me know if this resolves your issue?
Note that if you still have your previous STAT installation, you could just modify your installed core_file_merger.py file and add the flush after the stdin.write().
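A sketch of where the flush goes, using a small stand-in child process instead of gdb so that it runs anywhere. PromptSession and the echoing child are hypothetical; only the write-then-flush pattern mirrors the fix described above. In Python 3, Popen pipes are buffered by default, so without the flush the command can sit in the parent's buffer while read(1) waits for a reply that never comes. Python 2's Popen was unbuffered by default, which would be consistent with 2.7 working, though that is an inference rather than something confirmed in this thread.

```python
import subprocess
import sys

# Stand-in for gdb: echoes each command, then prints a gdb-style prompt.
CHILD = r'''
import sys
for line in sys.stdin:
    sys.stdout.write("echo: " + line)
    sys.stdout.write("(gdb) ")
    sys.stdout.flush()
'''


class PromptSession:
    """Hypothetical wrapper around a prompt-driven child process."""

    PROMPT = "(gdb) "

    def __init__(self):
        self.proc = subprocess.Popen(
            [sys.executable, "-c", CHILD],
            stdin=subprocess.PIPE,
            stdout=subprocess.PIPE,
        )

    def communicate(self, command):
        self.proc.stdin.write((command + "\n").encode("utf-8"))
        # Push the command out of Python's buffer and into the pipe.
        self.proc.stdin.flush()
        return self._read_until_prompt()

    def _read_until_prompt(self):
        buf = ""
        while not buf.endswith(self.PROMPT):
            ch = self.proc.stdout.read(1)
            if not ch:  # EOF: the child exited
                break
            buf += ch.decode("utf-8")
        if buf.endswith(self.PROMPT):
            buf = buf[:-len(self.PROMPT)]
        return buf.splitlines()

    def close(self):
        self.proc.stdin.close()  # child sees EOF and exits
        self.proc.wait()
        self.proc.stdout.close()


session = PromptSession()
print(session.communicate("info threads"))
session.close()
```

Removing the flush() call from communicate() is a quick way to check whether the same kind of hang reproduces with this stand-in on a given Python version.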