nut-ipmi driver segfault on PowerEdge R720xd
When I first ran apt install nut-ipmi, my server shut down suddenly. That wasn't great. After powering it back up, I cannot get the nut-ipmipsu driver to run. It dies with a segfault (see log below).
Any ideas on what I could try to fix this?
Best regards,
Victor
racadm getsysinfo
System Model = PowerEdge R720xd
System Revision = I
System BIOS Version = 2.9.0
OS Name = Debian GNU/Linux 13 (trixie)
OS Version = 13 (trixie) Kernel 6.14.11-4-pve (x86_64)
/etc/ups.conf
[nutdev2]
driver = "nut-ipmipsu"
port = "id1"
journalctl -f
Nov 21 09:55:52 myhost systemd[1]: nut-driver@nutdev2.service: Control process exited, code=exited, status=1/FAILURE
Nov 21 09:55:52 myhost systemd[1]: nut-driver@nutdev2.service: Failed with result 'exit-code'.
Nov 21 09:55:52 myhost systemd[1]: Failed to start nut-driver@nutdev2.service - Network UPS Tools - device driver for NUT device 'nutdev2'.
Nov 21 09:56:07 myhost systemd[1]: nut-driver@nutdev2.service: Scheduled restart job, restart counter is at 81.
Nov 21 09:56:07 myhost systemd[1]: Starting nut-driver@nutdev2.service - Network UPS Tools - device driver for NUT device 'nutdev2'...
Nov 21 09:56:08 myhost kernel: nut-ipmipsu[24298]: segfault at 5ce5d4b5ecb9 ip 00007d41477382b3 sp 00007ffd8b1fe610 error 4 in libfreeipmi.so.17.2.13[1e82b3,7d41476b6000+84000] likely on CPU 27 (core 1, socket 1)
Nov 21 09:56:08 myhost kernel: Code: 84 00 00 00 00 00 90 48 89 df 48 8b 5b 18 e8 84 03 f8 ff 48 85 db 75 ef 49 8b 1c 24 48 85 db 74 2f 66 0f 1f 44 00 00 48 89 dd <48> 8b 5b 08 48 8b 7d 00 48 85 ff 74 0c 49 8b 44 24 18 48 85 c0 74
Nov 21 09:56:08 myhost nut-driver@nutdev2[24271]: Driver exited abnormally
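A side note on reading the kernel line above: subtracting the mapping base (the address before the `+` inside the brackets) from the `ip` value gives the offset of the faulting instruction within the mapped part of the library. A minimal sketch of that arithmetic; the helper name `fault_offset` is made up for illustration:

```shell
# fault_offset is a hypothetical helper name, not part of NUT or FreeIPMI.
# $1 = instruction pointer, $2 = mapping base; both hex, without the 0x prefix.
fault_offset() {
    printf '0x%x\n' "$(( 0x$1 - 0x$2 ))"
}

# Values copied from the "segfault at ..." line in the journal above.
fault_offset 00007d41477382b3 7d41476b6000
```

The result is smaller than the mapping size after the `+` (0x84000 here), which is consistent with the fault being inside libfreeipmi rather than in the driver binary.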
Odd, it seems like the driver was built against one version of the library, and finds another at run time?..
Can you please double-check with a custom build of NUT (you can run the driver from build area to test), so it is certain which lib gets linked?
For details see wiki https://github.com/networkupstools/nut/wiki/Building-NUT-for-in%E2%80%90place-upgrades-or-non%E2%80%90disruptive-tests
Also, try starting with higher debug verbosity (on the command line, or with debug_min = 6 in ups.conf), see https://github.com/networkupstools/nut/wiki/Changing-NUT-daemon-debug-verbosity
Maybe the library linking is OK, but the driver does pass some garbage to it - and a trace would help find when and what.
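One quick way to see which libfreeipmi the loader would pick for a given driver binary is to filter the output of `ldd`. This is only a sketch: `resolved_lib` is a hypothetical helper name, and it assumes the usual glibc `ldd` line format of `name => path (address)`:

```shell
# resolved_lib is a hypothetical helper: it reads ldd output on stdin and
# prints the resolved path of every library whose name matches $1.
resolved_lib() {
    awk -v pat="$1" '$1 ~ pat && $2 == "=>" { print $3 }'
}

# Usage (not run here), for the packaged and the in-tree driver:
#   ldd /usr/lib/nut/nut-ipmipsu | resolved_lib freeipmi
#   ldd ./drivers/nut-ipmipsu    | resolved_lib freeipmi
```

If both binaries resolve to the same file, a build-time versus run-time library mismatch becomes less likely, though it does not rule out an ABI change within the same soname.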
Does this help?
/usr/lib/nut/nut-ipmipsu -s id1 -d1 -x port=id1 -DDDDDD
Network UPS Tools - IPMI PSU driver 0.32 (2.8.1)
Warning: This is an experimental driver.
Some features may not function correctly.
0.000000 [D1] Network UPS Tools version 2.8.1 (release/snapshot of 2.8.1) built with gcc (Debian 14.2.0-19) 14.2.0 and configured with flags: --build=x86_64-linux-gnu --prefix=/usr --includedir=${prefix}/include --mandir=${prefix}/share/man --infodir=${prefix}/share/info --sysconfdir=/etc --localstatedir=/var --disable-option-checking --disable-silent-rules --libdir=${prefix}/lib/x86_64-linux-gnu --runstatedir=/run --disable-maintainer-mode --disable-dependency-tracking --prefix=/usr --sysconfdir=/etc/nut --includedir=/usr/include --mandir=/usr/share/man --libdir=${prefix}/lib/x86_64-linux-gnu --libexecdir=/usr/libexec --with-ssl --with-nss --with-cgi --with-dev --enable-static --with-statepath=/run/nut --with-altpidpath=/run/nut --with-drvpath=/usr/lib/nut --with-cgipath=/usr/lib/cgi-bin/nut --with-htmlpath=/usr/share/nut/www --with-pidpath=/run/nut --datadir=/usr/share/nut --with-pkgconfig-dir=/usr/lib/x86_64-linux-gnu/pkgconfig --with-user=nut --with-group=nut --with-udev-dir=/usr/lib/udev --with-systemdsystemunitdir=/usr/lib/systemd/system --with-systemdshutdowndir=/usr/lib/systemd/system-shutdown --with-systemdtmpfilesdir=/usr/lib/tmpfiles.d --with-python=python3 --with-python3=/usr/bin/python3 --with-doc=man
0.000078 [D1] debug level is '6'
0.000100 [D5] send_to_all: SETINFO driver.debug "6"
0.000115 [D5] send_to_all: SETFLAGS driver.debug RW NUMBER
0.002413 [D1] Succeeded to become_user(nut): now UID=106 GID=106
0.002447 [D5] send_to_all: SETINFO device.type "ups"
0.002466 [D5] send_to_all: SETINFO driver.state "init.device"
0.002478 [D1] upsdrv_initups...
0.002492 [D2] Device ID 0x1
0.002692 [D1] nut-libfreeipmi: nutipmi_open()...
0.002857 [D1] FreeIPMI initialized...
0.015407 [D1] entering libfreeipmi_get_board_info()
0.015547 [D5] FRU Board Language: English
0.015570 [D2] FRU Board Manufacturing Date/Time: 10/21/14 - 21:48:00
0.020353 [D1] entering libfreeipmi_get_psu_info()
0.020452 [D1] libfreeipmi_get_psu_info() retrieved successfully
0.023717 [D3] Found 150 records in SDR cache
0.023762 [D5] Checking record 0 (/150)
0.023771 [D1] =======> not device locator (2)!!
0.023781 [D5] Checking record 1 (/150)
0.023787 [D1] =======> not device locator (2)!!
0.023796 [D5] Checking record 2 (/150)
0.023806 [D1] =======> not device locator (18)!!
0.023818 [D5] Checking record 3 (/150)
0.023849 [D2] Checking device 1/0
0.023863 [D5] Checking record 4 (/150)
0.023881 [D2] Checking device 0/176
0.023893 [D5] Checking record 5 (/150)
0.023910 [D2] Checking device 0/176
0.023922 [D5] Checking record 6 (/150)
0.023939 [D2] Checking device 1/1
0.023957 [D1] Found device id 1
0.023970 [D5] Checking record 1 (/150)
0.024036 [D5] Checking record 2 (/150)
0.024063 [D5] Checking record 3 (/150)
...
0.025368 [D5] Checking record 63 (/150)
0.025388 [D1] Found record id = 63 for device id 1
0.025401 [D5] Checking record 64 (/150)
...
0.025713 [D5] Checking record 78 (/150)
0.025739 [D1] Found record id = 78 for device id 1
0.025752 [D5] Checking record 79 (/150)
0.025778 [D5] Checking record 80 (/150)
0.025802 [D1] Found record id = 80 for device id 1
0.025815 [D5] Checking record 81 (/150)
0.025841 [D5] Checking record 82 (/150)
0.025864 [D5] Checking record 83 (/150)
0.025884 [D1] Found record id = 83 for device id 1
0.025902 [D5] Checking record 84 (/150)
...
0.027383 [D5] Checking record 149 (/150)
0.027395 [D5] Checking record 150 (/150)
Segmentation fault
I get exactly the same after running the nut-ipmipsu made with ./ci_build.sh inplace.
Thanks for checking. By logged strings, it seems like this loops here: https://github.com/networkupstools/nut/blob/fcdff401170426f1b00e2aedb3657f7e0f08a134/drivers/nut-libfreeipmi.c#L687-L707
and "Found record id = %u for device id %i" pops up below in the method (but above cleanup label) at https://github.com/networkupstools/nut/blob/fcdff401170426f1b00e2aedb3657f7e0f08a134/drivers/nut-libfreeipmi.c#L797
Can you try running this in a debugger - maybe stepping through the driver with an IDE, or at least getting a backtrace of the fault with GDB? IIRC, something like:
:; LD_LIBRARY_PATH=`pwd`/clients/.libs:$LD_LIBRARY_PATH gdb --args ./drivers/nut-ipmipsu -s id1 -d1 -x port=id1 -DDDDDD
> run
(fail)
> bt
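The same session can also be run non-interactively, which makes it easier to paste the result into a comment. A sketch using gdb's `-batch` and `-ex` options; the wrapper name `backtrace_of` and the `GDB` override variable are assumptions made for illustration:

```shell
# backtrace_of is a hypothetical wrapper: run a program under gdb in batch
# mode, then print a backtrace once it stops (for example on SIGSEGV).
# GDB is an assumed override variable so another debugger binary can be used.
backtrace_of() {
    "${GDB:-gdb}" -batch -ex run -ex bt --args "$@"
}

# Usage from the NUT build area (not run here):
#   LD_LIBRARY_PATH=`pwd`/clients/.libs:$LD_LIBRARY_PATH \
#       backtrace_of ./drivers/nut-ipmipsu -s id1 -d1 -x port=id1 -DDDDDD
```

Without debug symbols for libfreeipmi, the frames inside the library will likely still show up as `??`.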
0.034854 [D5] Checking record 148 (/150)
0.034868 [D5] Checking record 149 (/150)
0.034882 [D5] Checking record 150 (/150)
Program received signal SIGSEGV, Segmentation fault.
0x00007ffff7b4b2b3 in ?? () from /lib/x86_64-linux-gnu/libfreeipmi.so.17
(gdb) bt
#0 0x00007ffff7b4b2b3 in ?? () from /lib/x86_64-linux-gnu/libfreeipmi.so.17
#1 0x00007ffff7b1b43e in ipmi_sdr_ctx_destroy () from /lib/x86_64-linux-gnu/libfreeipmi.so.17
#2 0x000055555555adc2 in libfreeipmi_cleanup () at nut-libfreeipmi.c:342
#3 0x000055555555b824 in nut_ipmi_open (ipmi_id=<optimized out>, ipmi_dev=ipmi_dev@entry=0x55555557f780 <ipmi_dev>)
at nut-libfreeipmi.c:275
#4 0x000055555555ac98 in upsdrv_initups () at nut-ipmipsu.c:269
#5 0x00005555555598d3 in main (argc=<optimized out>, argv=<optimized out>) at main.c:2902
(thanks for your efforts, btw!)
Hard to say, googling brings up almost nothing about the method.
There are sources at e.g. https://packages.debian.org/sid/libfreeipmi-dev (tarball at least) to peruse... Maybe you can build a copy of the library (with debug symbols and all), and rebuild NUT against it (or maybe LD_LIBRARY_PATH would suffice to use your build), and it would expose what happens. SIGSEGV is usually about either NULL dereference, or using an already freed memory area.
I think the problem is in the library (the crash is a couple of calls deep into it, so its input checking is lacking even if we are at fault somehow), but I wonder whether nut-scanner misbehaves the same way or works okay?
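If a debug build of the library ends up in its own prefix, the driver from the NUT build area can probably be pointed at it without rebuilding NUT, as long as the soname matches. A rough sketch; `with_lib_dir` is a hypothetical helper and `/usr/local/lib` is an assumed install location:

```shell
# with_lib_dir is a hypothetical helper: run a command with directory $1
# prepended to LD_LIBRARY_PATH, leaving the calling shell's value untouched.
with_lib_dir() {
    _dir=$1
    shift
    LD_LIBRARY_PATH="$_dir${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}" "$@"
}

# Usage (not run here); /usr/local/lib is an assumption about where the
# debug build of libfreeipmi was installed:
#   with_lib_dir /usr/local/lib ./drivers/nut-ipmipsu -s id1 -d1 -x port=id1 -DDDDDD
```

Running `ldd` on the driver the same way shows whether the private copy is actually the one being picked up.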
Sorry, I was occupied elsewhere for the last few weeks.
I am having no luck building a copy of libfreeipmi-dev. I keep running into all kinds of dependency hell problems.
mkdir ~/freeipmi-build && cd ~/freeipmi-build
apt source freeipmi
cd freeipmi-1.6.15/
./configure --prefix=/usr/local --enable-debug
...<SNIP>...
checking whether make sets $(MAKE)... (cached) yes
checking whether ln -s works... yes
checking whether it is safe to define __EXTENSIONS__... yes
checking whether _XOPEN_SOURCE should be defined... no
./configure: line 13798: PKG_PROG_PKG_CONFIG: command not found
checking for encryption support... yes
./configure: line 13870: syntax error near unexpected token `GCRYPT,'
./configure: line 13870: ` PKG_CHECK_MODULES(GCRYPT, libgcrypt, have_gcrypt=yes,'
apt install pkg-config libgcrypt-dev
Note, selecting 'libgcrypt20-dev' instead of 'libgcrypt-dev'
pkg-config is already the newest version (1.8.1-4).
libgcrypt20-dev is already the newest version (1.11.0-7).
😢
I have asked for help here: https://savannah.gnu.org/bugs/index.php?67810
PKG_PROG_PKG_CONFIG: command not found
It seems you are missing m4 macros, so check the dev packages for pkg-config, and maybe libtool, autoconf, automake, etc.
As a first shot, install the build dependencies for NUT, that might suffice. See docs/config-prereqs.txt
I already have those installed, I've previously built nut locally with ci_build.sh. I followed the prerequisites/debian part of the manual.
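The `PKG_PROG_PKG_CONFIG: command not found` message suggests the `configure` script itself still contains macro names that were never expanded, so installing packages afterwards would not change it until the script is regenerated. A small sketch to confirm that; `unexpanded_macros` is a hypothetical helper name, and it only looks for the two macros seen in the errors above:

```shell
# unexpanded_macros is a hypothetical helper: list lines of a generated
# configure script ($1) that still start with a raw pkg-config m4 macro name.
unexpanded_macros() {
    grep -nE '^[[:space:]]*PKG_(PROG_PKG_CONFIG|CHECK_MODULES)([[:space:](]|$)' "$1"
}

# Usage (not run here), inside the unpacked source tree:
#   unexpanded_macros ./configure
# If anything is printed, regenerating the script may help (an assumption,
# untested on this source package):
#   autoreconf -fi && ./configure --prefix=/usr/local --enable-debug
```

Alternatively, `apt build-dep freeipmi` should pull in whatever the Debian package build itself needs, assuming deb-src entries are enabled in the apt sources.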