nut-ipmi driver segfault on PowerEdge R720xd

Open victorclaessen opened this issue 2 months ago • 13 comments

When I first ran apt install nut-ipmi, my server shut down suddenly. That wasn't great. After powering it back up, I cannot get the nut-ipmipsu driver to run. It dies with a segfault (see log below).

Any ideas on what I could try to fix this?

Best regards,

Victor

racadm getsysinfo

System Model            = PowerEdge R720xd
System Revision         = I
System BIOS Version     = 2.9.0
OS Name                 = Debian GNU/Linux 13 (trixie)
OS Version              = 13 (trixie) Kernel 6.14.11-4-pve (x86_64)

/etc/ups.conf

[nutdev2]
        driver = "nut-ipmipsu"
        port = "id1"

journalctl -f

Nov 21 09:55:52 myhost systemd[1]: nut-driver@nutdev2.service: Control process exited, code=exited, status=1/FAILURE
Nov 21 09:55:52 myhost systemd[1]: nut-driver@nutdev2.service: Failed with result 'exit-code'.
Nov 21 09:55:52 myhost systemd[1]: Failed to start nut-driver@nutdev2.service - Network UPS Tools - device driver for NUT device 'nutdev2'.
Nov 21 09:56:07 myhost systemd[1]: nut-driver@nutdev2.service: Scheduled restart job, restart counter is at 81.
Nov 21 09:56:07 myhost systemd[1]: Starting nut-driver@nutdev2.service - Network UPS Tools - device driver for NUT device 'nutdev2'...
Nov 21 09:56:08 myhost kernel: nut-ipmipsu[24298]: segfault at 5ce5d4b5ecb9 ip 00007d41477382b3 sp 00007ffd8b1fe610 error 4 in libfreeipmi.so.17.2.13[1e82b3,7d41476b6000+84000] likely on CPU 27 (core 1, socket 1)
Nov 21 09:56:08 myhost kernel: Code: 84 00 00 00 00 00 90 48 89 df 48 8b 5b 18 e8 84 03 f8 ff 48 85 db 75 ef 49 8b 1c 24 48 85 db 74 2f 66 0f 1f 44 00 00 48 89 dd <48> 8b 5b 08 48 8b 7d 00 48 85 ff 74 0c 49 8b 44 24 18 48 85 c0 74
Nov 21 09:56:08 myhost nut-driver@nutdev2[24271]: Driver exited abnormally
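
The kernel line above can be decoded a little with plain shell arithmetic. The sketch below copies its values from that log line; the helper name decode_pf_error is made up for illustration, and the bit meanings are the standard x86 page-fault error code bits (bit 0: protection violation vs. page not present, bit 1: write vs. read, bit 2: user vs. kernel mode).

```shell
#!/bin/sh
# Values copied from the "segfault at ..." kernel log line.
ip=0x7d41477382b3      # faulting instruction pointer
base=0x7d41476b6000    # start of the libfreeipmi mapping
size=0x84000           # length of that mapping
err=4                  # x86 page-fault error code

# Offset of the faulting instruction inside the mapping.
offset=$(($ip - $base))
printf 'offset in mapping: 0x%x\n' "$offset"   # prints: offset in mapping: 0x822b3

# decode_pf_error CODE: describe the low three bits of an x86 page-fault error code.
# (Hypothetical helper, not part of NUT or FreeIPMI.)
decode_pf_error() {
    if [ $(($1 & 1)) -ne 0 ]; then cause="protection violation"; else cause="page not present"; fi
    if [ $(($1 & 2)) -ne 0 ]; then access="write"; else access="read"; fi
    if [ $(($1 & 4)) -ne 0 ]; then mode="user"; else mode="kernel"; fi
    echo "$mode-mode $access, $cause"
}

decode_pf_error "$err"   # prints: user-mode read, page not present
```

So the fault is a user-mode read from an unmapped address, raised by code inside the libfreeipmi mapping. The faulting address (5ce5d4b5ecb9) is not near zero, which would be consistent with reading through a stale or invalid pointer rather than a plain NULL dereference, although the log alone does not prove that.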

victorclaessen avatar Nov 21 '25 09:11 victorclaessen

Odd, it seems like the driver was built against one version of the library, and finds another at run time?..

Can you please double-check with a custom build of NUT (you can run the driver from build area to test), so it is certain which lib gets linked?

For details see wiki https://github.com/networkupstools/nut/wiki/Building-NUT-for-in%E2%80%90place-upgrades-or-non%E2%80%90disruptive-tests
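
A quick way to see which libfreeipmi a given driver binary resolves at run time is ldd. This is only a sketch: linked_libs is a hypothetical helper name, and the paths in the usage comments are the ones that appear in this thread, so adjust them to your system.

```shell
#!/bin/sh
# linked_libs BINARY PATTERN: print the shared-library lines that ldd
# resolves for BINARY and that match PATTERN.
# (Hypothetical helper, not part of NUT.)
linked_libs() {
    ldd "$1" | grep -- "$2"
}

# Possible usage:
#   linked_libs /usr/lib/nut/nut-ipmipsu freeipmi    # packaged driver
#   linked_libs ./drivers/nut-ipmipsu freeipmi       # driver from the build area
```

If both binaries resolve to the same libfreeipmi.so.17 file and still crash, a build/run-time library mismatch becomes less likely.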

jimklimov avatar Nov 21 '25 09:11 jimklimov

Also, try starting with higher debug verbosity (on the command line, or with debug_min = 6 in ups.conf; see https://github.com/networkupstools/nut/wiki/Changing-NUT-daemon-debug-verbosity ).

Maybe the library linking is OK, but the driver does pass some garbage to it - and a trace would help find when and what.

jimklimov avatar Nov 21 '25 09:11 jimklimov

Does this help?

/usr/lib/nut/nut-ipmipsu -s id1 -d1 -x port=id1 -DDDDDD
Network UPS Tools - IPMI PSU driver 0.32 (2.8.1)
Warning: This is an experimental driver.
Some features may not function correctly.

   0.000000     [D1] Network UPS Tools version 2.8.1 (release/snapshot of 2.8.1) built with gcc (Debian 14.2.0-19) 14.2.0 and configured with flags: --build=x86_64-linux-gnu --prefix=/usr --includedir=${prefix}/include --mandir=${prefix}/share/man --infodir=${prefix}/share/info --sysconfdir=/etc --localstatedir=/var --disable-option-checking --disable-silent-rules --libdir=${prefix}/lib/x86_64-linux-gnu --runstatedir=/run --disable-maintainer-mode --disable-dependency-tracking --prefix=/usr --sysconfdir=/etc/nut --includedir=/usr/include --mandir=/usr/share/man --libdir=${prefix}/lib/x86_64-linux-gnu --libexecdir=/usr/libexec --with-ssl --with-nss --with-cgi --with-dev --enable-static --with-statepath=/run/nut --with-altpidpath=/run/nut --with-drvpath=/usr/lib/nut --with-cgipath=/usr/lib/cgi-bin/nut --with-htmlpath=/usr/share/nut/www --with-pidpath=/run/nut --datadir=/usr/share/nut --with-pkgconfig-dir=/usr/lib/x86_64-linux-gnu/pkgconfig --with-user=nut --with-group=nut --with-udev-dir=/usr/lib/udev --with-systemdsystemunitdir=/usr/lib/systemd/system --with-systemdshutdowndir=/usr/lib/systemd/system-shutdown --with-systemdtmpfilesdir=/usr/lib/tmpfiles.d --with-python=python3 --with-python3=/usr/bin/python3 --with-doc=man
   0.000078     [D1] debug level is '6'
   0.000100     [D5] send_to_all: SETINFO driver.debug "6"
   0.000115     [D5] send_to_all: SETFLAGS driver.debug RW NUMBER
   0.002413     [D1] Succeeded to become_user(nut): now UID=106 GID=106
   0.002447     [D5] send_to_all: SETINFO device.type "ups"
   0.002466     [D5] send_to_all: SETINFO driver.state "init.device"
   0.002478     [D1] upsdrv_initups...
   0.002492     [D2] Device ID 0x1
   0.002692     [D1] nut-libfreeipmi: nutipmi_open()...
   0.002857     [D1] FreeIPMI initialized...
   0.015407     [D1] entering libfreeipmi_get_board_info()
   0.015547     [D5] FRU Board Language: English
   0.015570     [D2] FRU Board Manufacturing Date/Time: 10/21/14 - 21:48:00
   0.020353     [D1] entering libfreeipmi_get_psu_info()
   0.020452     [D1] libfreeipmi_get_psu_info() retrieved successfully
   0.023717     [D3] Found 150 records in SDR cache
   0.023762     [D5] Checking record 0 (/150)
   0.023771     [D1] =======> not device locator (2)!!
   0.023781     [D5] Checking record 1 (/150)
   0.023787     [D1] =======> not device locator (2)!!
   0.023796     [D5] Checking record 2 (/150)
   0.023806     [D1] =======> not device locator (18)!!
   0.023818     [D5] Checking record 3 (/150)
   0.023849     [D2] Checking device 1/0
   0.023863     [D5] Checking record 4 (/150)
   0.023881     [D2] Checking device 0/176
   0.023893     [D5] Checking record 5 (/150)
   0.023910     [D2] Checking device 0/176
   0.023922     [D5] Checking record 6 (/150)
   0.023939     [D2] Checking device 1/1
   0.023957     [D1] Found device id 1
   0.023970     [D5] Checking record 1 (/150)
   0.024036     [D5] Checking record 2 (/150)
   0.024063     [D5] Checking record 3 (/150)
...
   0.025368     [D5] Checking record 63 (/150)
   0.025388     [D1] Found record id = 63 for device id 1
   0.025401     [D5] Checking record 64 (/150)
...
   0.025713     [D5] Checking record 78 (/150)
   0.025739     [D1] Found record id = 78 for device id 1
   0.025752     [D5] Checking record 79 (/150)
   0.025778     [D5] Checking record 80 (/150)
   0.025802     [D1] Found record id = 80 for device id 1
   0.025815     [D5] Checking record 81 (/150)
   0.025841     [D5] Checking record 82 (/150)
   0.025864     [D5] Checking record 83 (/150)
   0.025884     [D1] Found record id = 83 for device id 1
   0.025902     [D5] Checking record 84 (/150)
...
   0.027383     [D5] Checking record 149 (/150)
   0.027395     [D5] Checking record 150 (/150)
Segmentation fault

victorclaessen avatar Nov 21 '25 15:11 victorclaessen

I get exactly the same result when running the nut-ipmipsu built with ./ci_build.sh inplace.

victorclaessen avatar Nov 21 '25 17:11 victorclaessen

Thanks for checking. Judging by the logged strings, it seems to loop here: https://github.com/networkupstools/nut/blob/fcdff401170426f1b00e2aedb3657f7e0f08a134/drivers/nut-libfreeipmi.c#L687-L707

and "Found record id = %u for device id %i" pops up below in the method (but above cleanup label) at https://github.com/networkupstools/nut/blob/fcdff401170426f1b00e2aedb3657f7e0f08a134/drivers/nut-libfreeipmi.c#L797

Can you try running this in a debugger - maybe stepping through the driver with an IDE, or at least getting a backtrace of the fault with GDB? IIRC, something like:

:; LD_LIBRARY_PATH=`pwd`/clients/.libs:$LD_LIBRARY_PATH gdb --args ./drivers/nut-ipmipsu -s id1 -d1 -x port=id1 -DDDDDD

> run
(fail)
> bt
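
If an interactive session is inconvenient, GDB can also do the same run-and-backtrace in one go with its -batch and -ex options. A minimal sketch, where gdb_bt is a hypothetical wrapper name:

```shell
#!/bin/sh
# gdb_bt PROGRAM [ARGS...]: run PROGRAM under GDB non-interactively and
# print a backtrace once it stops (for example on SIGSEGV).
# (Hypothetical wrapper, not part of NUT.)
gdb_bt() {
    gdb -batch -ex run -ex bt --args "$@"
}

# Possible usage from the NUT build area, mirroring the command above:
#   export LD_LIBRARY_PATH="$PWD/clients/.libs:$LD_LIBRARY_PATH"
#   gdb_bt ./drivers/nut-ipmipsu -s id1 -d1 -x port=id1 -DDDDDD
```

The output can be redirected to a file and attached to the issue. If the program exits normally instead of crashing, bt just reports that there is no stack.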

jimklimov avatar Nov 21 '25 18:11 jimklimov

   0.034854     [D5] Checking record 148 (/150)
   0.034868     [D5] Checking record 149 (/150)
   0.034882     [D5] Checking record 150 (/150)

Program received signal SIGSEGV, Segmentation fault.
0x00007ffff7b4b2b3 in ?? () from /lib/x86_64-linux-gnu/libfreeipmi.so.17
(gdb) bt
#0  0x00007ffff7b4b2b3 in ?? () from /lib/x86_64-linux-gnu/libfreeipmi.so.17
#1  0x00007ffff7b1b43e in ipmi_sdr_ctx_destroy () from /lib/x86_64-linux-gnu/libfreeipmi.so.17
#2  0x000055555555adc2 in libfreeipmi_cleanup () at nut-libfreeipmi.c:342
#3  0x000055555555b824 in nut_ipmi_open (ipmi_id=<optimized out>, ipmi_dev=ipmi_dev@entry=0x55555557f780 <ipmi_dev>)
    at nut-libfreeipmi.c:275
#4  0x000055555555ac98 in upsdrv_initups () at nut-ipmipsu.c:269
#5  0x00005555555598d3 in main (argc=<optimized out>, argv=<optimized out>) at main.c:2902

victorclaessen avatar Nov 21 '25 20:11 victorclaessen

(thanks for your efforts, btw!)

victorclaessen avatar Nov 21 '25 20:11 victorclaessen

Hard to say, googling brings up almost nothing about the method.

There are sources at e.g. https://packages.debian.org/sid/libfreeipmi-dev (tarball at least) to peruse... Maybe you can build a copy of the library (with debug symbols and all), and rebuild NUT against it (or maybe LD_LIBRARY_PATH would suffice to use your build), and it would expose what happens. SIGSEGV is usually about either NULL dereference, or using an already freed memory area.

jimklimov avatar Nov 23 '25 00:11 jimklimov

I think the problem is in the library (the crash is a couple of calls into it, so that is bad input checking on their side even if we are at fault somehow), but I wonder whether nut-scanner misbehaves the same way or works okay?

jimklimov avatar Nov 23 '25 10:11 jimklimov

> Hard to say, googling brings up almost nothing about the method.
>
> There are sources at e.g. https://packages.debian.org/sid/libfreeipmi-dev (tarball at least) to peruse... Maybe you can build a copy of the library (with debug symbols and all), and rebuild NUT against it (or maybe LD_LIBRARY_PATH would suffice to use your build), and it would expose what happens. SIGSEGV is usually about either NULL dereference, or using an already freed memory area.

Sorry, I was occupied elsewhere the last weeks.

I am having no luck building a copy of libfreeipmi-dev. I keep running into all kinds of dependency hell problems.

mkdir ~/freeipmi-build && cd ~/freeipmi-build
apt source freeipmi
cd freeipmi-1.6.15/
./configure --prefix=/usr/local --enable-debug
...<SNIP>...
checking whether make sets $(MAKE)... (cached) yes
checking whether ln -s works... yes
checking whether it is safe to define __EXTENSIONS__... yes
checking whether _XOPEN_SOURCE should be defined... no
./configure: line 13798: PKG_PROG_PKG_CONFIG: command not found
checking for encryption support... yes
./configure: line 13870: syntax error near unexpected token `GCRYPT,'
./configure: line 13870: `  PKG_CHECK_MODULES(GCRYPT, libgcrypt, have_gcrypt=yes,'

apt install pkg-config libgcrypt-dev
Note, selecting 'libgcrypt20-dev' instead of 'libgcrypt-dev'
pkg-config is already the newest version (1.8.1-4).
libgcrypt20-dev is already the newest version (1.11.0-7).

😢

victorclaessen avatar Dec 12 '25 22:12 victorclaessen

I have asked for help here: https://savannah.gnu.org/bugs/index.php?67810

victorclaessen avatar Dec 12 '25 23:12 victorclaessen

> PKG_PROG_PKG_CONFIG: command not found

It seems you are missing the m4 macros, i.e. the dev packages for pkg-config, and maybe libtool, autoconf, automake, etc.

As a first shot, install the build dependencies for NUT; that might suffice. See docs/config-prereqs.txt
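
One thing worth checking: the errors come from a configure script that was already generated, so installing packages afterwards does not change it. The literal PKG_PROG_PKG_CONFIG and PKG_CHECK_MODULES lines suggest it was generated without the pkg-config macros and may need regenerating. The sketch below is a heuristic under that assumption; has_unexpanded_pkg_macros is a hypothetical helper name, and the autoreconf step in the comments is a suggestion, not a tested recipe for freeipmi.

```shell
#!/bin/sh
# has_unexpanded_pkg_macros FILE: succeed if FILE (a generated configure
# script) still contains literal pkg-config macro calls at the start of a line.
# (Hypothetical helper; a heuristic, not an official check.)
has_unexpanded_pkg_macros() {
    grep -Eq '^[[:space:]]*PKG_(PROG_PKG_CONFIG|CHECK_MODULES)' "$1"
}

# Possible usage in the unpacked source tree:
#   if has_unexpanded_pkg_macros ./configure; then
#       autoreconf -fi    # regenerate configure with pkg.m4 available to aclocal
#   fi
```

autoreconf needs autoconf, automake and libtool to be installed, and pkg-config must be present while it runs so that its pkg.m4 can be found.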

jimklimov avatar Dec 13 '25 08:12 jimklimov

I already have those installed; I previously built NUT locally with ci_build.sh, following the prerequisites/debian part of the manual.

victorclaessen avatar Dec 13 '25 10:12 victorclaessen