Segmentation fault while using node_exporter 1.14.3.0 or 1.12.1.0
On AIX 7.2 TL5 we see a segmentation fault with 1.12.1.0. We upgraded to the latest 1.14.3.0, but the behavior remains the same.
[1]+ Segmentation fault (core dumped) /usr/local/bin/node_exporter_aix -acMdiPf -p 10051
The LPAR has many disks and 6 fiber adapters and is quite busy. Can someone help?
I noticed that just before the segmentation fault we see this error: Error calling perfstat_diskpath: Invalid argument
Note that we are using Veritas DMP for managing SAN paths.
Hi,
Could you see if you can find the location of the segfault? You may see line numbers in the 'errpt -a' output, or you could enable core dumps and take a look using a debugger like gdb. I can give you some better details if needed, and if you are able to share the core dump I could take a look myself.
Did this work for you before an upgrade to 1.12.1.0, or is this a new deployment?
Does the number of disks/adapters change frequently?
Does this happen at startup of the exporter, or does it run for some time before it happens?
Thor
The core file is generated under the root directory (/).
root@or1xx003[/]# ls -l core
-rw------- 1 root system 4110233 May 23 06:15 core
The behavior is the same with the 1.12 and 1.14 node exporter agents.
root@or1xx003[/]# lslpp -l|grep -i node
node_exporter_aix.rte 1.14.3.0 COMMITTED prometheus node_exporter for
The number of disks and adapters does not change frequently; they remain pretty static. When the node exporter is run with or without arguments, it crashes within 1-2 minutes, leaving the error "Error calling perfstat_diskpath: Invalid argument" on the console. I can share the core file; let me know where I can upload it for you.
This issue is observed on the majority of our systems, which run Veritas Cluster, Veritas Volume Manager and Veritas Dynamic Multipath (VxDMP).
Thank you for a quick reply
Regards Yash
Could you please gzip the core and attach it to this case?
I noticed some text in the core stating that root login is disabled. We have disabled direct root login on all AIX systems. After enabling direct root login in sshd_config, I see the following in errpt.
root@or1xxx[/]# errpt
IDENTIFIER TIMESTAMP  T C RESOURCE_NAME  DESCRIPTION
A924A5FC   0527004122 P S SYSPROC        SOFTWARE PROGRAM ABNORMALLY TERMINATED
A924A5FC   0527003722 P S SYSPROC        SOFTWARE PROGRAM ABNORMALLY TERMINATED
root@or1xxx[/]# errpt -a -j A924A5FC
LABEL: CORE_DUMP
IDENTIFIER: A924A5FC
Date/Time: Fri May 27 00:41:08 2022
Sequence Number: 1837132
Machine Id: 00CB2F274C00
Node Id: or1sxxxx
Class: S
Type: PERM
WPAR: Global
Resource Name: SYSPROC
Description SOFTWARE PROGRAM ABNORMALLY TERMINATED
Probable Causes SOFTWARE PROGRAM
User Causes USER GENERATED SIGNAL
Recommended Actions
CORRECT THEN RETRY
Failure Causes SOFTWARE PROGRAM
Recommended Actions
RERUN THE APPLICATION PROGRAM
IF PROBLEM PERSISTS THEN DO THE FOLLOWING
CONTACT APPROPRIATE SERVICE REPRESENTATIVE
Detail Data
SIGNAL NUMBER 11
USER'S PROCESS ID: 32833812
FILE SYSTEM SERIAL NUMBER 1
INODE NUMBER 2
CORE FILE NAME //core
PROGRAM NAME node_exporter_aix
STACK EXECUTION DISABLED 0
COME FROM ADDRESS REGISTER perfstat_ 13C
PROCESSOR ID
hw_fru_id: 2
hw_cpu_id: 35
ADDITIONAL INFORMATION
_Z12gathe 128
_Z12gathe 80
Symptom Data
REPORTABLE 1
INTERNAL ERROR 0
SYMPTOM CODE PCSS/SPI2 FLDS/node_expo SIG/11 FLDS/Z12gathe VALU/128 FLDS/perfstat
LABEL: CORE_DUMP
IDENTIFIER: A924A5FC
Date/Time: Fri May 27 00:37:08 2022
Sequence Number: 1837131
Machine Id: 00CB2F274C00
Node Id: or1xxx
Class: S
Type: PERM
WPAR: Global
Resource Name: SYSPROC
Description SOFTWARE PROGRAM ABNORMALLY TERMINATED
Probable Causes SOFTWARE PROGRAM
User Causes USER GENERATED SIGNAL
Recommended Actions
CORRECT THEN RETRY
Failure Causes SOFTWARE PROGRAM
Recommended Actions
RERUN THE APPLICATION PROGRAM
IF PROBLEM PERSISTS THEN DO THE FOLLOWING
CONTACT APPROPRIATE SERVICE REPRESENTATIVE
Detail Data
SIGNAL NUMBER 11
USER'S PROCESS ID: 33095964
FILE SYSTEM SERIAL NUMBER 1
INODE NUMBER 2
CORE FILE NAME //core
PROGRAM NAME node_exporter_aix
STACK EXECUTION DISABLED 0
COME FROM ADDRESS REGISTER perfstat_ 13C
PROCESSOR ID
hw_fru_id: 2
hw_cpu_id: 35
ADDITIONAL INFORMATION
_Z12gathe 128
_Z12gathe 80
Symptom Data
REPORTABLE 1
INTERNAL ERROR 0
SYMPTOM CODE PCSS/SPI2 FLDS/node_expo SIG/11 FLDS/Z12gathe VALU/128 FLDS/perfstat
However, the core dump still happens immediately after running the exporter:
/usr/local/bin/node_exporter_aix -acMdiPf -p 10051 &
I have a couple of questions:
- Can we enable debug output in node_exporter, or enable logging to a file?
- Is there any way to skip the diskpath module?
At the moment I am not able to share the core dump due to company policy.
Thanks Yash
- Unfortunately there is no additional debugging available in node_exporter_aix.
- If you run 'node_exporter_aix -h', you will see the available modules. You should be able to include only the modules you want (that is, exclude -D).
Do you have access to gdb on AIX? If so, could you copy the core and the node_exporter_aix binary to a server with gdb available and run 'gdb node_exporter_aix core'?
When in gdb, please run 'where' to get a stack trace from the core file.
You could also try to run node_exporter_aix with one module enabled at a time, to try to zero in on where the issue is.
Also, what version of AIX are you running where it is crashing? (oslevel -s)
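The one-module-at-a-time suggestion above could be scripted roughly as follows. This is only a sketch and not part of node_exporter_aix: the function name single_module_commands is made up for illustration, the binary path and port 10051 are taken from the command line posted earlier in this thread, and it assumes each module letter can be passed as its own flag. It only prints the command lines, so each one can be started by hand and watched for a crash.

```shell
#!/bin/sh
# Sketch only: print one node_exporter_aix command line per module flag.
# EXPORTER and PORT defaults are assumptions taken from this thread.
EXPORTER=${EXPORTER:-/usr/local/bin/node_exporter_aix}
PORT=${PORT:-10051}

# single_module_commands FLAGS
# FLAGS is a string of module letters, for example "acMdiPf".
single_module_commands() {
    flags=$1
    while [ -n "$flags" ]; do
        rest=${flags#?}          # everything after the first character
        flag=${flags%"$rest"}    # the first character only
        printf '%s -%s -p %s\n' "$EXPORTER" "$flag" "$PORT"
        flags=$rest
    done
}
```

For example, single_module_commands "acMdiPf" prints seven command lines, one per letter. Since the crash was reported to take 1-2 minutes, each command probably needs to run at least that long before moving on to the next one.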
I have installed gdb and got the following. I ran gdb on the other cores as well, and they all state that the program terminated with signal SIGSEGV. Let me know if you need more details from the core.
root@or1xxx001# /opt/freeware/bin/gdb aix_node_exporter core
GNU gdb (GDB) 10.2
....
Core was generated by `node_exporter_aix'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0 0x10118328 in ?? ()
(gdb) where
#0 0x10118328 in ?? ()
#1 0x10118280 in ?? ()
#2 0x10029078 in ?? ()
#3 0x1002ae04 in ?? ()
#4 0x1042bdcc in ?? ()
#5 0x10409d34 in ?? ()
#6 0x103ea18c in ?? ()
#7 0x103c0044 in ?? ()
#8 0x103ad424 in ?? ()
#9 0x103b7b9c in ?? ()
#10 0x103b78ec in ?? ()
#11 0x103b762c in ?? ()
#12 0x103b736c in ?? ()
#13 0x103b70b8 in ?? ()
#14 0x103b6de8 in ?? ()
#15 0x103b45fc in ?? ()
#16 0x102fd74c in ?? ()
#17 0x102f7df8 in ?? ()
#18 0x102f7318 in ?? ()
#19 0x1033d858 in ?? ()
#20 0x102ec250 in ?? ()
#21 0x100294b0 in ?? ()
#22 0x1002a674 in ?? ()
#23 0x10029ccc in ?? ()
#24 0x1002d01c in ?? ()
#25 0x1002cf70 in ?? ()
#26 0x1002cec0 in ?? ()
#27 0x102f3534 in ?? ()
#28 0xd0579fc8 in ?? ()
#29 0x00000000 in ?? ()
I recently deployed 1.12.1.0 across all our AIX LPARs. Do you recommend upgrading to 1.14.3.0?
If I exclude (D) and (d) from the /usr/local/bin/node_exporter_aix command line, the agent doesn't crash, but I end up losing data related to disk queues, disk timers, etc.
It now seems I have to run this agent with two different sets of command line arguments, e.g.
AIX LPARs without VxDMP
/usr/local/bin/node_exporter_aix -p
AIX LPARs with VxDMP
/usr/local/bin/node_exporter_aix -cCAMmiabPf -p
Let me know if you need more info from core.
Thank you for all the help!
Yash
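The two invocations above could be kept in one small helper so that the same script is deployed on every LPAR. This is a rough sketch under stated assumptions: the function name exporter_command and the "vxdmp" keyword are hypothetical, the port is assumed to be 10051 as in the first command in this thread, and deciding whether a given host uses VxDMP is left to the caller.

```shell
#!/bin/sh
# Sketch only: choose node_exporter_aix arguments depending on whether the
# LPAR uses VxDMP. Function name and "vxdmp" keyword are hypothetical.
EXPORTER=${EXPORTER:-/usr/local/bin/node_exporter_aix}
PORT=${PORT:-10051}   # assumption: port from the first command in this thread

# exporter_command HOST_TYPE
# Prints the command line for HOST_TYPE ("vxdmp" or anything else).
exporter_command() {
    case "$1" in
        vxdmp)
            # Module list from this thread; it leaves out -D and -d.
            printf '%s -cCAMmiabPf -p %s\n' "$EXPORTER" "$PORT" ;;
        *)
            # No module flags, as in the command for LPARs without VxDMP.
            printf '%s -p %s\n' "$EXPORTER" "$PORT" ;;
    esac
}
```

Calling exporter_command vxdmp prints the reduced-module command line, and any other argument prints the command without module flags.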
Hmmm, interesting. I was expecting the names of the functions to be displayed.
Could you please try running this version? It will output some debugging data that could help me locate the issue.
Also, try to execute the exporter under gdb:
gdb --args node_exporter_aix
then type run to execute.
If it crashes, run 'where' and maybe 'list' as well.
I have attached the debug build of the exporter in this comment.
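If typing the gdb commands interactively is inconvenient, the same steps can probably be run in one go. The sketch below assumes a gdb that supports the -batch, -ex and --args options (GDB 10.2 does); the wrapper name run_under_gdb is made up for illustration, and the GDB variable can point at a specific binary such as /opt/freeware/bin/gdb.

```shell
#!/bin/sh
# Sketch only: run a program under gdb non-interactively and print a
# backtrace and source listing once it stops.
GDB=${GDB:-gdb}

# run_under_gdb PROGRAM [ARGS...]
run_under_gdb() {
    # -batch makes gdb exit after the -ex commands have been processed;
    # --args passes the rest of the command line to the debugged program.
    "$GDB" -batch -ex run -ex where -ex list --args "$@"
}
```

For example, run_under_gdb node_exporter_aix starts the exporter under gdb and prints the 'where' and 'list' output if it crashes. If the program does not crash, gdb keeps waiting until the program is stopped.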
We are having a similar issue, but with "stock" AIX file systems and a large number of disks.
This is the program output.
Number of diskpath records: 0
Error calling perfstat_diskpath: Invalid argument
Number of memory_page records: 4
.... lines deleted by me
Number of disk records: 520
Segmentation fault
Under gdb it doesn't tell us much more...
gdb --args node_exporter_aix_debug -p 9200
GNU gdb (GDB) 10.2
Copyright (C) 2021 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later http://gnu.org/licenses/gpl.html
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "powerpc64-ibm-aix7.1.0.0".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
https://www.gnu.org/software/gdb/bugs/.
Find the GDB manual and other documentation resources online at:
http://www.gnu.org/software/gdb/documentation/.
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from node_exporter_aix_debug...
(gdb) show args
Argument list to give program being debugged when it is started is "-p 9200".
(gdb) set follow-fork-mode child
(gdb) run
Starting program: /var/tmp/node_exporter_aix_debug -p 9200
[New Thread 1]
Node exporter for AIX version 1.12.1.0 listening on port 9200
[New Thread 258]
[Attaching after Thread 258 fork to child process 21103002]
[New inferior 2 (process 21103002)]
[Detaching after fork from parent process 12386564]
[Inferior 1 (process 12386564) detached]
Thread 2.1 received signal SIGTRAP, Trace/breakpoint trap.
[Switching to process 21103002]
0x10000100 in ?? ()
(gdb) where
#0 0x10000100 in ?? ()
#1 0xdeadbeef in ?? ()
(gdb) list
27 main.cpp: A file or directory in the path name does not exist..
I found that this is happening due to a memory leak. I added calloc (dynamic memory allocation), and after that the issue was fixed on my AIX servers. Please refer to pull requests #33, #34 and #35.