node_exporter_aix

Segmentation fault(coredump) when using -c or -C options

Open dks0296586 opened this issue 3 years ago • 7 comments

We have deployed the exporter to approximately 200 AIX servers of various versions and TL levels with no issues.

There are 10 servers, all running at least AIX 7.1, that are having issues. When we set either -C or -c and Prometheus initiates the scrape, we get a segmentation fault. This happens on every version of the exporter that we have tested (1.14.3.0, 1.12.1.0, 1.8.0.0, maybe others).

./node_exporter_aix -p 50005 -a -cmdif
Node exporter for AIX version 1.14.3.0 listening on port 50005
Segmentation fault(coredump)

We tested the debug version that was posted in another segmentation fault issue, and got a little extra info:

./node_exporter_aix_debug -p 50005 -a -cmdif
Node exporter for AIX version 1.12.1.0 listening on port 50005
Number of cpu records: 160
Segmentation fault(coredump)

We found that 9 of the 10 servers have 8 SMT threads with over 128 virtual CPUs allocated.
All the other servers that are working have fewer than 64 virtual CPUs.

Is there a limit on the number of CPUs that we could be hitting to cause the segmentation faults?
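The thread does not identify the root cause, so the following is only a sketch of one common way a CPU-count limit can turn into a crash: a buffer sized for a compile-time maximum receives more per-CPU records than it can hold. The sketch shows the usual remedy of asking for the record count first and sizing the buffer from it. `CpuRecord`, `query_cpu_records`, and `collect_cpu_records` are hypothetical names for illustration and are not taken from the exporter's source.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical per-CPU record; not the exporter's real type.
struct CpuRecord {
    int id;
    unsigned long long user_ticks;
};

// Hypothetical stand-in for a system query. It mimics the common
// "pass nullptr to learn the count" convention: with a null buffer it
// returns how many records exist, otherwise it fills at most `capacity`
// records and returns how many it wrote.
std::size_t query_cpu_records(CpuRecord* out, std::size_t capacity,
                              std::size_t available) {
    if (out == nullptr) {
        return available;
    }
    const std::size_t n = capacity < available ? capacity : available;
    for (std::size_t i = 0; i < n; ++i) {
        out[i] = CpuRecord{static_cast<int>(i), 0ULL};
    }
    return n;
}

// Size the buffer from the reported count instead of a fixed maximum,
// so 160 records fit as easily as 64.
std::vector<CpuRecord> collect_cpu_records(std::size_t available) {
    const std::size_t count = query_cpu_records(nullptr, 0, available);
    std::vector<CpuRecord> records(count);
    const std::size_t written =
        query_cpu_records(records.data(), records.size(), available);
    records.resize(written);
    return records;
}
```

With this shape, a host reporting 160 CPU records simply gets a 160-element vector; whether the exporter's crash actually follows this pattern is an assumption.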

dks0296586 • Nov 01 '22 18:11

Can you give https://github.com/grafana/node_exporter_aix/releases/tag/v1.15.6 a whirl? We have been testing it with some of our users and it solved their segmentation fault; I would love to see if it also solves your issue. Once it's baked in a bit, I'm going to submit a PR to upstream the changes.

mattdurham • Nov 02 '22 14:11

We were able to confirm that 120 logical CPUs is fine, but adding one more (SMT8), for a total of 128 logical CPUs, causes the segmentation fault.
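The counts mentioned in this thread (120, 128, 160, 168) are all multiples of 8, which is consistent with SMT8 exposing eight logical CPUs per processor. A minimal sketch of that arithmetic follows; it assumes that "adding one more" means one more virtual processor, and the function names are made up for illustration.

```cpp
// Logical CPUs exposed to the OS, assuming every virtual processor
// runs the same number of SMT threads.
constexpr int logical_cpus(int virtual_processors, int smt_threads) {
    return virtual_processors * smt_threads;
}

// Virtual processors implied by a logical CPU count, under the same
// assumption. Returns -1 if the count is not a whole multiple.
constexpr int virtual_processors_for(int logical, int smt_threads) {
    return (smt_threads > 0 && logical % smt_threads == 0)
               ? logical / smt_threads
               : -1;
}

// 15 processors at SMT8 give the working count, 16 give the failing one.
static_assert(logical_cpus(15, 8) == 120, "working configuration");
static_assert(logical_cpus(16, 8) == 128, "first failing configuration");
```

Under that assumption, the step from the working to the failing configuration is a single processor, which adds eight logical CPUs at once.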

> Can you give https://github.com/grafana/node_exporter_aix/releases/tag/v1.15.6 a whirl?

We will give this a try today!

dks0296586 • Nov 03 '22 13:11

> Can you give https://github.com/grafana/node_exporter_aix/releases/tag/v1.15.6 a whirl? We have been testing it with some of our users and it solved their segmentation fault; I would love to see if it also solves your issue. Once it's baked in a bit, I'm going to submit a PR to upstream the changes.

This version seems to be working initially with only "-c" on 128+ logical CPUs (tested up to 168). Definitely an improvement. The "-C" option is still causing the same segmentation fault errors.

dks0296586 • Nov 03 '22 15:11

https://github.com/grafana/node_exporter_aix/releases/tag/v1.15.7 <- give this a whirl. The -C option goes through a different path than the other collects, so I had to change that one too.

mattdurham • Nov 04 '22 11:11

> https://github.com/grafana/node_exporter_aix/releases/tag/v1.15.7 <- give this a whirl. The -C option goes through a different path than the other collects, so I had to change that one too.

That seems to be running with no segmentation faults!

While working through this issue, we noticed that our CPU usage % doesn't seem to be coming out right on this version or the older ones. Have you noticed this? It probably doesn't belong in this thread; I can start a new one to discuss it.

dks0296586 • Nov 04 '22 16:11

I haven't, but it's not something I have looked into. If you want to start a new discussion and tag me with the exact details, I can take a look.

mattdurham • Nov 07 '22 16:11

Please refer to pull requests #33, #34, and #35 and check whether they fix your issue.

lbsivahari • Aug 02 '23 17:08