Segmentation fault(coredump) when using -c or -C options
We have deployed the exporter to approximately 200 AIX servers of various versions and TL levels with no issues.
There are 10 servers, all running at least AIX 7.1, that are having issues. When we set either -C or -c and Prometheus initiates a scrape, we get a segmentation fault. This happens on every version of the exporter that we have tested (1.14.3.0, 1.12.1.0, 1.8.0.0, and possibly others).
./node_exporter_aix -p 50005 -a -cmdif
Node exporter for AIX version 1.14.3.0 listening on port 50005
Segmentation fault(coredump)
We tested the debug version that was posted in another segmentation fault issue, and got a little extra info:
./node_exporter_aix_debug -p 50005 -a -cmdif
Node exporter for AIX version 1.12.1.0 listening on port 50005
Number of cpu records: 160
Segmentation fault(coredump)
We found that 9 of the 10 servers have 8 SMT threads with over 128 virtual CPUs allocated.
All the other servers that are working have fewer than 64 virtual CPUs.
Is there a limit on the number of CPUs that we could be hitting to cause the segmentation faults?
Can you give https://github.com/grafana/node_exporter_aix/releases/tag/v1.15.6 a whirl? We have been testing it with some of our users and it solved their segmentation fault, so I would love to see if it also solves your issue. Once it's baked in a bit, I'm going to submit a PR to upstream the changes.
We were able to confirm that 120 logical CPUs work fine, but adding one more (SMT8), which brings the total to 128 logical CPUs, causes the segmentation fault.
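For reference, the arithmetic behind these counts can be sketched as below. This is only a minimal illustration, assuming each virtual CPU contributes one logical CPU per SMT thread; the function name `logical_cpus` is made up for this example and is not part of the exporter.

```python
def logical_cpus(virtual_cpus: int, smt_threads: int) -> int:
    """Logical CPUs the OS sees, assuming one per SMT thread on each virtual CPU."""
    return virtual_cpus * smt_threads


SMT = 8  # SMT8, as reported in this thread

# 15 -> last count that worked, 16 -> first count that crashed,
# 20 -> matches the "Number of cpu records: 160" debug output.
for vcpus in (15, 16, 20):
    print(f"{vcpus} virtual CPUs x SMT{SMT} = {logical_cpus(vcpus, SMT)} logical CPUs")
```

Under that assumption, going from 15 to 16 virtual CPUs at SMT8 is exactly the step from 120 to 128 logical CPUs where the fault first appears.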
We will give this a try today!
This version seems to be working initially with only "-c" on 128+ (tested up to 168) logical CPUs. Definitely an improvement. The "-C" option is still causing the same segmentation fault errors.
https://github.com/grafana/node_exporter_aix/releases/tag/v1.15.7 <- give this a whirl. The -C goes through a different path than other collects so had to change that one too.
That seems to be running with no segmentation faults!
During the issues with this, we noticed that our CPU usage % doesn't seem to be coming out right on this or the older versions. Have you noticed this? This probably doesn't belong in this thread, I can start a new one to discuss.
I haven't, but it's not something I have looked into. If you want to start a new discussion and tag me with the exact details, I can take a look.
Please refer to pull requests #33, #34, and #35 and check whether they fix your issue.