pmcd Error: Too many open files fills up the log
I have about 30 VMs with CentOS 7 kernel v. 3.10.0-1160.42.2.el7.x86_64 and PCP version 5.3.5-1.x86_64 installed, core PMDAs only:
root 1 pipe binary /var/lib/pcp/pmdas/root/pmdaroot
pmcd 2 dso pmcd_init /var/lib/pcp/pmdas/pmcd/pmda_pmcd.so
proc 3 pipe binary /var/lib/pcp/pmdas/proc/pmdaproc -d 3
pmproxy 4 dso pmproxy_init /var/lib/pcp/pmdas/mmv/pmda_mmv.so
xfs 11 pipe binary /var/lib/pcp/pmdas/xfs/pmdaxfs -d 11
linux 60 pipe binary /var/lib/pcp/pmdas/linux/pmdalinux
mmv 70 dso mmv_init /var/lib/pcp/pmdas/mmv/pmda_mmv.so
kvm 95 pipe binary /var/lib/pcp/pmdas/kvm/pmdakvm -d 95
jbd2 122 dso jbd2_init /var/lib/pcp/pmdas/jbd2/pmda_jbd2.so
All of the hosts are only monitored with Prometheus. The problem I have been experiencing is that occasionally one of the hosts runs out of space on the root partition due to pmcd.log filling up with messages like this one:
[Mon Feb 21 16:25:49] pmcd(99908) Error: AcceptNewClient(0): Unexpected error from __pmAccept: 24: Too many open files
It takes about 2 days for pmcd.log to grow from 0 to 30GB. As a workaround I set up daily log rotation for all PCP logs:
/var/log/pcp/*/*.log {
    rotate 5
    daily
    missingok
    copytruncate
    su root root
    lastaction
        /bin/systemctl restart pmcd pmproxy
    endscript
}
This reduced the occurrence of the problem, but did not eliminate it. I noticed pmcd logs of 3GB in size where log rotation and the service restart prevented further growth, however today I saw a pmcd.log that grew to 30GB and filled up the free space, which probably caused logrotate to fail, because the logs of other services had not been rotated for 2 days either.
I am not sure if this is related, but a day before the incident there was a segfault recorded in /var/log/messages:
Feb 18 23:47:21 node16 kernel: pmproxy[74409]: segfault at 0 ip 00007f84bbb04eb0 sp 00007f84b5e92b00 error 4 in libuv.so.1.0.0[7f84bbafa000+2c000]
Feb 18 23:47:21 node16 systemd: pmproxy.service: main process exited, code=killed, status=11/SEGV
Feb 18 23:47:21 node16 systemd: Unit pmproxy.service entered failed state.
Feb 18 23:47:21 node16 systemd: pmproxy.service failed.
Feb 18 23:47:22 node16 systemd: pmproxy.service holdoff time over, scheduling restart.
Feb 18 23:47:22 node16 systemd: Stopped Proxy for Performance Metrics Collector Daemon.
Feb 18 23:47:22 node16 systemd: Starting Proxy for Performance Metrics Collector Daemon...
Feb 18 23:47:22 node16 systemd: Started Proxy for Performance Metrics Collector Daemon.
It seems like PCP recovered from this error. About 4 hours later log rotation took place and pmcd was restarted as expected. About 6 hours after the rotation the first error was logged in pmcd.log, however the system log did not record any other problems around that time. In pmproxy.log there were 2 errors logged without timestamps:
on_write_callback: ERROR uv_write failed
on_write_callback: ERROR uv_write failed
No errors were recorded in the PMDAs' logs.
If anyone is interested in investigating this issue I can provide more debug info if necessary.
hi @mac133k - thanks for reporting this! It's a new one to us ...
| pmcd.log filling up with messages like this one:
Could you capture the contents of "sudo ls -l /proc/[pmcd PID]/fd" under these conditions and share? It'll be a lot of output I imagine, so maybe a summary of what the file descriptors look like, or copy the whole lot to a GitHub gist perhaps?
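Something like this should be enough for a first pass - a rough sketch that just summarises the fd targets by type instead of the full listing (adjust the pgrep pattern if pmcd runs under a different name on your systems):

PID=$(pgrep -x pmcd)
sudo ls /proc/$PID/fd | wc -l                                  # total open fds
sudo ls -l /proc/$PID/fd | awk 'NR>1 {print $NF}' | cut -d: -f1 | sort | uniq -c | sort -rn   # count by type (socket, pipe, file path)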
| It takes about 2 days for pmcd.log to grow from 0 to 30GB.
Wow, OK. That log message is ~120 bytes. If we work off 2 days, that's 172800 seconds - at 30GB, if my back-of-the-napkin math is correct, you're getting this message logged ~1500 times per second ... ? I can't think of anything in PCP itself that might do that, so I'm really interested to see that fd listing - it feels like it must be a rogue pmcd client rapidly attempting new connections.
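Spelling that arithmetic out:

# 30GB of ~120-byte messages over 2 days (172800 seconds)
echo $(( 30 * 1024 * 1024 * 1024 / 120 / 172800 ))    # => ~1550 messages per second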
| Feb 18 23:47:21 node16 kernel: pmproxy[74409]: segfault at 0 ip 00007f84bbb04eb0 sp 00007f84b5e92b00 error 4 in libuv.so.1.0.0[7f84bbafa000+2c000]
If you have a core file somewhere (abrt?) it'd be really handy to get a symbolic stack trace from this.
I had another occurrence of this problem. This time there was no segfault, so that must have been unrelated; unfortunately a core dump was not available to look deeper into it. I did manage to capture more debug data for the main issue:
- this time pmcd.log grew from 0 to 46GB in 32 hours.
- there were 1024 FDs held by pmcd:
  - 1012 sockets
  - 13 pipes
  - 1 file: pmcd.log
- in /etc/security/limits.conf:
    * soft nofile 1048576
    * hard nofile 1048576
- no other limits in /etc/security/limits.d
- only one instance of each PCP process:
pcp 24331 1 80 Feb20 ? 1-23:05:32 /usr/libexec/pcp/bin/pmcd -A
root 24335 24331 5 Feb20 ? 03:22:57 /var/lib/pcp/pmdas/root/pmdaroot
root 24336 24335 5 Feb20 ? 03:02:23 /var/lib/pcp/pmdas/proc/pmdaproc -d 3
root 24338 24335 4 Feb20 ? 02:51:53 /var/lib/pcp/pmdas/xfs/pmdaxfs -d 11
root 24339 24335 4 Feb20 ? 02:37:18 /var/lib/pcp/pmdas/linux/pmdalinux
root 24340 24335 3 Feb20 ? 02:20:07 /var/lib/pcp/pmdas/kvm/pmdakvm -d 95
pcp 24341 1 0 Feb20 ? 00:12:28 /usr/libexec/pcp/bin/pmproxy -F -A
- pmproxy.log recorded:
on_write_callback: ERROR uv_write failed
on_write_callback: ERROR uv_write failed
A few more details that I need to add:
- VMs are monitored by Prometheus via PMProxy
- there are 2 Prometheus instances scraping metrics every 30 seconds:
  - one instance scrapes everything from /metrics, which amounts to about 6.5k metric instances (half of them being cgroup metrics); scrape durations are between 1.2 and 1.6 seconds
  - another instance scrapes only selected metrics:
- job_name: pcp
  honor_timestamps: true
  params:
    target:
    - kernel,mem,swap,network,filesys,disk,hinv
so the URL is http://< hostname >:44323/metrics?target=kernel%2Cmem%2Cswap%2Cnetwork%2Cfilesys%2Cdisk%2Chinv and the number of metric instances is about 1.5k which takes between 0.4 and 0.7 seconds to scrape.
PCP processes do not seem to be overwhelmed with scrape requests, judging by the CPU and memory utilisation per process. I suspect using the target param may somehow be causing the issue, because I remember having a problem some time ago with PCP v5.0 or 5.1 where scrapes were incomplete or timed out until I changed the Prometheus config and split the original job into multiple jobs, each with only one metric group as a target. I'm going to kill the Prometheus instance that has targets in the config and see if it makes any difference here.
| there were 1024 FDs held by pmcd:
Hmm, suspicious - seems like that maxfd setting of 1048576 hasn't taken effect? Might be worth checking it while pmcd is running via "cat /proc/[pmcd PID]/limits".
Indeed:
# cat /proc/83150/limits
Limit Soft Limit Hard Limit Units
Max cpu time unlimited unlimited seconds
Max file size unlimited unlimited bytes
Max data size unlimited unlimited bytes
Max stack size 8388608 unlimited bytes
Max core file size 0 unlimited bytes
Max resident set unlimited unlimited bytes
Max processes 31128 31128 processes
Max open files 1024 4096 files
Max locked memory 65536 65536 bytes
Max address space unlimited unlimited bytes
Max file locks unlimited unlimited locks
Max pending signals 31128 31128 signals
Max msgqueue size 819200 819200 bytes
Max nice priority 0 0
Max realtime priority 0 0
Max realtime timeout unlimited unlimited us
However 1024 is way more than a healthy pmcd process needs, so I would rather dig deeper to find the root cause than lift this limit wherever it is set.
Agreed, 1024 should be plenty. I wonder if we might be seeing a failure to clean up timed-out connections between pmproxy and pmcd. You could experiment with larger values (well beyond your Prometheus scrape interval) of the polltimeout setting - in /etc/pcp/pmproxy/pmproxy.conf that would be:
[pmwebapi]
timeout = 120
(this is a value in seconds)
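For completeness: if a host is about to fall over before we get to the bottom of this, pmcd's open file limit can also be raised as a temporary stop-gap with a systemd drop-in - a sketch only, the 8192 value and the drop-in filename are arbitrary, and this just masks the leak rather than fixing it:

sudo mkdir -p /etc/systemd/system/pmcd.service.d
printf '[Service]\nLimitNOFILE=8192\n' | sudo tee /etc/systemd/system/pmcd.service.d/limits.conf
sudo systemctl daemon-reload
sudo systemctl restart pmcd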
Querying metric endpoints without the target params helped to significantly reduce the occurrence of this issue, but did not eliminate it completely.
Just thought I'd chime in here - I'm having the same problem. Using Prometheus, Red Hat 8.5, and PCP 5.3.5. pmcd runs fine for weeks, but at some point something makes it go nuts and it rapidly consumes file descriptors until it hits the limit, and then the log file fills up my disk partition very rapidly.
Kevin
-- Kevin Hildebrand University of Maryland
@ke3vin @mac133k could you do some analysis of the contents of the /proc/[pmcd PID]/fd directory when the issue occurs? I think that's got to be the first bit of information we need to diagnose further.
If someone can produce an exact recipe to induce the failure (ideally as fast as possible), please paste it here too. Thanks!
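For what it's worth, this is roughly the kind of thing I'd try as a starting point - a purely hypothetical load generator hammering the same scrape URL mentioned earlier, not a confirmed reproducer:

URL='http://localhost:44323/metrics?target=kernel%2Cmem%2Cswap%2Cnetwork%2Cfilesys%2Cdisk%2Chinv'
for i in $(seq 1 500); do curl -s -o /dev/null "$URL" & done; wait     # 500 concurrent scrapes
sudo ls /proc/$(pgrep -x pmcd)/fd | wc -l                              # then watch whether fd counts drop back down
sudo ls /proc/$(pgrep -x pmproxy)/fd | wc -l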
I haven't figured out how to duplicate it yet, but as I have it on about two dozen hosts, at least one of them seems to get in the situation almost daily. I've got one that way now. Lots of open sockets from localhost->localhost:
From lsof (just a small chunk of the thousand+ open sockets):
pmcd 2236 pcp 1009u IPv6 5670302 0t0 TCP localhost:pmcd->localhost:37096 (ESTABLISHED)
pmcd 2236 pcp 1010u IPv6 5670335 0t0 TCP localhost:pmcd->localhost:37150 (ESTABLISHED)
pmcd 2236 pcp 1011u IPv6 5670343 0t0 TCP localhost:pmcd->localhost:37204 (ESTABLISHED)
pmcd 2236 pcp 1012u IPv6 5671407 0t0 TCP localhost:pmcd->localhost:37258 (ESTABLISHED)
pmcd 2236 pcp 1013u IPv6 5670348 0t0 TCP localhost:pmcd->localhost:37312 (ESTABLISHED)
pmcd 2236 pcp 1014u IPv6 5670397 0t0 TCP localhost:pmcd->localhost:37366 (ESTABLISHED)
pmcd 2236 pcp 1015u IPv6 5673037 0t0 TCP localhost:pmcd->localhost:37420 (ESTABLISHED)
pmcd 2236 pcp 1016u IPv6 5669608 0t0 TCP localhost:pmcd->localhost:37474 (ESTABLISHED)
pmcd 2236 pcp 1017u IPv6 5675230 0t0 TCP localhost:pmcd->localhost:37528 (ESTABLISHED)
pmcd 2236 pcp 1018u IPv6 5675262 0t0 TCP localhost:pmcd->localhost:37582 (ESTABLISHED)
pmcd 2236 pcp 1019u IPv6 5669637 0t0 TCP localhost:pmcd->localhost:37636 (ESTABLISHED)
pmcd 2236 pcp 1020u IPv6 5676375 0t0 TCP localhost:pmcd->localhost:37690 (ESTABLISHED)
A similar snippet from /proc/2236/fd:
lrwx------. 1 root root 64 May 9 08:48 979 -> 'socket:[5652133]'
lrwx------. 1 root root 64 May 9 08:48 98 -> 'socket:[5245025]'
lrwx------. 1 root root 64 May 9 08:48 980 -> 'socket:[5645176]'
lrwx------. 1 root root 64 May 9 08:48 981 -> 'socket:[5655122]'
lrwx------. 1 root root 64 May 9 08:48 982 -> 'socket:[5652150]'
lrwx------. 1 root root 64 May 9 08:48 983 -> 'socket:[5645233]'
lrwx------. 1 root root 64 May 9 08:48 984 -> 'socket:[5655916]'
lrwx------. 1 root root 64 May 9 08:48 985 -> 'socket:[5655964]'
lrwx------. 1 root root 64 May 9 08:48 986 -> 'socket:[5645247]'
lrwx------. 1 root root 64 May 9 08:48 987 -> 'socket:[5661697]'
lrwx------. 1 root root 64 May 9 08:48 988 -> 'socket:[5656244]'
lrwx------. 1 root root 64 May 9 08:48 989 -> 'socket:[5661430]'
lrwx------. 1 root root 64 May 9 08:48 99 -> 'socket:[5243452]'
lrwx------. 1 root root 64 May 9 08:48 990 -> 'socket:[5661491]'
I'll leave the broken process running for a bit, if you've got specific stuff you want me to look for, let me know.
Thanks, Kevin
Ah, at least on my host, it's 'pmproxy' that has the other end of those connections...
pmproxy 2312 pcp 1000u IPv6 5656573 0t0 TCP localhost:37042->localhost:pmcd (ESTABLISHED)
pmproxy 2312 pcp 1001u IPv6 5669522 0t0 TCP localhost:37096->localhost:pmcd (ESTABLISHED)
pmproxy 2312 pcp 1002u IPv6 5667108 0t0 TCP localhost:37150->localhost:pmcd (ESTABLISHED)
pmproxy 2312 pcp 1003u IPv6 5671371 0t0 TCP localhost:37204->localhost:pmcd (ESTABLISHED)
pmproxy 2312 pcp 1004u IPv6 5669581 0t0 TCP localhost:37258->localhost:pmcd (ESTABLISHED)
pmproxy 2312 pcp 1005u IPv6 5669590 0t0 TCP localhost:37312->localhost:pmcd (ESTABLISHED)
pmproxy 2312 pcp 1006u IPv6 5672226 0t0 TCP localhost:37366->localhost:pmcd (ESTABLISHED)
pmproxy 2312 pcp 1007u IPv6 5670403 0t0 TCP localhost:37420->localhost:pmcd (ESTABLISHED)
pmproxy 2312 pcp 1008u IPv6 5673068 0t0 TCP localhost:37474->localhost:pmcd (ESTABLISHED)
pmproxy 2312 pcp 1009u IPv6 5669630 0t0 TCP localhost:37528->localhost:pmcd (ESTABLISHED)
pmproxy 2312 pcp 1010u IPv6 5670467 0t0 TCP localhost:37582->localhost:pmcd (ESTABLISHED)
pmproxy 2312 pcp 1011u IPv6 5670487 0t0 TCP localhost:37636->localhost:pmcd (ESTABLISHED)
pmproxy 2312 pcp 1012u IPv6 5670512 0t0 TCP localhost:37690->localhost:pmcd (ESTABLISHED)
pmproxy 2312 pcp 1013u IPv6 5669668 0t0 TCP localhost:37744->localhost:pmcd (ESTABLISHED)
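(For anyone checking their own hosts, the same pairing can be confirmed in one shot with ss - assumes iproute2 is installed and root for the -p flag, and 44321 is the default pmcd port:)

sudo ss -tnp '( sport = :44321 or dport = :44321 )' | head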
Any additional thoughts on this? I'm having this issue almost daily when looking at ~20 machines. I need to add another 350 machines to monitoring soon and would like to get this resolved.
Thanks, Kevin
@ke3vin Do you have a multiple-item params/target section in your scrape configs? In my case, removing those params and scraping everything from the plain /metrics endpoints reduced, though did not completely eliminate, the occurrence of the problem.
| Any additional thoughts on this? I'm having this issue almost daily when looking at ~20 machines. I need to add another 350 machines to monitoring soon and would like to get this resolved.
I've been flat out, sorry - and off on leave next week. If you can try to find a recipe that reproduces the failure as fast as possible, that'll help immensely when I get back (both for diagnosing and then testing fixes).
thanks.
@mac133k @ke3vin I've been trying (unsuccessfully) to reproduce this issue over the last couple of days. Submitting thousands of concurrent requests to and from pmproxy works just fine - I can see hundreds of new connections arriving very quickly and then they slowly time out or get reused, ultimately always ending up in the correct steady state of open fds - for both pmcd and pmproxy. I think I'm going to have to dig deeper into the Prometheus configurations being used to reproduce this. Can you provide prometheus.yml configuration snippets that show me how to mirror your setups? Thanks.
One of the mysteries to me is why pmcd is the one running out of fds (and why not also pmproxy) - AIUI in your setups you have a single Prometheus instance talking to many remote-host PCP setups (each remote host having one pmcd and one pmproxy, right?). In this situation we should see just a small number of fds for both pmcd and pmproxy. One fd between the two and one fd open to Prometheus for each host during a scrape - so Prometheus should have more open fds than anyone (assuming some form of keep-alive is happening there).
Any additional information that would help me reproduce this would be much appreciated - thanks!
Sorry for the delay, I was in the middle of a big proposal. I should be back to being able to focus on this now. I've attached my Prometheus config. We've typically been recovering from the issue by removing the large log files and restarting pmcd, but I've recently noticed that that appears to leave pmproxy in a broken state: 'wget localhost:44323/metrics' returns 400 Bad Request, so pmproxy also needs to be restarted.
No problem @ke3vin - thanks for those config details. Any chance I could get a temporary login on one of the systems affected to attempt further diagnosis? (if this could be feasible, please contact me privately via my redhat.com address to set this up - thanks).
Has this issue been fixed? I've got repro systems that are on RHEL 8.4 with PCP updated to 5.3:
rpm -qa | grep pcp
pcp-pmda-hacluster-5.3.5-8.el8.x86_64
pcp-libs-5.3.5-8.el8.x86_64
pcp-conf-5.3.5-8.el8.x86_64
pcp-selinux-5.3.5-8.el8.x86_64
pcp-5.3.5-8.el8.x86_64
@rsponholtz not specifically, but there have been many changes in the meantime since this issue was opened (and since pcp-5.3.x) - a reproducible test case would be invaluable here.
VM information
- VM OS: Ubuntu 22.04
- Kernel: 5.15.0-87-generic
- pcp package: 5.3.6-1build1
We have the same problem as reported by mac133k and ke3vin. A few dozen VMs reported a disk full error, and the storage was basically taken up by pmcd.log, which was filled with "[time stamp] pmcd(1030) Error: AcceptNewClient(0): Unexpected error from __pmAccept: 24: Too many open files" log entries.
The head of pmcd.log shows the following:
active agent dom pid in out ver protocol parameters
============ === ===== === === === ======== ==========
root 1 1033 6 7 2 bin pipe cmd=/var/lib/pcp/pmdas/root/pmdaroot
pmcd 2 2 dso i:6 lib=/var/lib/pcp/pmdas/pmcd/pmda_pmcd.so entry=pmcd_init [0x7fe4047d1fa0]
proc 3 1034 9 10 2 bin pipe cmd=/var/lib/pcp/pmdas/proc/pmdaproc -d 3
xfs 11 1039 11 12 2 bin pipe cmd=/var/lib/pcp/pmdas/xfs/pmdaxfs -d 11
linux 60 1044 13 14 2 bin pipe cmd=/var/lib/pcp/pmdas/linux/pmdalinux
pmproxy 4 2 dso i:7 lib=/var/lib/pcp/pmdas/mmv/pmda_mmv.so entry=pmproxy_init [0x7fe4047c0c50]
mmv 70 2 dso i:7 lib=/var/lib/pcp/pmdas/mmv/pmda_mmv.so entry=mmv_init [0x7fe4047c0c90]
jbd2 122 2 dso i:4 lib=/var/lib/pcp/pmdas/jbd2/pmda_jbd2.so entry=jbd2_init [0x7fe4047b7070]
kvm 95 1054 15 16 2 bin pipe cmd=/var/lib/pcp/pmdas/kvm/pmdakvm -d 95
Host access list:
00 01 Cur/MaxCons host-spec host-mask lvl host-name
== == =========== ======================================= ======================================= === ==============
y y 0 0 127.0.1.1 255.255.255.255 0 localhost
y y 0 0 / / 1 unix:
n 0 0 0.0.0.0 0.0.0.0 4 .*
n 0 0 :: :: 8 :*
User access list empty: user-based access control turned off
Group access list empty: group-based access control turned off
pmcd: PID = 1030, PDU version = 2
pmcd request port(s):
sts fd port family address
=== ==== ===== ====== =======
ok 4 unix /run/pcp/pmcd.socket
ok 0 44321 inet INADDR_ANY
ok 3 44321 ipv6 INADDR_ANY
__pmLogFindLocalPorts: scandir: Too many open files
__pmLogFindLocalPorts: scandir: Too many open files
__pmLogFindLocalPorts: scandir: Too many open files
__pmLogFindLocalPorts: scandir: Too many open files
syslog contains the following log entries. These pmie alerts may be an indication of the start of the abnormal event; it could also be that pmcd was running out of fds. The pmcd process was at high utilization.
Apr 23 07:28:17 node-116 systemd[1]: pmie_farm_check.service: Deactivated successfully.
Apr 23 07:32:00 node-116 pcp-pmie[1811]: TCP listen overflows 0.0529275/sec@node-116
Apr 23 07:32:35 node-116 pcp-pmie[1811]: High aggregate context switch rate 85431ctxsw/s@node-116
Apr 23 07:41:25 node-116 pcp-pmie[1811]: TCP listen overflows 0.224999/sec@node-116
Apr 23 07:42:35 node-116 pcp-pmie[1811]: High aggregate context switch rate 110349ctxsw/s@node-116
Apr 23 07:51:25 node-116 pcp-pmie[1811]: TCP listen overflows 0.299987/sec@node-116
Apr 23 07:52:35 node-116 pcp-pmie[1811]: High aggregate context switch rate 113741ctxsw/s@node-116
lsof -c pmcd | grep localhost | wc -l returns 1004 with output similar to the following:
pmcd 1030 pcp 1001u IPv4 26875416 0t0 TCP localhost:44321->localhost:44886 (ESTABLISHED)
lsof -c pmproxy | grep localhost | wc -l returns 1004 with the following type of output:
pmproxy 1119894 pcp 1012u IPv4 26875776 0t0 TCP localhost:33448->localhost:44321 (ESTABLISHED)
PCP is installed as-is, without any configuration changes. There is no monitoring or any other non-PCP-related process accessing pmcd/pmproxy/pmlogger.
Can't figure out why there are so many connections between pmcd and pmproxy without an apparent triggering event.
One batch of hundreds of VMs has not had any problems (over six months).
This 2nd batch of hundreds of VMs (created last month) is not so fortunate, with dozens of VMs showing the disk full error.