
pmcd Error: Too many open files fills up the log

Open mac133k opened this issue 4 years ago • 31 comments

I have about 30 VMs running CentOS 7 (kernel 3.10.0-1160.42.2.el7.x86_64) with PCP version 5.3.5-1.x86_64 installed, core PMDAs only:

root	1	pipe	binary		/var/lib/pcp/pmdas/root/pmdaroot
pmcd	2	dso	pmcd_init	/var/lib/pcp/pmdas/pmcd/pmda_pmcd.so
proc	3	pipe	binary		/var/lib/pcp/pmdas/proc/pmdaproc -d 3
pmproxy	4	dso	pmproxy_init	/var/lib/pcp/pmdas/mmv/pmda_mmv.so
xfs	11	pipe	binary		/var/lib/pcp/pmdas/xfs/pmdaxfs -d 11
linux	60	pipe	binary		/var/lib/pcp/pmdas/linux/pmdalinux
mmv	70	dso	mmv_init	/var/lib/pcp/pmdas/mmv/pmda_mmv.so
kvm	95	pipe	binary		/var/lib/pcp/pmdas/kvm/pmdakvm -d 95
jbd2	122	dso	jbd2_init	/var/lib/pcp/pmdas/jbd2/pmda_jbd2.so

All of the hosts are monitored only with Prometheus. The problem I have been experiencing is that occasionally one of the hosts runs out of space on the root partition because pmcd.log fills up with messages like this one:

[Mon Feb 21 16:25:49] pmcd(99908) Error: AcceptNewClient(0): Unexpected error from __pmAccept: 24: Too many open files

It takes about 2 days for pmcd.log to grow from 0 to 30GB. As a workaround I set up daily log rotation for all PCP logs:

/var/log/pcp/*/*.log {
    rotate 5
    daily
    missingok
    copytruncate
    su root root
    lastaction
        /bin/systemctl restart pmcd pmproxy
    endscript
}
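Alongside logrotate, a small guard script run from cron can cap the log between daily rotations. This is a hypothetical stopgap sketch, not part of the setup above; the path and 1 GiB cap are illustrative, and the copytruncate-style ": > file" emptying keeps pmcd's already-open file descriptor valid:

```shell
truncate_if_huge() {   # truncate_if_huge FILE CAP_BYTES
    _size=$(stat -c %s "$1" 2>/dev/null || echo 0)
    if [ "$_size" -gt "$2" ]; then
        # copytruncate-style: empty the file in place so the daemon's
        # open fd stays valid and keeps writing at a sane offset
        : > "$1"
        echo "truncated $1 (was $_size bytes)" >&2
    fi
}
# e.g. from a cron entry running every 5 minutes:
# truncate_if_huge /var/log/pcp/pmcd/pmcd.log 1073741824
```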

This reduced the occurrence of the problem, but did not eliminate it. I noticed pmcd logs of 3GB in size where log rotation and the service restart prevented further growth; however, today I saw a pmcd.log that grew to 30GB and filled up the free space, which probably caused logrotate to fail, because the logs of other services had not been rotated for 2 days either.

I am not sure if this is related, but a day before the incident there was a segfault recorded in /var/log/messages:

Feb 18 23:47:21 node16 kernel: pmproxy[74409]: segfault at 0 ip 00007f84bbb04eb0 sp 00007f84b5e92b00 error 4 in libuv.so.1.0.0[7f84bbafa000+2c000]
Feb 18 23:47:21 node16 systemd: pmproxy.service: main process exited, code=killed, status=11/SEGV
Feb 18 23:47:21 node16 systemd: Unit pmproxy.service entered failed state.
Feb 18 23:47:21 node16 systemd: pmproxy.service failed.
Feb 18 23:47:22 node16 systemd: pmproxy.service holdoff time over, scheduling restart.
Feb 18 23:47:22 node16 systemd: Stopped Proxy for Performance Metrics Collector Daemon.
Feb 18 23:47:22 node16 systemd: Starting Proxy for Performance Metrics Collector Daemon...
Feb 18 23:47:22 node16 systemd: Started Proxy for Performance Metrics Collector Daemon.

It seems PCP recovered from this error. About 4 hours later log rotation took place and pmcd was restarted as expected. About 6 hours after the rotation the first error was logged in pmcd.log; however, the system log did not record any other problems around that time. In pmproxy.log there were 2 errors logged without time stamps:

on_write_callback: ERROR uv_write failed
on_write_callback: ERROR uv_write failed

No errors were recorded in the PMDAs' logs.

If anyone is interested in investigating this issue I can provide more debug info if necessary.

mac133k avatar Feb 21 '22 18:02 mac133k

hi @mac133k - thanks for reporting this! It's a new one to us ...

pmcd.log filling up with messages like this one:

Could you capture the contents of "sudo ls -l /proc/[pmcd PID]/fd" under these conditions and share it? It'll be a lot of output I imagine, so maybe post a summary of what the file descriptors look like, or copy the whole lot to a GitHub gist?

It takes about 2 days for pmcd.log to grow from 0 to 30GB.

Wow, OK. That log message is ~120 bytes. If we work off 2 days, that's 172800 seconds - at 30GB, if my back-of-the-napkin math is correct, you're getting this message logged ~1500 times per second ... ? I can't think of anything in PCP itself that might do that, so I'm really interested to see that fd listing - it feels like it must be a rogue pmcd client rapidly attempting new connections.
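That estimate is easy to re-derive with a quick awk one-liner (using 30 GiB here; with decimal gigabytes it comes out closer to 1450/sec - either way, roughly 1500 lines per second):

```shell
# Back-of-the-napkin rate check: 30 GiB of ~120-byte error lines over 2 days.
awk 'BEGIN {
    bytes = 30 * 1024^3      # 30 GiB of log growth
    line  = 120              # approximate length of one error line
    secs  = 2 * 24 * 3600    # two days in seconds
    printf "%.0f lines/sec\n", bytes / line / secs
}'
```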

| Feb 18 23:47:21 node16 kernel: pmproxy[74409]: segfault at 0 ip 00007f84bbb04eb0 sp 00007f84b5e92b00 error 4 in libuv.so.1.0.0[7f84bbafa000+2c000]

If you have a core file somewhere (abrt?) it'd be really handy to get a symbolic stack trace from this.

natoscott avatar Feb 22 '22 01:02 natoscott

I had another occurrence of this problem. This time there was no segfault, so that must have been unrelated; unfortunately a coredump was not available to dig deeper into it. I did manage to capture more debug data for the main issue:

  • this time pmcd.log grew from 0 to 46GB in 32 hours.
  • there were 1024 FDs held by pmcd:
    • 1012 sockets
    • 13 pipes
    • 1 file: pmcd.log
  • in /etc/security/limits.conf:
*                soft   nofile           1048576
*                hard   nofile           1048576
  • no other limits in /etc/security/limits.d
  • only one instance of each PCP process:
pcp       24331      1 80 Feb20 ?        1-23:05:32 /usr/libexec/pcp/bin/pmcd -A
root      24335  24331  5 Feb20 ?        03:22:57 /var/lib/pcp/pmdas/root/pmdaroot
root      24336  24335  5 Feb20 ?        03:02:23 /var/lib/pcp/pmdas/proc/pmdaproc -d 3
root      24338  24335  4 Feb20 ?        02:51:53 /var/lib/pcp/pmdas/xfs/pmdaxfs -d 11
root      24339  24335  4 Feb20 ?        02:37:18 /var/lib/pcp/pmdas/linux/pmdalinux
root      24340  24335  3 Feb20 ?        02:20:07 /var/lib/pcp/pmdas/kvm/pmdakvm -d 95
pcp       24341      1  0 Feb20 ?        00:12:28 /usr/libexec/pcp/bin/pmproxy -F -A
  • pmproxy.log recorded:
on_write_callback: ERROR uv_write failed
on_write_callback: ERROR uv_write failed

mac133k avatar Feb 22 '22 14:02 mac133k

A few more details that I need to add:

  • VMs are monitored by Prometheus via PMProxy
  • there are 2 Prometheus instances scraping metrics every 30 seconds:
    • one instance scrapes everything from /metrics, which amounts to about 6.5k metric instances (half of them being cgroup metrics); scrape durations are between 1.2 and 1.6 seconds
    • another instance scrapes only selected metrics:
- job_name: pcp
  honor_timestamps: true
  params:
    target:
    - kernel,mem,swap,network,filesys,disk,hinv

so the URL is http://< hostname >:44323/metrics?target=kernel%2Cmem%2Cswap%2Cnetwork%2Cfilesys%2Cdisk%2Chinv and the number of metric instances is about 1.5k which takes between 0.4 and 0.7 seconds to scrape.

PCP processes do not seem to be overwhelmed with scrape requests judging by the CPU and memory utilisation per process. I suspect using the target param may be somehow causing the issue, because I remember having a problem some time ago with PCP v5.0 or 5.1 where scrapes were incomplete or timing out until I changed Prometheus config and split the original job into multiple jobs with only one metric group as a target each. I'm going to kill the Prometheus instance that has targets in the config and see if it makes any difference here.

mac133k avatar Feb 23 '22 13:02 mac133k

| there were 1024 FDs held by pmcd:

Hmm, suspicious - seems like that maxfd setting of 1048576 hasn't taken effect? Might be worth checking it while pmcd is running via "cat /proc/[pmcd PID]/limits".
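A quick way to check (a sketch; note that /etc/security/limits.conf is applied by PAM at login, so a pmcd started by systemd will generally not pick those values up - the service unit's LimitNOFILE directive governs instead, which would explain the 1024/4096 defaults):

```shell
# Inspect the fd limit the running pmcd actually inherited.
# Falls back to the current process if pmcd is not running.
pid=$(pgrep -x pmcd | head -n1)
grep 'Max open files' "/proc/${pid:-self}/limits"
```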

natoscott avatar Feb 23 '22 22:02 natoscott

Indeed:

# cat /proc/83150/limits 
Limit                     Soft Limit           Hard Limit           Units     
Max cpu time              unlimited            unlimited            seconds   
Max file size             unlimited            unlimited            bytes     
Max data size             unlimited            unlimited            bytes     
Max stack size            8388608              unlimited            bytes     
Max core file size        0                    unlimited            bytes     
Max resident set          unlimited            unlimited            bytes     
Max processes             31128                31128                processes 
Max open files            1024                 4096                 files     
Max locked memory         65536                65536                bytes     
Max address space         unlimited            unlimited            bytes     
Max file locks            unlimited            unlimited            locks     
Max pending signals       31128                31128                signals   
Max msgqueue size         819200               819200               bytes     
Max nice priority         0                    0                    
Max realtime priority     0                    0                    
Max realtime timeout      unlimited            unlimited            us 

However 1024 is way more than a healthy PMCD process needs, so I would rather dig deeper to find the root cause than lift this limit wherever it is set.

mac133k avatar Feb 24 '22 12:02 mac133k

Agreed, 1024 should be plenty. I wonder if we might be seeing a failure to clean up timed-out connections between pmproxy and pmcd. You could experiment with larger values (well beyond your Prometheus scrape interval) of the polltimeout setting - in /etc/pcp/pmproxy/pmproxy.conf that would be:

[pmwebapi]
timeout = 120

(this is a value in seconds)

natoscott avatar Feb 24 '22 22:02 natoscott

Querying the metric endpoints without the target params helped significantly reduce the occurrence of this issue, but did not eliminate it completely.

mac133k avatar Apr 01 '22 12:04 mac133k

Just thought I'd chime in here- I'm having the same problem. Using prometheus, RedHat 8.5, and PCP 5.3.5. pmcd runs fine for weeks, but at some point something makes it go nuts and it rapidly consumes file descriptors until it hits the limit, and then the log file fills up my disk partition very rapidly.

Kevin

-- Kevin Hildebrand University of Maryland

ke3vin avatar May 06 '22 18:05 ke3vin

@ke3vin @mac133k could you do some analysis of the contents of the /proc/[pmcd PID]/fd directory when the issue occurs? I think that's got to be the first bit of information we need to diagnose further.

If someone can produce an exact recipe to induce the failure (ideally as fast as possible), please paste it here too. Thanks!
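A minimal helper for that kind of summary (a sketch, assuming a Linux /proc layout; counts socket and pipe fds versus everything else) might look like:

```shell
# fd_summary PID: summarize the fd types held by a process.
fd_summary() {
    ls -l "/proc/$1/fd" | awk '
        /socket:\[/ { s++ }   # symlink targets like socket:[5652133]
        /pipe:\[/   { p++ }   # symlink targets like pipe:[12345]
        /-> /       { t++ }   # every fd symlink line
        END { printf "sockets=%d pipes=%d other=%d\n", s, p, t - s - p }'
}
# e.g. on an affected host:
# fd_summary "$(pgrep -x pmcd | head -n1)"
```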

natoscott avatar May 09 '22 00:05 natoscott

I haven't figured out how to duplicate it yet, but as I have it on about two dozen hosts, at least one of them seems to get into this situation almost daily. I've got one that way now. Lots of open sockets from localhost->localhost:

From lsof (just a small chunk of the thousand+ open sockets):

pmcd 2236 pcp 1009u IPv6 5670302 0t0 TCP localhost:pmcd->localhost:37096 (ESTABLISHED)
pmcd 2236 pcp 1010u IPv6 5670335 0t0 TCP localhost:pmcd->localhost:37150 (ESTABLISHED)
pmcd 2236 pcp 1011u IPv6 5670343 0t0 TCP localhost:pmcd->localhost:37204 (ESTABLISHED)
pmcd 2236 pcp 1012u IPv6 5671407 0t0 TCP localhost:pmcd->localhost:37258 (ESTABLISHED)
pmcd 2236 pcp 1013u IPv6 5670348 0t0 TCP localhost:pmcd->localhost:37312 (ESTABLISHED)
pmcd 2236 pcp 1014u IPv6 5670397 0t0 TCP localhost:pmcd->localhost:37366 (ESTABLISHED)
pmcd 2236 pcp 1015u IPv6 5673037 0t0 TCP localhost:pmcd->localhost:37420 (ESTABLISHED)
pmcd 2236 pcp 1016u IPv6 5669608 0t0 TCP localhost:pmcd->localhost:37474 (ESTABLISHED)
pmcd 2236 pcp 1017u IPv6 5675230 0t0 TCP localhost:pmcd->localhost:37528 (ESTABLISHED)
pmcd 2236 pcp 1018u IPv6 5675262 0t0 TCP localhost:pmcd->localhost:37582 (ESTABLISHED)
pmcd 2236 pcp 1019u IPv6 5669637 0t0 TCP localhost:pmcd->localhost:37636 (ESTABLISHED)
pmcd 2236 pcp 1020u IPv6 5676375 0t0 TCP localhost:pmcd->localhost:37690 (ESTABLISHED)

A similar snippet from /proc/2236/fd:

lrwx------. 1 root root 64 May 9 08:48 979 -> 'socket:[5652133]'
lrwx------. 1 root root 64 May 9 08:48 98 -> 'socket:[5245025]'
lrwx------. 1 root root 64 May 9 08:48 980 -> 'socket:[5645176]'
lrwx------. 1 root root 64 May 9 08:48 981 -> 'socket:[5655122]'
lrwx------. 1 root root 64 May 9 08:48 982 -> 'socket:[5652150]'
lrwx------. 1 root root 64 May 9 08:48 983 -> 'socket:[5645233]'
lrwx------. 1 root root 64 May 9 08:48 984 -> 'socket:[5655916]'
lrwx------. 1 root root 64 May 9 08:48 985 -> 'socket:[5655964]'
lrwx------. 1 root root 64 May 9 08:48 986 -> 'socket:[5645247]'
lrwx------. 1 root root 64 May 9 08:48 987 -> 'socket:[5661697]'
lrwx------. 1 root root 64 May 9 08:48 988 -> 'socket:[5656244]'
lrwx------. 1 root root 64 May 9 08:48 989 -> 'socket:[5661430]'
lrwx------. 1 root root 64 May 9 08:48 99 -> 'socket:[5243452]'
lrwx------. 1 root root 64 May 9 08:48 990 -> 'socket:[5661491]'

I'll leave the broken process running for a bit, if you've got specific stuff you want me to look for, let me know.
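To watch whether those connections keep accumulating without paging through lsof, the ESTABLISHED sockets on pmcd's port can be counted straight from /proc (a sketch: 0xAD21 is 44321 in hex, and kernel state 01 is ESTABLISHED; tcp6 is included for the IPv6 sockets shown above):

```shell
# Count ESTABLISHED connections involving port 44321 (pmcd) from /proc/net.
pmcd_conns() {
    cat /proc/net/tcp /proc/net/tcp6 2>/dev/null |
        awk '$4 == "01" && ($2 ~ /:AD21$/ || $3 ~ /:AD21$/) { n++ }
             END { print n + 0 }'
}
pmcd_conns
```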

Thanks, Kevin


ke3vin avatar May 09 '22 12:05 ke3vin

Ah, at least on my host, it's 'pmproxy' that has the other end of those connections...

pmproxy 2312 pcp 1000u IPv6 5656573 0t0 TCP localhost:37042->localhost:pmcd (ESTABLISHED)
pmproxy 2312 pcp 1001u IPv6 5669522 0t0 TCP localhost:37096->localhost:pmcd (ESTABLISHED)
pmproxy 2312 pcp 1002u IPv6 5667108 0t0 TCP localhost:37150->localhost:pmcd (ESTABLISHED)
pmproxy 2312 pcp 1003u IPv6 5671371 0t0 TCP localhost:37204->localhost:pmcd (ESTABLISHED)
pmproxy 2312 pcp 1004u IPv6 5669581 0t0 TCP localhost:37258->localhost:pmcd (ESTABLISHED)
pmproxy 2312 pcp 1005u IPv6 5669590 0t0 TCP localhost:37312->localhost:pmcd (ESTABLISHED)
pmproxy 2312 pcp 1006u IPv6 5672226 0t0 TCP localhost:37366->localhost:pmcd (ESTABLISHED)
pmproxy 2312 pcp 1007u IPv6 5670403 0t0 TCP localhost:37420->localhost:pmcd (ESTABLISHED)
pmproxy 2312 pcp 1008u IPv6 5673068 0t0 TCP localhost:37474->localhost:pmcd (ESTABLISHED)
pmproxy 2312 pcp 1009u IPv6 5669630 0t0 TCP localhost:37528->localhost:pmcd (ESTABLISHED)
pmproxy 2312 pcp 1010u IPv6 5670467 0t0 TCP localhost:37582->localhost:pmcd (ESTABLISHED)
pmproxy 2312 pcp 1011u IPv6 5670487 0t0 TCP localhost:37636->localhost:pmcd (ESTABLISHED)
pmproxy 2312 pcp 1012u IPv6 5670512 0t0 TCP localhost:37690->localhost:pmcd (ESTABLISHED)
pmproxy 2312 pcp 1013u IPv6 5669668 0t0 TCP localhost:37744->localhost:pmcd (ESTABLISHED)


ke3vin avatar May 09 '22 12:05 ke3vin

Any additional thoughts on this? I'm having this issue almost daily when looking at ~20 machines. I need to add another 350 machines to monitoring soon and would like to get this resolved.

Thanks, Kevin

ke3vin avatar Jun 02 '22 17:06 ke3vin

@ke3vin Do you have a multiple-item params/target section in your scrape configs? It did help in my case to reduce or almost eliminate the occurrence of the problem when I removed those params and started scraping everything from the /metrics endpoints.

mac133k avatar Jun 02 '22 17:06 mac133k

Any additional thoughts on this? I'm having this issue almost daily when looking at ~20 machines. I need to add another 350 machines to monitoring soon and would like to get this resolved.

I've been flat out, sorry - and off on leave next week. If you can try to find a recipe that reproduces the failure as fast as possible, that'll help immensely when I get back (both for diagnosing and then testing fixes).

thanks.

natoscott avatar Jun 03 '22 05:06 natoscott

@mac133k @ke3vin I've been trying (unsuccessfully) to reproduce this issue over the last couple of days. Submitting thousands of concurrent requests to and from pmproxy works just fine - I can see hundreds of new connections arriving very quickly and then they slowly time out or get reused, ultimately always ending up in the correct steady state of open fds - for both pmcd and pmproxy. I think I'm going to have to dig deeper into the Prometheus configurations being used to reproduce this. Can you provide prometheus.yml configuration snippets that show me how to mirror your setups? Thanks.

One of the mysteries to me is why pmcd is the one running out of fds (and why not also pmproxy) - AIUI in your setups you have a single Prometheus instance talking to many remote-host PCP setups (each remote host having one pmcd and one pmproxy, right?). In this situation we should see just a small number of fds for both pmcd and pmproxy. One fd between the two and one fd open to Prometheus for each host during a scrape - so Prometheus should have more open fds than anyone (assuming some form of keep-alive is happening there).

Any additional information that would help me reproduce this would be much appreciated - thanks!
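In the meantime, a lightweight sampler run on an affected host (an illustrative sketch) could show which daemon's fd count starts climbing first:

```shell
# Print the open-fd count of the first process with the given name
# (0 if no such process is running).
sample_fds() {
    pid=$(pgrep -x "$1" | head -n1)
    if [ -n "$pid" ]; then
        ls "/proc/$pid/fd" 2>/dev/null | wc -l
    else
        echo 0
    fi
}
# e.g. append one sample per minute to a log:
# while sleep 60; do
#     printf '%s pmcd=%s pmproxy=%s\n' "$(date)" \
#         "$(sample_fds pmcd)" "$(sample_fds pmproxy)"
# done
```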

natoscott avatar Jun 21 '22 02:06 natoscott

Sorry for the delay, I was in the middle of a big proposal. I should be able to focus on this again now. I've attached my Prometheus config. We've typically been recovering from the issue by removing the large log files and restarting pmcd, but I've recently noticed that this appears to leave pmproxy in a broken state: doing 'wget localhost:44323/metrics' returns 400 Bad Request, so pmproxy also needs to be restarted.

prometheus.yml.txt

ke3vin avatar Jun 24 '22 18:06 ke3vin

No problem @ke3vin - thanks for those config details. Any chance I could get a temporary login on one of the systems affected to attempt further diagnosis? (if this could be feasible, please contact me privately via my redhat.com address to set this up - thanks).

natoscott avatar Jun 27 '22 06:06 natoscott

Has this issue been fixed? I've got repro systems that are on RHEL 8.4 with PCP updated to 5.3:

rpm -qa | grep pcp
pcp-pmda-hacluster-5.3.5-8.el8.x86_64
pcp-libs-5.3.5-8.el8.x86_64
pcp-conf-5.3.5-8.el8.x86_64
pcp-selinux-5.3.5-8.el8.x86_64
pcp-5.3.5-8.el8.x86_64

rsponholtz avatar Apr 26 '24 16:04 rsponholtz

@rsponholtz not specifically, but there have been many changes in the meantime since this issue was opened (and since pcp-5.3.x) - a reproducible test case would be invaluable here.

natoscott avatar Apr 28 '24 21:04 natoscott

VM information

  • VM OS: Ubuntu 22.04
  • Kernel: 5.15.0-87-generic
  • pcp package: 5.3.6-1build1

We have the same problem as reported by mac133k and ke3vin. A few dozen VMs reported disk-full errors, with the storage almost entirely consumed by pmcd.log filled with [time stamp] pmcd(1030) Error: AcceptNewClient(0): Unexpected error from __pmAccept: 24: Too many open files log entries.

The head of pmcd.log shows the following:


active agent dom   pid  in out ver protocol parameters
============ === ===== === === === ======== ==========
root           1  1033   6   7   2 bin pipe cmd=/var/lib/pcp/pmdas/root/pmdaroot
pmcd           2                 2 dso i:6  lib=/var/lib/pcp/pmdas/pmcd/pmda_pmcd.so entry=pmcd_init [0x7fe4047d1fa0]
proc           3  1034   9  10   2 bin pipe cmd=/var/lib/pcp/pmdas/proc/pmdaproc -d 3
xfs           11  1039  11  12   2 bin pipe cmd=/var/lib/pcp/pmdas/xfs/pmdaxfs -d 11
linux         60  1044  13  14   2 bin pipe cmd=/var/lib/pcp/pmdas/linux/pmdalinux
pmproxy        4                 2 dso i:7  lib=/var/lib/pcp/pmdas/mmv/pmda_mmv.so entry=pmproxy_init [0x7fe4047c0c50]
mmv           70                 2 dso i:7  lib=/var/lib/pcp/pmdas/mmv/pmda_mmv.so entry=mmv_init [0x7fe4047c0c90]
jbd2         122                 2 dso i:4  lib=/var/lib/pcp/pmdas/jbd2/pmda_jbd2.so entry=jbd2_init [0x7fe4047b7070]
kvm           95  1054  15  16   2 bin pipe cmd=/var/lib/pcp/pmdas/kvm/pmdakvm -d 95

Host access list:
00 01 Cur/MaxCons host-spec                               host-mask                               lvl host-name
== == =========== ======================================= ======================================= === ==============
 y  y     0     0 127.0.1.1                               255.255.255.255                           0 localhost
 y  y     0     0 /                                       /                                         1 unix:
    n     0     0 0.0.0.0                                 0.0.0.0                                   4 .*
    n     0     0 ::                                      ::                                        8 :*
User access list empty: user-based access control turned off
Group access list empty: group-based access control turned off


pmcd: PID = 1030, PDU version = 2
pmcd request port(s):
  sts fd   port  family address
  === ==== ===== ====== =======
  ok     4       unix   /run/pcp/pmcd.socket
  ok     0 44321 inet   INADDR_ANY
  ok     3 44321 ipv6   INADDR_ANY
__pmLogFindLocalPorts: scandir: Too many open files
__pmLogFindLocalPorts: scandir: Too many open files
__pmLogFindLocalPorts: scandir: Too many open files
__pmLogFindLocalPorts: scandir: Too many open files

syslog contains the following log entries. These pmie alerts may be an indication of the start of the abnormal event; it could also be pmcd running out of fds. The pmcd process was at high CPU utilization.

Apr 23 07:28:17 node-116 systemd[1]: pmie_farm_check.service: Deactivated successfully.
Apr 23 07:32:00 node-116 pcp-pmie[1811]: TCP listen overflows 0.0529275/sec@node-116
Apr 23 07:32:35 node-116 pcp-pmie[1811]: High aggregate context switch rate 85431ctxsw/s@node-116
Apr 23 07:41:25 node-116 pcp-pmie[1811]: TCP listen overflows 0.224999/sec@node-116
Apr 23 07:42:35 node-116 pcp-pmie[1811]: High aggregate context switch rate 110349ctxsw/s@node-116
Apr 23 07:51:25 node-116 pcp-pmie[1811]: TCP listen overflows 0.299987/sec@node-116
Apr 23 07:52:35 node-116 pcp-pmie[1811]: High aggregate context switch rate 113741ctxsw/s@node-116
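The "TCP listen overflows" pmie alert corresponds to kernel counters that can be read directly; a sketch that extracts just the listen-queue counters from /proc/net/netstat (which stores each counter group as a header line followed by a values line):

```shell
# Dump the TcpExt listen-queue counters behind the pmie alert.
awk '/^TcpExt:/ {
        if (!hdr) { hdr = $0; next }    # first TcpExt line: field names
        n = split(hdr, h); split($0, v) # second TcpExt line: values
        for (i = 2; i <= n; i++)
            if (h[i] == "ListenOverflows" || h[i] == "ListenDrops")
                print h[i], v[i]
     }' /proc/net/netstat
```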

lsof -c pmcd | grep localhost | wc -l returns 1004 with output similar to the following:

pmcd    1030  pcp 1001u  IPv4           26875416          0t0     TCP localhost:44321->localhost:44886 (ESTABLISHED)

lsof -c pmproxy | grep localhost | wc -l returns 1004 with the following type of output:

pmproxy 1119894  pcp 1012u     IPv4           26875776      0t0     TCP localhost:33448->localhost:44321 (ESTABLISHED)

pcp is installed as is without any configuration changes. There is no monitoring or any other non-pcp related process accessing pmcd/pmproxy/pmlogger.

I can't figure out why there are so many connections between pmcd and pmproxy without an apparent triggering event.

One batch of hundreds of VMs has had no problems (over six months).

This 2nd batch of hundreds of VMs (created last month) is not so fortunate, with dozens of VMs showing disk-full errors.

weistonedawei avatar May 10 '24 21:05 weistonedawei