Running derived timeseries query on pmproxy causes segfault
### Description

We are using the grafana-pcp plugin to query timeseries data from PCP. Querying for a derived expression, such as `mem.util.active{hostname == "$hostname"} / mem.physmem{hostname == "$hostname"}`, causes pmproxy to segfault.
The kernel log entries from the crashes are:

```
pmproxy[924651]: segfault at 19d0e3 ip 00005647510e2a76 sp 00007ffd0c982bb0 error 4 in pmproxy[5647510d1000+1f000]
```

Or occasionally:

```
pmproxy[929076] general protection fault ip:558b5b6c6a76 sp:7ffdba257e30 error:0 in pmproxy[558b5b6b5000+1f000]
```
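For anyone trying to reproduce this outside Grafana: grafana-pcp issues such queries through pmproxy's `/series/query` REST endpoint (PMWEBAPI). A minimal sketch of building that request, assuming the default pmproxy port 44322 and a hypothetical hostname:

```python
from urllib.parse import quote

# Hypothetical hostname; substitute a host known to your pmseries setup.
hostname = "node1.example.com"
expr = (f'mem.util.active{{hostname == "{hostname}"}}'
        f' / mem.physmem{{hostname == "{hostname}"}}')

# pmproxy serves the PMWEBAPI REST interface, by default on port 44322.
url = "http://localhost:44322/series/query?expr=" + quote(expr, safe="")
print(url)
# Fetching this URL (e.g. with urllib.request.urlopen) is what
# triggers the segfault on the affected pmproxy versions.
```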
### Versions

```
$ head -n2 /etc/os-release
NAME="Red Hat Enterprise Linux"
VERSION="8.6 (Ootpa)"
$ rpm -qa | grep pcp
pcp-conf-5.3.7-1.x86_64
pcp-pmda-dm-5.3.7-1.x86_64
pcp-pmda-netfilter-5.3.7-1.x86_64
grafana-pcp-3.2.0-1.el8.x86_64
python3-pcp-5.3.7-1.x86_64
pcp-zeroconf-5.3.7-1.x86_64
pcp-pmda-systemd-5.3.7-1.x86_64
pcp-selinux-5.3.7-1.x86_64
pcp-pmda-openmetrics-5.3.7-1.x86_64
pcp-pmda-bonding-5.3.7-1.x86_64
pcp-pmda-smart-5.3.7-1.x86_64
pcp-pmda-redis-5.3.7-1.x86_64
pcp-libs-5.3.7-1.x86_64
pcp-pmda-nfsclient-5.3.7-1.x86_64
cockpit-pcp-264.1-1.el8.x86_64
pcp-pmda-rsyslog-5.3.7-1.x86_64
pcp-pmda-lmsensors-5.3.7-1.x86_64
pcp-pmda-sockets-5.3.7-1.x86_64
pcp-pmda-mounts-5.3.7-1.x86_64
pcp-system-tools-5.3.7-1.x86_64
pcp-pmda-nginx-5.3.7-1.x86_64
pcp-pmda-podman-5.3.7-1.x86_64
pcp-5.3.7-1.x86_64
pcp-doc-5.3.7-1.noarch
pcp-pmda-elasticsearch-5.3.7-1.x86_64
$ pcp
Performance Co-Pilot configuration on $hostname:

 platform: Linux $hostname 4.18.0-372.16.1.el8_6.x86_64 #1 SMP Tue Jun 28 03:02:21 EDT 2022 x86_64
 hardware: 20 cpus, 10 disks, 2 nodes, 386794MB RAM
 timezone: UTC
 services: pmcd pmproxy
     pmcd: Version 5.3.7-1, 23 agents, 12 clients
     pmda: root pmcd proc pmproxy xfs redis podman linux nfsclient mmv
           mounts lmsensors kvm netfilter rsyslog elasticsearch systemd
           nginx jbd2 dm openmetrics smart sockets
```
### Stack Trace

```
Process 2309945 (pmproxy) of user 991 dumped core.

Stack trace of thread 2309945:
#0  0x000055612219f8e4 pmseries_log (pmproxy)
#1  0x00007f59e9ce72d6 series_calculate_binary_check (libpcp_web.so.1)
#2  0x00007f59e9ceb851 series_calculate.part.15 (libpcp_web.so.1)
#3  0x00007f59e9cf145d series_query_funcs_report_values (libpcp_web.so.1)
#4  0x00007f59e9ce7663 series_query_end_phase (libpcp_web.so.1)
#5  0x00007f59e9d0254b redisSlotsReplyCallback (libpcp_web.so.1)
#6  0x00007f59e9d1a310 redisClusterAsyncCallback (libpcp_web.so.1)
#7  0x00007f59e9d0ccf5 redisProcessCallbacks (libpcp_web.so.1)
#8  0x00007f59e9d02319 redisLibuvPoll (libpcp_web.so.1)
#9  0x00007f59e9ac4d15 uv__io_poll (libuv.so.1)
#10 0x00007f59e9ab3a74 uv_run (libuv.so.1)
#11 0x00005561221985c2 main_loop (pmproxy)
#12 0x000055612219798e main (pmproxy)
#13 0x00007f59e8c15cf3 __libc_start_main (libc.so.6)
#14 0x0000556122197d5e _start (pmproxy)
```
@scrufulufugus thanks for the report - I've reproduced the problem locally now.
It looks like the failure is on an error path: the original problem is that the number of samples differs between the two metrics in the expression. The crash itself is a cascading error, because something is not correctly set up by the time we go to log that original error. I'll keep digging to understand why both of these things happen, but since I can reproduce it easily I should have a fix before too long.
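For readers following along, the failure class described above (an error-reporting path that dereferences state which the failed setup never filled in) can be sketched as follows. This is purely illustrative pseudocode in Python, not pmproxy's actual structures; all names are hypothetical. The point is that the logger must tolerate a partially initialized query context:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class QueryState:
    """Hypothetical stand-in for per-query state reported on errors.

    On early errors, expr may still be None because setup never
    completed -- exactly the situation the logger must survive.
    """
    expr: Optional[str]
    nleft: int   # samples from the left operand
    nright: int  # samples from the right operand

def log_sample_mismatch(qs: Optional[QueryState]) -> str:
    """Defensive error formatter: never touch fields that may be unset."""
    expr = qs.expr if qs is not None and qs.expr is not None else "<unknown expr>"
    nleft = qs.nleft if qs is not None else -1
    nright = qs.nright if qs is not None else -1
    return f"mismatched sample counts ({nleft} vs {nright}) evaluating {expr}"

# Partially initialized state, as on the crashing error path:
message = log_sample_mismatch(QueryState(expr=None, nleft=240, nright=239))
print(message)
```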