
[BUG] Does deepflow-agent affect the performance of the Application ?

Open kwenzh opened this issue 2 years ago • 12 comments

Search before asking

  • [X] I had searched in the issues and found no similar feature requirement.

DeepFlow Component

Agent

What you expected to happen

After deploying DeepFlow in our Kubernetes cluster, we found a performance degradation in the programs running inside it: higher latency and lower QPS, mainly affecting HTTP services and MQ consumer tasks. Performance decreased by about 40%.

How to reproduce

Make a test case.

I ran a simple test: start an HTTP API and benchmark it with the ab tool. Demo code:

#!/usr/bin/env python
# -*- coding:utf-8 -*-
"""Minimal Python 2 HTTP server used for the benchmark."""

from BaseHTTPServer import HTTPServer, BaseHTTPRequestHandler
import urlparse
import random


class HTTPHandler(BaseHTTPRequestHandler):

    def do_GET(self):
        print self.path
        res = urlparse.urlparse(self.path)
        param = urlparse.parse_qs(res.query)

        # Echo the "num" query parameter followed by 10 random characters.
        resp = ""
        for k, v in param.items():
            if k == "num":
                val = v[0]
                rs = random.sample("xxxxsfsdffghehuwfnsajfisddjsddmidsa", 10)
                print k, v[0], rs
                resp = val + "".join(rs)

        self.send_response(200, "OK")
        self.end_headers()
        self.wfile.write(resp)


httserver = HTTPServer(("0.0.0.0", 8001), HTTPHandler)

print ">>>>>>>>>> start "
httserver.serve_forever()
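
In case anyone wants to reproduce this on Python 3, here is a minimal sketch of an equivalent server (standard library only; it mirrors the script above and is not the exact code used in the test):

#!/usr/bin/env python3
# Minimal Python 3 sketch of the test server above (assumption: behavior-equivalent rewrite).
from http.server import HTTPServer, BaseHTTPRequestHandler
from urllib.parse import urlparse, parse_qs
import random


class HTTPHandler(BaseHTTPRequestHandler):

    def do_GET(self):
        # Echo the "num" query parameter followed by 10 random characters.
        params = parse_qs(urlparse(self.path).query)
        resp = ""
        if "num" in params:
            rs = random.sample("xxxxsfsdffghehuwfnsajfisddjsddmidsa", 10)
            resp = params["num"][0] + "".join(rs)
        body = resp.encode("utf-8")
        self.send_response(200, "OK")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


if __name__ == "__main__":
    print(">>>>>>>>>> start")
    HTTPServer(("0.0.0.0", 8001), HTTPHandler).serve_forever()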

Client test, e.g.: ab -n 2000 -c 10 'http://k8s-ip:nodeport/get_num?num=22'

Result without deepflow-agent:

Server Software:        BaseHTTP/0.3
Server Hostname:        10.65.138.101
Server Port:            32246

Document Path:          /get_num?num=22
Document Length:        12 bytes

Concurrency Level:      10
Time taken for tests:   0.458 seconds
Complete requests:      2000
Failed requests:        0
Write errors:           0
Total transferred:      206000 bytes
HTML transferred:       24000 bytes
Requests per second:    4363.35 [#/sec] (mean)
Time per request:       2.292 [ms] (mean)
Time per request:       0.229 [ms] (mean, across all concurrent requests)
Transfer rate:          438.89 [Kbytes/sec] received


Result with deepflow-agent running:


Document Path:          /get_num?num=22
Document Length:        12 bytes

Concurrency Level:      10
Time taken for tests:   0.628 seconds
Complete requests:      2000
Failed requests:        0
Write errors:           0
Total transferred:      206000 bytes
HTML transferred:       24000 bytes
Requests per second:    3186.86 [#/sec] (mean)
Time per request:       3.138 [ms] (mean)
Time per request:       0.314 [ms] (mean, across all concurrent requests)
Transfer rate:          320.55 [Kbytes/sec] received



It looks like QPS decreased by roughly 27% (from 4363.35 to 3186.86 requests per second). I know deepflow-agent uses eBPF technology.

DeepFlow version

deepflow version: v6.2.6, kernel version: 5.15.72, k8s version: v1.18.19

DeepFlow agent list

k8s cluster, each node has a deepflow-agent pod.

deepflow-ctl agent list
VTAP_ID  NAME                                 TYPE    CTRL_IP  CTRL_MAC           STATE   GROUP    EXCEPTIONS
2        dev-szdl-k8s-slave-5.novalocal-V9    K8S_VM  10.x     fe:fc:fe:03:45:ae  NORMAL  default
3        dev-szdl-k8s-slave-7.novalocal-V1    K8S_VM  10.x     fe:fc:fe:68:15:78  NORMAL  default
4        dev-szdl-k8s-slave-6.novalocal-V10   K8S_VM  10.x     fe:fc:fe:5c:b5:0f  NORMAL  default
5        dev-szdl-k8s-slave-1.novalocal-V7    K8S_VM  10.x     fe:fc:fe:71:51:77  NORMAL  default
6        dev-szdl-k8s-slave-4.novalocal-V2    K8S_VM  10.x     fe:fc:fe:59:5f:6a  NORMAL  default
7        dev-szdl-k8s-slave-3.novalocal-V3    K8S_VM  10.x     fe:fc:fe:6f:fa:eb  NORMAL  default
8        dev-szdl-k8s-slave-2.novalocal-V8    K8S_VM  10.x     fe:fc:fe:43:0a:0b  NORMAL  default

Kubernetes CNI

calico

Operation-System/Kernel version

5.15.72 (5.15.72-1.sdc.el7.elrepo.x86_64)

Anything else

I know deepflow-agent uses eBPF technology,

so I would like to confirm whether deepflow-agent affects the Linux kernel's network forwarding performance or the CPU performance of programs in the cluster, for example through CPU scheduling and network forwarding overhead.
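
A rough way to check where the overhead shows up (assuming metrics-server is installed; the deepflow and app namespaces below are placeholders, not taken from this cluster) is to compare the agent's own CPU usage with the application pods while the benchmark is running:

# resource usage of the deepflow-agent pods
kubectl top pods -n deepflow
# resource usage of the application pods under test
kubectl top pods -n <app-namespace>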

In another test against an HTTP POST API, I compared running the deepflow-agent with not running it: with the agent running, QPS drops from 5000+ to 2000+, almost -50%.

[screenshot]

Are you willing to submit a PR?

  • [ ] Yes I am willing to submit a PR!

Code of Conduct

kwenzh avatar May 18 '23 12:05 kwenzh

Discussion: https://github.com/orgs/deepflowio/discussions/3183

kwenzh avatar May 18 '23 12:05 kwenzh

Hello, eBPF does have some performance overhead; I talked about it in my last live broadcast. We're currently sorting through the data and will have the first version publicly available soon. Also, could you try disabling eBPF, or just the eBPF uprobe, and test again? You can add the WeChat contact at the bottom of the README and we can communicate there.

Nick-0314 avatar May 24 '23 00:05 Nick-0314

disabling eBPF, or just the eBPF uprobe

OK, thank you. Will disabling the eBPF probe have any effect, for example on network topology monitoring capabilities?

kwenzh avatar May 24 '23 01:05 kwenzh

Network topology is not affected, but distributed tracing is.

Nick-0314 avatar May 24 '23 01:05 Nick-0314

https://mp.weixin.qq.com/s/oNrTG4ExNOvwV6luPaC4zA

We have some performance test data for reference, and you can also try turning off only the eBPF uprobe and testing again @kwenzh

Nick-0314 avatar Jun 02 '23 03:06 Nick-0314

Hey guys,

I was doing some performance tests and I think the deepflow-agent is impacting throughput and requests per second on a K8s cluster. How the test was done: I used a pod running a k6 script inside the K8s cluster, targeting an nginx server running on a VM. All tests were run on DeepFlow 6.3.5.

This is the result WITH the deepflow-agent running. As the image shows, the script peaked at 11.37 req/s.

[screenshot]

This is the result WITHOUT the deepflow-agent running, for the same script on the same cluster (I only deleted the DaemonSet). As you can see, the test was able to reach 19.1k requests.

[screenshot]

I ran this test twice, with and without the agent, and the results were consistent: very similar requests per second each time. I am running it a third time now and will post the results here in a few moments.

With agent running on K8S - test redo [screenshot]

Without agent running on K8S - test redo [screenshot]

With agent running on K8S - test redo 2 [screenshot]

Last 3 tests compared: here is an overview of the last 3 tests. As we can see, the response time increased considerably when the deepflow-agent was running. [screenshot]

I hope this helps.

dirtyren avatar Aug 14 '23 20:08 dirtyren

Here is an overview of the last 3 tests; as we can see, the response time increased considerably when the deepflow-agent was running

Yes, same here. I tried adjusting the deepflow-agent parameters and things got a little better; maybe you can try it: https://deepflow.io/docs/zh/install/advanced-config/agent-advanced-config/

vtap_group_id: g-d32cd8e4ef
capture_packet_size: 2048
static_config:
  ebpf:
    disabled: true

kwenzh avatar Aug 15 '23 03:08 kwenzh

@dirtyren Alternatively, you can try shutting down the eBPF uprobe and testing again

vtap_group_id: g-d32cd8e4ef
capture_packet_size: 2048
static_config:
  ebpf:
    uprobe-process-name-regexs:
      golang-symbol: ""
      golang: ""
      openssl: ""

Nick-0314 avatar Aug 15 '23 03:08 Nick-0314

@dirtyren Alternatively, you can try shutting down the eBPF uprobe and testing again

vtap_group_id: g-d32cd8e4ef
capture_packet_size: 2048
static_config:
  ebpf:
    uprobe-process-name-regexs:
      golang-symbol: ""
      golang: ""
      openssl: ""

I applied this config using deepflow-ctl, but the dashboards are still showing eBPF sources in the last 5 minutes and the performance test yielded the same results.

vtap_group_id: g-3c66e436c9
log_level: ERROR
tap_interface_regex: '^(tap.*|gke.*|cali.*|veth.*|eth.*|en[ospx].*|lxc.*|lo|[0-9a-f]+_h)$'
external_agent_http_proxy_enabled: 1   # required
external_agent_http_proxy_port: 38086  # optional, default 38086
capture_packet_size: 2048
static_config:
  ebpf:
    uprobe-process-name-regexs:
      golang-symbol: ""
      golang: ""
      openssl: ""

dirtyren avatar Aug 15 '23 15:08 dirtyren

I think capture_packet_size: 2048 solved my problem; the metrics are very similar with or without the deepflow-agent running.

[screenshot]

dirtyren avatar Aug 15 '23 17:08 dirtyren

I think capture_packet_size: 2048 solved my problem; the metrics are very similar with or without the deepflow-agent running

[screenshot]

Yes, adjusting capture_packet_size: 2048 helps.

kwenzh avatar Apr 15 '24 11:04 kwenzh

Should we close this?

dirtyren avatar Sep 27 '25 18:09 dirtyren