[BUG] Service response time roughly tripled after deploying deepflow-agent
Search before asking
- [X] I had searched in the issues and found no similar feature requirement.
DeepFlow Component
Agent
What you expected to happen
Our agent config file is like this:
vtap_group_id: g-e7f3f8db93
tap_interface_regex: .*
process_threshold: 30
external_agent_http_proxy_enabled: 1
external_agent_http_proxy_port: 38086
static_config:
  ebpf:
    thread-num: 5
    on-cpu-profile:
      disabled: true
  l7-protocol-enabled:
    - HTTP  # for both HTTP and HTTP_TLS
    - MySQL
    - Redis
    - Kafka
l7_log_collect_nps_threshold: 100000
thread_threshold: 1000
l4_log_tap_types:
  - 0
l7_log_packet_size: 1500
http_log_trace_id: traceparent,sw8,x-b3-traceid
http_log_span_id: traceparent,sw8,x-b3-traceid
http_log_proxy_client: 关闭  # "关闭" = disabled
When we started stress testing after deploying the deepflow-agent in the cluster, we found that the average response time of our Java microservices roughly tripled, from just over 100 ms to about 400 ms. We removed the protocols we are unlikely to use from the agent config and tried increasing its CPU and memory limits, but it did not help. We even tried l4_log_tap_types: -1, but that did not help either. We are using the latest version, 6.4.3 (agent/server). Is this the expected performance overhead of the agent?
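For clarity, the extra change we tried looked roughly like the snippet below. This is only a minimal sketch against the agent-group config shown above; we are assuming a tap type of -1 is the value that disables L4 flow log collection entirely.

```yaml
# Sketch of the change we tried on top of the config shown above.
# Assumption: a tap type of -1 disables L4 flow log collection for all interfaces.
l4_log_tap_types:
  - -1
```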
How to reproduce
No response
DeepFlow version
6.4.3
DeepFlow agent list
DaemonSet, 7 pods
Kubernetes CNI
Antrea
Operation-System/Kernel version
4.1.19
Anything else
No response
Are you willing to submit a PR?
- [ ] Yes I am willing to submit a PR!
Code of Conduct
- [X] I agree to follow this project's Code of Conduct
Hello, there are some questions about this issue that need to be clarified:
- What tool did you use for the benchmark? We have used wrk2 (https://github.com/giltene/wrk2), but found that this tool distorts RT under extreme TPS pressure.
- Which testing method did you use: A) fixed TPS, tested with and without deepflow-agent running; or B) maximum achievable TPS, tested with and without deepflow-agent running? If it's method A, please confirm that no single logical core reached 100% during the test, since a core running at 100% can by itself cause a significant degradation in RT. If it's method B, it almost certainly means that some core did run at 100%. We typically use method A and make sure that no single core reaches 100% while deepflow-agent is running. For our testing method, please refer to: https://deepflow.io/blog/zh/030-deepflow-agent-ebpf-benchmark/
- What was the TPS during the stress test?
- During the testing process, how was the overall CPU usage and system load of the machine?
- Can you confirm that your kernel version is really 4.1.19? I'm concerned it might be a typo for 4.19.
Since there has been no response for a long time, this issue will be closed.