perf: revert NewChunkedBodyWriter
What type of PR is this?
Check the PR title.
- [x] This PR title matches the format: <type>(optional scope): <description>
- [x] The description of this PR title is user-oriented and clear enough for others to understand.
- [ ] Attach the PR updating the user documentation if the current PR requires user awareness at the usage level. User docs repo
(Optional) Translate the PR title into Chinese.
(Optional) More detailed description for this PR(en: English/zh: Chinese).
zh(optional): 取消池化 NewChunkedBodyWriter (en: un-pool NewChunkedBodyWriter)
Large gap in CPU time
Comparing hertz 0.7.1 against 0.6.1, CPU time differs by about 8x.
This is a performance regression.
=======================================================
hertz 0.6.1
=======================================================
Running 1m test @ http://localhost:3000/sse
8 threads and 512 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 23.68ms 17.15ms 307.82ms 83.41%
Req/Sec 1.86k 1.01k 4.68k 60.13%
Latency Distribution
50% 19.53ms
75% 28.76ms
90% 42.53ms
99% 88.00ms
776720 requests in 1.00m, 705.20MB read
Socket errors: connect 212, read 0, write 0, timeout 0
Requests/sec: 12923.79
Transfer/sec: 11.73MB
=======================================================
hertz 0.7.1
=======================================================
Running 1m test @ http://localhost:3000/sse
8 threads and 512 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 29.03ms 21.50ms 312.42ms 79.90%
Req/Sec 1.76k 816.53 4.57k 65.35%
Latency Distribution
50% 23.47ms
75% 37.21ms
90% 55.44ms
99% 109.00ms
630186 requests in 1.00m, 572.16MB read
Socket errors: connect 212, read 0, write 0, timeout 0
Requests/sec: 10488.32
Transfer/sec: 9.52MB
The wrk results above show that every metric regressed in hertz 0.7.1.
runtime.SetFinalizer is not used appropriately here
The original intent of adding runtime.SetFinalizer was to act as a safety net for reclamation, but in practice it causes a performance regression.
https://lemire.me/blog/2023/05/19/the-absurd-cost-of-finalizers-in-go/
The blog above shows that using a finalizer significantly increases the time needed to create and destroy an object, roughly 430x that of the same code without a finalizer. The author recommends avoiding finalizers unless there is a very strong reason to use them.
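As a rough standalone illustration of that cost (a sketch, not hertz code): attaching even a no-op finalizer to each allocation makes allocation plus collection measurably slower, because a finalized object needs an extra GC cycle and a run of the finalizer goroutine before its memory can be reclaimed.

```go
package main

import (
	"fmt"
	"runtime"
	"time"
)

type payload struct{ data [16]byte }

// allocAndCollect allocates n small objects, optionally attaching a
// no-op finalizer to each, then forces GC. With finalizers, objects
// survive the first cycle, get queued for the finalizer goroutine,
// and are only freed on a later cycle.
func allocAndCollect(n int, withFinalizer bool) time.Duration {
	start := time.Now()
	for i := 0; i < n; i++ {
		p := &payload{}
		if withFinalizer {
			runtime.SetFinalizer(p, func(*payload) {})
		}
	}
	runtime.GC()
	runtime.GC() // second cycle lets queued finalizers run and the objects die
	return time.Since(start)
}

func main() {
	const n = 200_000
	fmt.Println("plain:         ", allocAndCollect(n, false))
	fmt.Println("with finalizer:", allocAndCollect(n, true))
}
```

Absolute numbers vary by machine, but the finalizer run is consistently and substantially slower, which matches the blog's conclusion.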
(Optional) Which issue(s) this PR fixes:
(Optional) The PR that updates user documentation:
Codecov Report
All modified and coverable lines are covered by tests :white_check_mark:
Comparison is base (9c3f0b7) 81.81% compared to head (c0b8ca4) 81.78%.
:exclamation: Current head c0b8ca4 differs from pull request most recent head 240dfd5. Consider uploading reports for the commit 240dfd5 to get more accurate results
Additional details and impacted files
@@ Coverage Diff @@
## develop #993 +/- ##
===========================================
- Coverage 81.81% 81.78% -0.03%
===========================================
Files 98 98
Lines 9962 9949 -13
===========================================
- Hits 8150 8137 -13
Misses 1319 1319
Partials 493 493
:umbrella: View full report in Codecov by Sentry.
Could you add perf flame graphs and benchmark data from before and after the revert?
Flame graph
before:
after:
wrk benchmark
==========================================
hertz before
==========================================
Running 1m test @ http://localhost:3000/sse
8 threads and 512 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 31.14ms 42.08ms 788.81ms 97.67%
Req/Sec 2.40k 375.39 6.64k 84.98%
Latency Distribution
50% 25.94ms
75% 27.11ms
90% 28.86ms
99% 211.20ms
1151229 requests in 1.00m, 1.34GB read
Socket errors: connect 0, read 724, write 0, timeout 512
Requests/sec: 19217.84
Transfer/sec: 22.93MB
==========================================
hertz after
==========================================
Running 1m test @ http://localhost:3000/sse
8 threads and 512 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 25.49ms 1.89ms 105.93ms 79.05%
Req/Sec 2.52k 160.92 3.40k 79.71%
Latency Distribution
50% 25.57ms
75% 26.40ms
90% 27.33ms
99% 29.93ms
1203313 requests in 1.00m, 1.37GB read
Socket errors: connect 0, read 693, write 0, timeout 0
Requests/sec: 20037.59
Transfer/sec: 23.30MB
@welkeyever The improvement is significant.
benchmark @Skyenought
=============================================
after
=============================================
goos: linux
goarch: amd64
pkg: github.com/cloudwego/hertz/pkg/protocol/http1/resp
cpu: Intel(R) Core(TM) i5-1035G1 CPU @ 1.00GHz
BenchmarkSth
BenchmarkSth-8 131325 9104 ns/op
PASS
==============================================
before
==============================================
goos: linux
goarch: amd64
pkg: github.com/cloudwego/hertz/pkg/protocol/http1/resp
cpu: Intel(R) Core(TM) i5-1035G1 CPU @ 1.00GHz
BenchmarkSth
BenchmarkSth-8 130406 12306 ns/op
PASS
Thanks, I'll take a look later. The impact should be limited to scenarios that use the built-in NewChunkedBodyWriter to hijack the response body writer for on-demand flushing.
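For context, the pattern this PR reverts can be sketched roughly as follows (type and function names here are illustrative, not the actual hertz source): a sync.Pool recycles writers, and runtime.SetFinalizer is attached as a safety net so a writer that is never released still finds its way back to the pool.

```go
package main

import (
	"fmt"
	"runtime"
	"sync"
)

// chunkedWriter stands in for the pooled writer; the real hertz type
// and field layout differ.
type chunkedWriter struct {
	wroteHeader bool
}

var writerPool = sync.Pool{New: func() any { return new(chunkedWriter) }}

// acquireWriter takes a writer from the pool and attaches a finalizer
// as a safety net: if the caller forgets to release the writer, the GC
// eventually returns it to the pool. This is exactly the step that
// makes every writer far more expensive to allocate and collect.
func acquireWriter() *chunkedWriter {
	w := writerPool.Get().(*chunkedWriter)
	runtime.SetFinalizer(w, func(w *chunkedWriter) { writerPool.Put(w) })
	return w
}

// releaseWriter is the intended fast path: clear the finalizer, reset
// state, and put the writer back explicitly.
func releaseWriter(w *chunkedWriter) {
	runtime.SetFinalizer(w, nil)
	w.wroteHeader = false
	writerPool.Put(w)
}

func main() {
	w := acquireWriter()
	w.wroteHeader = true
	releaseWriter(w)
	fmt.Println("writer recycled through the pool")
}
```

The revert drops both the pool and the finalizer and simply allocates a fresh writer per response, which the benchmarks above suggest is cheaper overall.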
The diff is hard to read; could you rebase and squash it into a single commit?
I ran the sse quickstart example locally with wrk, two rounds each for with-pool vs. no-pool; pooling does not seem to have a noticeable effect on latency:
env: go version go1.18.6 darwin/arm64, CPU load during the test at 80-90%
I'll double-check later with CPU sampling and a more careful CPU-load comparison.
Under overload the conclusion is similar:
@Skyenought could you share the benchmark script you used above? I'll try it again locally.
sse.zip — the script is in example/server/perf; if your results don't differ, it may be an issue with my machine.
Under overload the conclusion is similar:
So pooling looks like a slight regression?
Running 15s test @ http://localhost:8888/sse
8 threads and 512 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 7.33ms 2.89ms 80.33ms 84.91%
Req/Sec 4.09k 729.42 13.61k 73.92%
Latency Distribution
50% 7.52ms
75% 8.36ms
90% 9.09ms
99% 17.72ms
490053 requests in 15.08s, 444.93MB read
Socket errors: connect 270, read 109, write 0, timeout 0
Requests/sec: 32499.88
Transfer/sec: 29.51MB
=====================================
hertz version: with_pool
=====================================
Running 30s test @ http://localhost:8888/sse
8 threads and 512 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 7.20ms 2.81ms 125.41ms 85.01%
Req/Sec 4.13k 2.67k 12.25k 63.47%
Latency Distribution
50% 7.21ms
75% 8.04ms
90% 9.00ms
99% 17.83ms
988571 requests in 30.09s, 0.88GB read
Socket errors: connect 270, read 0, write 0, timeout 0
Requests/sec: 32855.03
Transfer/sec: 29.83MB
Running 15s test @ http://localhost:8888/sse
8 threads and 512 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 7.27ms 2.86ms 61.19ms 83.61%
Req/Sec 4.14k 664.75 10.79k 71.17%
Latency Distribution
50% 7.44ms
75% 8.33ms
90% 9.13ms
99% 17.45ms
495576 requests in 15.07s, 449.94MB read
Socket errors: connect 269, read 110, write 0, timeout 0
Requests/sec: 32888.66
Transfer/sec: 29.86MB
=====================================
hertz version: no_pool
=====================================
Running 30s test @ http://localhost:8888/sse
8 threads and 512 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 7.34ms 3.52ms 80.89ms 86.15%
Req/Sec 4.09k 1.90k 16.99k 73.48%
Latency Distribution
50% 7.30ms
75% 8.28ms
90% 9.41ms
99% 21.92ms
980559 requests in 30.10s, 0.87GB read
Socket errors: connect 269, read 34, write 0, timeout 0
Requests/sec: 32574.44
Transfer/sec: 29.57MB
Looks like there's indeed no real difference.
P.S. @Skyenought your script could add a kill-pid step after the run finishes; if you forget to kill the server manually, the next run actually benchmarks the previous server 🤓 FYI:
#!/bin/bash
rm *.out
rm perf_server
go build -o perf_server main.go
echo "perf_server started"
nohup ./perf_server &
s_pid=$!
echo "pid of perf_server is ${s_pid}"
sleep 1
echo "warmup"
wrk -H "Connection: keep-alive" --latency -c 512 -d 15 --timeout 8 -t 8 http://localhost:8888/sse >> result.txt
sleep 5
echo " " >> result.txt
echo " ===================================== " >> result.txt
echo " hertz version: $1" >> result.txt
echo " ===================================== " >> result.txt
wrk -H "Connection: keep-alive" --latency -c 512 -d 30 --timeout 8 -t 8 http://localhost:8888/sse >> result.txt
kill $s_pid
echo "finished testing"
If there's no problem then, please close this PR.
Still verifying; let's keep it open for now.


