Speedup aiohttp web server
The typical web handler looks like:
async def handler(request):
    ...  # do business logic
    return web.Response(text=..., content_type=...)
Unfortunately, constructing `web.Response` is relatively slow, and `resp.prepare()` is even slower.
The code has too many checks: for `content_type` overlapping with `headers['Content-Type']`, response length overlapping with `headers['Content-Length']`, HTTP chunked processing, and so on.
We can speed things up if we assume that `web.Response` is constructed from a single, well-known piece of text or binary data.
The suggestion is:
- Introduce `web.text_response()` and `web.bin_response()`, like the already present `web.json_response()`. As a first step these functions will just create `web.Response` instances; the main point is teaching people to use the new API.
- `web.Response` should be kept as is for backward compatibility.
- After that we can refactor our response class hierarchy by making a `BaseResponse` the parent of `StreamResponse` and deriving private classes for binary and text responses from `BaseResponse`. Obviously `BaseResponse.prepare()` should exist; I'm not sure about other API attributes.
- These private classes can do as little work as possible; they don't need to pass redundant checks for conflicting HTTP headers.
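The first step could look roughly like this (a hypothetical sketch; `web.text_response()` / `web.bin_response()` don't exist yet, so the names, signatures, and defaults here are assumptions):

```python
# Hypothetical sketch of the proposed first-step helpers: thin wrappers
# that just construct web.Response, so callers adopt the new API now and
# transparently get the fast private classes in a later step.
from aiohttp import web

def text_response(text, *, status=200, content_type="text/plain"):
    # No fast path yet; this only establishes the API surface.
    return web.Response(text=text, status=status, content_type=content_type)

def bin_response(body, *, status=200, content_type="application/octet-stream"):
    return web.Response(body=body, status=status, content_type=content_type)
```

Handlers would then write `return text_response("hello")` instead of spelling out the `web.Response` arguments each time.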
We should forbid the following HTTP headers in `web.text_response()` / `web.bin_response()`:
- Connection
- Keep-Alive
- Trailer
- Transfer-Encoding
- Content-Length
- Upgrade
The check for headers is quite expensive, so let's do it in debug mode only (`loop.get_debug() is True` or `PYTHONASYNCIODEBUG=1`).
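A sketch of how the debug-only check could look (assumed names, not actual aiohttp code; in real code the `debug` flag would come from `loop.get_debug()` or `PYTHONASYNCIODEBUG=1`):

```python
# Validate user-supplied headers only when debug mode is on;
# FORBIDDEN mirrors the list above.
FORBIDDEN = frozenset({
    "connection", "keep-alive", "trailer",
    "transfer-encoding", "content-length", "upgrade",
})

def check_headers(headers, *, debug):
    if not debug:
        return  # skip the expensive check on the production fast path
    bad = sorted(name for name in headers if name.lower() in FORBIDDEN)
    if bad:
        raise ValueError(
            f"Headers not allowed in text/bin responses: {', '.join(bad)}")
```

In production the function returns immediately, so the per-request cost of validation drops to a single boolean test.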
That's it.
Discussion is welcome.
PR for the first step (adding `web.text_response()` and `web.bin_response()`) is even more welcome.
Also, the compression and chunking features that are split between `StreamResponse` and `StreamWriter` don't help.
Allowing a clean happy path for a few ad-hoc cases (surely the most used ones) such as `json_response`, `text_response`, and `bin_response`, each with a restricted set of parameters and an error-proof interface, as you mentioned, will help speed up the response part.
I'm aligned with you, but I would first try to speed up `json_response`, taking into account that it already has its own function. Hence, it's a matter of starting work on a new hierarchy of classes and, as a bonus, later moving compression and chunking entirely into the `StreamResponse` class.
I think it's OK to ignore any content-length and transfer-encoding settings for these responses, and to always use the content-type specified by the user, if there is one.
But I don't quite understand the speedup; I thought building the headers was always the slower part...
"the check for headers is quite expensive" <- I don't understand. It's just some comparison work, and that's fast. Looking up in a multidict is also fast.
`response.prepare()` (and `response._start()`) take much longer for simple responses than they should.
I've created a gist demonstrating my steps to get profiling results: https://gist.github.com/asvetlov/13295efdc1301bea241d7d35496e6f81
I hope you can reproduce the snakeviz output by following the gist instructions (a static snapshot alone is not informative because of the interactive nature of the tool).
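If snakeviz isn't available, the same kind of profile can also be inspected in plain text with the stdlib `pstats` module. A generic sketch (not the gist's exact commands; `busy_work` is a stand-in for a request handler):

```python
# Profile a callable with cProfile and print the top entries as text,
# as an alternative to snakeviz's interactive view.
import cProfile
import io
import pstats

def busy_work():
    # Stand-in for a request handler; profile any callable the same way.
    return sum(i * i for i in range(10_000))

profiler = cProfile.Profile()
profiler.enable()
busy_work()
profiler.disable()

out = io.StringIO()
pstats.Stats(profiler, stream=out).sort_stats("cumulative").print_stats(5)
report = out.getvalue()
print(report)
```

The report lists cumulative time per function, which is what points at hot spots like `StreamResponse._start()` below.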
Actually it points to StreamResponse._start() and StreamResponse.content_length.
StreamWriter.write_headers() should be optimized too.
P.S. Multidict could be much faster as well, see https://github.com/aio-libs/multidict/issues/97 But it is "good enough" for now.
Also, `web.Response.__init__` is relatively expensive.
My striving to support both text and binary responses in the same class was a mistake; as a result we have too many checks in the constructor. Just read the code.
How much do they cost compared to a simple JSON encode?
How is json encoding related to the issue?
If it costs far less than json encode, it may not worth it.
Say again, why should I care about JSON encoding cost at all? JSON is not the only possible data structure for an HTTP payload. By that logic, should I also compare response sending with making a request to a database? To which database? Over what network connection? (The same is true for JSON, by the way: encoding cost depends on the incoming data size, types, JSON library, and the flags passed.)
Take it easy... When optimizing, we begin with the slowest part. I have tested the performance of aiohttp before: it is about 4000 qps per CPU core, that is 250 us per request. We must shave off 50 us or more to make a difference. And if I remember correctly, encoding a JSON response takes about 10 us. So I was wondering whether the checks are really a bottleneck now.
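To make the arithmetic explicit (using the numbers quoted above, which describe one particular setup):

```python
# 4000 requests/sec on one core -> per-request time budget in microseconds.
qps = 4000
budget_us = 1_000_000 / qps
print(budget_us)  # 250.0 us per request

# A ~10 us JSON encode is a small slice of that budget, while saving
# 50 us would be a visible improvement.
json_encode_us = 10
print(f"json encode: {json_encode_us / budget_us:.0%} of budget")
print(f"50 us saving: {50 / budget_us:.0%} of budget")
```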
I don't have a testing environment with me until next week. I will look at your gist after that.
@hubo1016 You can always use a third-party fast JSON library like ujson or rapidjson in place of the stdlib one today. That's out of scope here: the overall aiohttp optimizations target the parts you can't replace with something faster yourself today.
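For reference, `web.json_response()` already accepts a `dumps` callable, so swapping the encoder is a drop-in change (ujson is assumed to be installed here; any `json.dumps`-compatible callable works, and the sketch falls back to stdlib `json` otherwise):

```python
# Use a faster JSON encoder with aiohttp without touching handler logic.
from aiohttp import web

try:
    from ujson import dumps as fast_dumps  # third-party, optional
except ImportError:
    from json import dumps as fast_dumps  # stdlib fallback

async def handler(request):
    # Same handler code; only the encoder changes.
    return web.json_response({"ok": True}, dumps=fast_dumps)
```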
@kxpal understood. I am just using it as a benchmark baseline, since different hardware produces very different numbers.
4000 RPS? On my laptop I get 12000 RPS (single thread/process); see the test files/params in the gist above.
8200 RPS on my laptop
@thomaszdxsn python debugger is off, access log is off, uvloop is used?
https://gist.github.com/asvetlov/13295efdc1301bea241d7d35496e6f81
I take this gist, and not use cProfile.
python debugger is off
What do you mean? I know Python has a debugger called pdb.
Sorry, my mistake -- should be "profiler".
Python uses the same low-level tracing API for both debugging and profiling.
The question was about using `python run_aiohttp.py 8080` to run the server (no `-m cProfile` usage).
yes, this is my command:
$ pipenv run python run_aiohttp.py 8888
wrk command:
$ wrk -t 10 http://localhost:8888
output:
Running 10s test @ http://localhost:8888
10 threads and 10 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 1.17ms 358.79us 9.55ms 85.53%
Req/Sec 0.86k 94.33 1.73k 81.85%
85793 requests in 10.10s, 12.35MB read
Requests/sec: 8491.46
Transfer/sec: 1.22MB
another wrk:
$ wrk -t8 -c200 -d30s --latency http://localhost:8888
output:
Running 30s test @ http://localhost:8888
8 threads and 200 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 26.25ms 20.63ms 263.13ms 94.34%
Req/Sec 1.06k 272.09 1.44k 82.98%
Latency Distribution
50% 22.36ms
75% 27.56ms
90% 36.88ms
99% 125.95ms
252024 requests in 30.05s, 36.29MB read
Socket errors: connect 0, read 3, write 0, timeout 0
Requests/sec: 8387.49
Transfer/sec: 1.21MB
my computer:
MacBook Pro 2015
2.7 GHz Intel Core i5
8 GB 1867 MHz DDR3
Maybe the reason is macOS.
@asvetlov ah, I didn't turn off the access log. I just use the default settings.
It makes sense if you are targeting 20k+ qps. Shall we rewrite the whole class in Cython first? That might be easier.
From my experience, the biggest win in terms of performance is related to the ability to run async code in place and return early if the async function does not execute actual async operations, or the async result is available immediately. At the moment aiohttp always runs one task; I'd run the handler as soon as the full HTTP message is available in the `data_received` callback, and start a task only if the handler returns a not-yet-completed future.
That would require some low-level generator code, but the speedup should be significant, especially for benchmarks. We had such code on the client side before 2.0, but I am not sure it would work with async/await.
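The idea can be illustrated with a minimal sketch (assumed names, not aiohttp's actual code); the hard part, resuming an already-suspended coroutine inside a task, is deliberately omitted:

```python
# Step a handler coroutine once: if it completes without suspending on a
# pending operation, return its result with zero task-scheduling overhead.
import asyncio

def run_eagerly(coro):
    try:
        coro.send(None)
    except StopIteration as exc:
        return True, exc.value  # handler finished synchronously
    # The handler really suspended: here the real code would wrap the
    # partially-run coroutine in a task (the tricky part, omitted).
    coro.close()
    return False, None

async def fast_handler(request):
    return "hello"  # completes without awaiting anything pending

async def slow_handler(request):
    await asyncio.sleep(0)  # forces a real suspension
    return "hello"

print(run_eagerly(fast_handler(None)))
print(run_eagerly(slow_handler(None)))
```

Python 3.12 later shipped the same idea as `asyncio.eager_task_factory`, which runs a coroutine synchronously up to its first suspension before scheduling a task.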
Good point
Interesting, @fafhrd91! It would be nice to see whether a naive code path, where a completed response needs no loop iteration, exists in the current implementation; there are plenty of awaits that might be breaking that execution.
It would be nice to implement HTTP/2; that would give a really good performance boost in real-world apps. Since HTTP/2 has HPACK header compression, it is possible to cache header parsing, validation, and all the other related work. That would give a blasting performance boost, and the top limit of uvloop's TCP performance (100k RPS) could nearly be reached, since HTTP/2 is a binary protocol. And that's not all: almost all real-world apps use sessions, no matter what kind (KV, RDBMS, JWT); they all need some CPU work, and this work can be cached too. And there is a ready implementation of HTTP/2 with Python bindings, which would not be hard to integrate.
Are you volunteering for the task?
Maybe. If it becomes a requirement for my current project (we have some performance issues against heavy application-level DDoS). This task is interesting to me, but I can't give any promises about my part in it, because there are always factors that may break promises.
There has been a lot of improvement, but there is still a lot that could perform better
Without profiler:
% wrk -t 10 http://localhost:8080
Running 10s test @ http://localhost:8080
10 threads and 10 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 237.23us 22.55us 1.12ms 84.31%
Req/Sec 4.20k 56.62 4.54k 78.81%
421797 requests in 10.10s, 63.56MB read
Requests/sec: 41763.44
Transfer/sec: 6.29MB
With profiler:
% wrk -t 10 http://localhost:8080
Running 10s test @ http://localhost:8080
10 threads and 10 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 454.83us 34.72us 1.99ms 86.39%
Req/Sec 2.20k 32.69 2.32k 75.64%
220865 requests in 10.10s, 33.28MB read
Requests/sec: 21867.62
Transfer/sec: 3.30MB
I didn't expect add_app to be called every request
> I didn't expect add_app to be called every request
I'm guessing that's UrlMappingMatchInfo.add_app(), which would make sense, because you get a new match info for each request and you want the app available on there. It just sets it as the current_app and adds it to an empty list (when routing through sub apps, then I think you end up with a list representing the chain of apps).
3.8.6 Requests/sec: 47324.70 Transfer/sec: 6.86MB
3.9.5 Requests/sec: 45880.58 Transfer/sec: 6.65MB
3.10.5 Requests/sec: 42387.28 Transfer/sec: 6.18MB
3.10 Requests/sec: 41336.35 Transfer/sec: 6.23MB