Client timeout includes connection pool queue wait time
Describe the bug
The issue lies in how aiohttp handles timeouts when using ClientSession and connection pooling. Currently, the timeout starts counting from the moment the coroutine session.get() is called, not from when the connection is acquired from the connection pool. This means that the time spent waiting in the connection pool queue is also included in the timeout calculation, leading to unexpected TimeoutError exceptions even if the actual request could be completed within the specified timeout.
This behavior can cause issues when the connection pool is overloaded with many simultaneous requests, as clients experience timeouts primarily due to delays in acquiring a connection rather than delays in the actual request processing.
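For intuition, here is a back-of-the-envelope sketch (my assumptions: aiohttp's default connector limit of 100 connections and the 1-second handler from the reproduction below) that predicts the failure count seen in the logs:

n, limit, handle, total = 1000, 100, 1.0, 5.0
# Request i sits in the pool queue for roughly (i // limit) * handle seconds,
# then spends `handle` seconds on the wire; it times out if the sum exceeds `total`.
print(sum(1 for i in range(n) if (i // limit) * handle + handle > total))  # 500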
To Reproduce
- Use the following server.py and client.py to replicate the issue.
- Run the server first and then execute the client.
- Observe the output and note the high count of TimeoutError exceptions.
server.py
import asyncio

from aiohttp import web


async def main(_):
    await asyncio.sleep(1)
    return web.Response()


app = web.Application()
app.add_routes([web.get('/', main)])
web.run_app(app)
client.py
import asyncio
import time
import statistics as stat

import aiohttp


async def request(session) -> tuple[bool | None, float]:
    timeout_error = None
    start_time = time.time()
    try:
        async with session.get('/'):
            pass
        timeout_error = False
    except TimeoutError:
        timeout_error = True
    finally:
        end_time = time.time()
    duration = end_time - start_time
    return timeout_error, duration


async def main(n: int = 1000):
    timeout = aiohttp.ClientTimeout(total=5)
    async with aiohttp.ClientSession('http://127.0.0.1:8080', timeout=timeout) as session:
        async with asyncio.TaskGroup() as tg:
            tasks = [tg.create_task(request(session)) for _ in range(n)]
        return [task.result() for task in tasks]


if __name__ == '__main__':
    results: list[tuple[bool | None, float]] = asyncio.run(main())
    assert len(results) == len([x for x in results if x[0] is not None])
    durations = [x[1] for x in results]
    print(f'Requests count: {len(results)}')
    print(f'Timeout Error count: {len([x for x in results if x[0] is True])}')
    print(f'First request time: {durations[0]}')
    print(f'Last request time: {durations[-1]}')
    print(f'Median request time: {stat.median(durations)}')
Run the server:
python server.py
Then run the client:
python client.py
Despite setting the timeout to 5 seconds, many requests fail with TimeoutError. The observed behavior indicates that the timeout starts before acquiring a connection from the pool, rather than only including the time for the actual request.
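One way to confirm that the time is going into the pool queue is aiohttp's client tracing, which exposes signals around connection-pool queueing. A minimal sketch against the same server (the 300-request count and the handler names are my own):

import asyncio
import time

import aiohttp


async def on_queued_start(session, ctx, params):
    # Fires only when a request has to wait for a free pooled connection.
    ctx.queued_at = time.monotonic()


async def on_queued_end(session, ctx, params):
    # Fires once a connection is acquired from the pool.
    print(f'pool queue wait: {time.monotonic() - ctx.queued_at:.2f}s')


async def fetch(session):
    try:
        async with session.get('/'):
            pass
    except TimeoutError:
        pass


async def main():
    tc = aiohttp.TraceConfig()
    tc.on_connection_queued_start.append(on_queued_start)
    tc.on_connection_queued_end.append(on_queued_end)
    timeout = aiohttp.ClientTimeout(total=5)
    async with aiohttp.ClientSession('http://127.0.0.1:8080',
                                     timeout=timeout,
                                     trace_configs=[tc]) as session:
        async with asyncio.TaskGroup() as tg:
            for _ in range(300):
                tg.create_task(fetch(session))


asyncio.run(main())

The printed queue waits grow with concurrency even though each request spends only about one second on the wire.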
Expected behavior
All requests should complete successfully without a TimeoutError. The timeout should start counting only after a connection has been acquired from the connection pool, so that the specified timeout applies exclusively to the actual request-processing time. This makes the timeout behavior predictable and intuitive.
Logs/tracebacks
$ python client.py
Requests count: 1000
Timeout Error count: 500
First request time: 1.070129156112671
Last request time: 5.747848749160767
Median request time: 5.41119110584259
Python Version
3.13.0
aiohttp Version
3.11.11
multidict Version
6.1.0
propcache Version
0.2.1
yarl Version
1.18.3
OS
Debian GNU/Linux 11 (bullseye)
Related component
Client
Additional context
No response
Code of Conduct
- [X] I agree to follow the aio-libs Code of Conduct
I'm under the impression that this is the intended design of the total timeout, i.e. it is the total time of everything.
> I'm under the impression that this is the intended design of the total timeout, i.e. it is the total time of everything.
- First of all, this is not reflected in the documentation. Here's what the documentation says about the total parameter: "The maximal number of seconds for the whole operation including connection establishment, request sending and response reading." Connection establishment, request sending and response reading: all of this is logical and intuitively clear, and it's exactly what you expect when setting a timeout for a request, but nothing is mentioned about the time spent waiting for a connection from the pool. This gives the impression that it's not "intended design" after all.
- Secondly, including the time spent waiting in the queue for a connection from the pool in the timeout is very strange logic. When I set a timeout, I want to limit the request execution time "outside of the code" (network operations and the time spent processing the request on the server side). With the current implementation, the first thing that comes to mind when encountering a TimeoutError is that the problem is on the server side, that the server is taking too long to process the request, but that's not the case. This is confusing. Also, with the current implementation, it's impossible to properly limit request execution time under a high number of concurrent requests without adding extra logic to the user code, whereas this functionality should work "out of the box" and is intuitively expected from a feature like a client timeout.
So all of this seems more like a bug than "intended design."
It could be clearer, but it does say "for the whole operation". I think it'd probably make sense to change it to behave the way you expected though, as an overall timeout can be easily achieved with an external asyncio.timeout() call from the user.
It may not be as trivial as you think to change the code though, so feel free to take a look and submit a PR if it is easy enough (unless anyone else has any objections).
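For reference, a minimal sketch of that external approach (assuming Python 3.11+ for asyncio.timeout()); the cap then covers everything, pool queue wait included, just as total does today:

import asyncio

import aiohttp


async def fetch(session: aiohttp.ClientSession) -> int:
    # Overall 5-second cap applied in user code, including any time
    # spent waiting for a connection from the pool.
    async with asyncio.timeout(5):
        async with session.get('/') as resp:
            return resp.status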
@Dreamsorcerer
> It could be clearer, but it does say "for the whole operation".
"for the whole operation," in the current implementation, effectively means a timeout for a group of concurrent requests (as in the example I provided) within a single session. This contradicts the idea of being able to set a timeout for an individual request. If I needed to limit the execution time of a specific part of the code (e.g., a group of requests), I could easily do that myself in the user code. In this case, it happens against my will when I’m trying to solve an entirely different task, namely, to limit the execution time of an individual request.
I found an old issue #3203 where this problem is mentioned. It seems that @asvetlov acknowledges that this is confusing and plans to fix this behavior:
asvetlov commented on Oct 19, 2019:
> The timeout is per-request already, but it counts the time of waiting for a connection from the pool. You can configure the pool to unlimited size. Or you can use ClientTimeout(total=None, sock_connect=5, sock_read=5).
> I found the existing timeout behavior confusing; I want to fix it in newer aiohttp versions. For now, you can use the workarounds provided above.
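Both of the workarounds asvetlov mentions are easy to apply today. A minimal sketch (the base URL is taken from the reproduction above):

import asyncio

import aiohttp


async def main():
    # Workaround 1: an unlimited pool, so requests never queue
    # (limit=0 disables the connection limit).
    connector = aiohttp.TCPConnector(limit=0)
    # Workaround 2: no total timeout; bound only the socket phases,
    # which excludes time spent waiting in the pool queue.
    timeout = aiohttp.ClientTimeout(total=None, sock_connect=5, sock_read=5)
    async with aiohttp.ClientSession('http://127.0.0.1:8080',
                                     connector=connector,
                                     timeout=timeout) as session:
        async with session.get('/'):
            pass


asyncio.run(main())

Either one alone is enough; they are combined here only for brevity.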
@Dreamsorcerer
> It may not be as trivial as you think to change the code though, so feel free to take a look and submit a PR if it is easy enough (unless anyone else has any objections).
Perhaps I will, as I actively use aiohttp and don't see an alternative for myself, and this issue causes me real discomfort. I don't want to rely on workarounds in user code, as they have their downsides; ideally, the timeout should work as expected. Thank you for your involvement.
I agree the current behavior causes confusion.
Changing the total timeout behavior is going to be a significant breaking change as many downstream applications currently expect the request to be cancelled as soon as the total timeout is met regardless of any waits on the connection pool.
If we change the behavior here it could lead to cases where the timeout wait is now orders of magnitude longer because there are no free connections. We shouldn't backport a change like that to 3.x.
An unintended consequence of changing it is that we would likely see downstream code wrapping all requests in asyncio.timeout instead of using the built-in timeout functionality. This would also mean we would need to make sure everything is cancellation safe.
+1 on the confusion, I did not expect this to be the case either.
@bdraco One option would be to make the breaking change in 4.x and backport it to 3.x as a separate property (total_excl_pool_queue?) with a deprecation warning, plus maybe a warning about the future change to total. It is definitely weird to introduce an instantly-deprecated property, but it would allow us to bridge the gap.
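Purely as an illustration of that bridging idea (nothing below exists in aiohttp; total_excl_pool_queue is the hypothetical name from above):

import warnings


class ClientTimeout:
    # Hypothetical subset of aiohttp.ClientTimeout, for illustration only.
    def __init__(self, total=None, total_excl_pool_queue=None):
        if total is not None and total_excl_pool_queue is None:
            warnings.warn(
                'ClientTimeout.total currently includes connection pool '
                'queue wait; a future major release may change this. '
                'Pass total_excl_pool_queue to opt in to the new '
                'behavior early.',
                DeprecationWarning,
                stacklevel=2,
            )
        self.total = total
        self.total_excl_pool_queue = total_excl_pool_queue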
> This would also mean we would need to make sure everything is cancellation safe.
I always assumed this was already the case, as cancellation should pretty much always be a concern when designing any kind of asyncio library. But it might just be that I got yelled at by flake8-async too much :p