Response.content iterates in needlessly small chunks
Response.content iterates over the response data in chunks of 10240 bytes. The number 10240 was set in commit 62d2ea8.
After tracing the source code of urllib3 and httplib, I can’t see a reason for this behavior. It all ultimately goes through httplib’s HTTPResponse.readinto, which automatically limits the read size according to Content-Length or the chunked framing.
Therefore, it seems that, if you simply set CONTENT_CHUNK_SIZE to a much larger number (like 10240000), nothing should change, except Response.content will become more efficient on large responses.
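For context, here is a minimal model of the pattern behind `Response.content` (a sketch of the `b"".join(iter_content(CONTENT_CHUNK_SIZE))` idea, not the actual requests source). With a 10 KiB chunk size, every ~10 KiB of body means another round trip through urllib3 and httplib:

```python
import io

CONTENT_CHUNK_SIZE = 10 * 1024  # the 10240-byte constant under discussion


def read_all(raw, chunk_size=CONTENT_CHUNK_SIZE):
    """Mimic the b"".join(iter_content(...)) pattern behind Response.content."""
    def iter_content():
        while True:
            chunk = raw.read(chunk_size)  # each call is a full trip through the stack
            if not chunk:
                return
            yield chunk
    return b"".join(iter_content())


# A 5 MiB body read 10 KiB at a time means ~512 separate read calls.
body = io.BytesIO(b"x" * (5 * 1024 * 1024))
data = read_all(body)
print(len(data), len(data) // CONTENT_CHUNK_SIZE)  # 5242880 512
```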
Update: it seems like httplib allocates a buffer of the requested size (to be read into), so simply setting CONTENT_CHUNK_SIZE to a large value will cause large chunks of memory to be allocated, which is probably a bad idea.
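For reference, this is roughly the shape of what httplib does for a bounded read (a simplified sketch, not the stdlib source); the upfront bytearray allocation is why a huge CONTENT_CHUNK_SIZE means a correspondingly huge buffer even for a tiny response:

```python
import io
import sys

# Simplified model of httplib's bounded read: a buffer of the full requested
# size is allocated up front and then filled (possibly only partially) by readinto().
def bounded_read(fp, amt):
    buf = bytearray(amt)                  # allocates `amt` bytes immediately
    n = fp.readinto(buf)                  # may fill far fewer bytes
    return memoryview(buf)[:n].tobytes()

print(sys.getsizeof(bytearray(10_240_000)))            # roughly 10 MB allocated up front
print(bounded_read(io.BytesIO(b"hello"), 10_240_000))  # b'hello'
```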
This is not a problem for me and I have not researched it thoroughly. I’m filing this issue after investigating a Stack Overflow question where this caused an unexpected slowdown for the poster, and a subsequent IRC exchange with @Lukasa. Feel free to do (or not do) whatever you think is right here.
It's good to know that httplib allocates a buffer of that size. I think we can probably stretch to double that buffer though: 20kb of buffer is unlikely to be the end of the world.
At the very least, though, we should understand how this works so that we can write documentation to explain this.
Originally, I iterated over a chunk size of 1 :)
While we're on the topic, we have four different default chunk_sizes between all of our iterator functions in Requests. Some I can find reasoning for (CONTENT_CHUNK_SIZE vs. ITER_CHUNK_SIZE), but others, like __iter__ and the default for iter_content, aren't entirely clear.
I'm not saying these are wrong, just curious why __iter__ is declared as 128 instead of ITER_CHUNK_SIZE and if there's a reason for still having such a default of 1 on iter_content. Is it related to blocking or file-objects not returning without a full read?
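For anyone running into this today, the default can be overridden per call; a minimal example (the URL and filename are placeholders):

```python
import requests

# Placeholder URL; stream=True defers reading the body so iter_content controls chunking.
resp = requests.get("https://example.com/large-file.bin", stream=True)

with open("large-file.bin", "wb") as f:
    # Explicit 64 KiB chunks instead of the small library defaults.
    for chunk in resp.iter_content(chunk_size=64 * 1024):
        f.write(chunk)
```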
There is a long, long issue to look at in the backlog. Anyone wanting to make progress on this needs to read and understand #844. Safe to say this is not a good choice for someone who isn't prepared for a really tough slog.
Just a ping back from the pip project on this 12-year-old bug. :)
iter_content() was set to 10240 bytes 12 years ago in requests. It's a needlessly small size and incurs a lot of overhead. In the linked bug ticket, a real bug in pip, 30% of the time pip spent downloading packages was pure overhead from using this default chunk size.
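A rough way to see that overhead for yourself; the URL is a placeholder, and the exact numbers depend on the network, the server, and the Python version:

```python
import time

import requests

URL = "https://example.com/some-large-package.whl"  # placeholder

for chunk_size in (10 * 1024, 64 * 1024, 256 * 1024):
    resp = requests.get(URL, stream=True)
    start = time.perf_counter()
    total = sum(len(chunk) for chunk in resp.iter_content(chunk_size=chunk_size))
    elapsed = time.perf_counter() - start
    print(f"chunk_size={chunk_size:>6}: {total} bytes in {elapsed:.2f}s")
```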
I'm quite curious whether there is any reason that prevents updating CONTENT_CHUNK_SIZE to something more reasonable nowadays?
64k, 128k, or 256k would be reasonable values for I/O.
On Linux, the default network read buffer was increased to 64k in kernel v4.20 (2018); historically, the read and write buffers were 16k before that. (They're resized dynamically with the TCP window, up to 4MB for write and 6MB for read, but let's not get into TCP window sizing; see sysctl_tcp_rmem and sysctl_tcp_wmem.) Linux code: https://github.com/torvalds/linux/blame/master/net/ipv4/tcp.c#L4775, commit from Sep 2018: https://github.com/torvalds/linux/commit/a337531b942bd8a03e7052444d7e36972aac2d92
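For comparison, the kernel's current TCP buffer sizing can be checked at runtime; a small Linux-only sketch reading the standard procfs entries (values are min/default/max in bytes):

```python
# Linux-only: print the kernel's TCP receive/send buffer sizing.
for name in ("tcp_rmem", "tcp_wmem"):
    with open(f"/proc/sys/net/ipv4/{name}") as f:
        minimum, default, maximum = f.read().split()
    print(f"{name}: min={minimum} default={default} max={maximum}")
```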