Support S3 transfers using ProcessPoolExecutor in s3transfer
Describe the feature
As far as I know, the AWS CLI uses s3transfer for S3 operations, and s3transfer uses a ThreadPoolExecutor, as shown here:
https://github.com/boto/s3transfer/blob/da68b50bb5a6b0c342ad0d87f9b1f80ab81dffce/s3transfer/futures.py#L402-L403
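For reference, the linked lines are essentially the following (paraphrased excerpt; the surrounding class body is omitted):

```python
# s3transfer/futures.py (excerpt): the executor class is hard-coded
# to a thread pool.
class BoundedExecutor:
    EXECUTOR_CLS = futures.ThreadPoolExecutor
```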
In some environments, with plenty of available network bandwidth, enough CPU cores, and many files to download, using a ProcessPoolExecutor would be better.
In fact, s3transfer has already implemented an interface that uses ProcessPoolExecutor:
https://github.com/boto/s3transfer/blob/develop/s3transfer/processpool.py
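For example, assuming the current develop-branch API, that interface can already be used directly like this (the bucket, key, and filename are placeholders):

```python
# Minimal sketch using s3transfer's existing process-pool interface.
from s3transfer.processpool import ProcessPoolDownloader, ProcessTransferConfig

config = ProcessTransferConfig(max_request_processes=8)  # number of worker processes
with ProcessPoolDownloader(config=config) as downloader:
    # download_file() returns a future; exiting the context waits for completion
    downloader.download_file('my-bucket', 'my-key', 'my-file')
```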
So I think it would be useful to add a feature flag that lets the user choose between threads and processes for S3 transfers.
Use Case
If we have to download many files, and the environment running the AWS CLI has enough resources (CPU, memory, network bandwidth), then we could choose to use more CPUs to boost S3 throughput.
Proposed Solution
No response
Other Information
No response
Acknowledgements
- [ ] I may be able to implement this feature request
- [ ] This feature might incur a breaking change
CLI version used
2.15.30
Environment details (OS name and version, etc.)
Amazon Linux 2023
Thanks for the feature request; we can review it with the team. In the meantime, can you provide any more details on your use case and the results you're seeing? Have you tried setting any of the S3 configurations documented here to optimize downloads: https://awscli.amazonaws.com/v2/documentation/api/latest/topic/s3-config.html ?
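For example (these are documented settings; the values here are just illustrative):

```
[default]
s3 =
  max_concurrent_requests = 64
  multipart_chunksize = 16MB
```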
@tim-finnigan
I reviewed the documentation and tried increasing max_concurrent_requests to improve performance.
For example, I tested this on a c7g.16xlarge instance, which has a network interface capable of 30Gbps bandwidth. I set max_concurrent_requests to 64, matching the number of vCPUs, but the download speed didn’t improve as much as I expected.
Since s3transfer uses ThreadPoolExecutor by default, it might be helpful to give users the option to use ProcessPoolExecutor. This way, users with more CPU resources available could potentially speed up their downloads.
In my tests, using ProcessPoolExecutor for parallel downloads from S3 with boto3, I was almost able to fully use the 30Gbps bandwidth—something that wasn’t possible with ThreadPoolExecutor.
I think adding an option for ProcessPoolExecutor could help achieve download speeds similar to tools like s5cmd.
Hello @kimsehwan96, thanks for reaching out. For context, how are you running the test? Have you made any modifications to the SDK? Could you share any test results, logs, debug logs, and steps to reproduce? Thanks
@adev-code Hello, I tested it without modifying the boto3 SDK code; I just wrote some Python code using boto3 and ProcessPoolExecutor. (Here is my article about it, written in Korean): https://www.kimsehwan96.com/s3-donwload-sync-async-thread-process-perfomance-comparision/
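The test was along these lines (a simplified sketch; the bucket name and keys are placeholders):

```python
# Simplified sketch of the benchmark: download many S3 objects with one
# boto3 client per worker process.
import os
from concurrent.futures import ProcessPoolExecutor

import boto3

BUCKET = 'my-test-bucket'  # placeholder

_client = None

def _download(key):
    # boto3 clients are not safely shareable across processes, so create
    # one lazily inside each worker process.
    global _client
    if _client is None:
        _client = boto3.client('s3')
    _client.download_file(BUCKET, key, os.path.basename(key))

if __name__ == '__main__':
    keys = ['data/file-%04d.bin' % i for i in range(256)]  # placeholder keys
    with ProcessPoolExecutor(max_workers=os.cpu_count()) as pool:
        list(pool.map(_download, keys))
```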
It's possible there is something I tested incorrectly.
Also, I tried changing EXECUTOR_CLS in BoundedExecutor (https://github.com/boto/s3transfer/blob/9a168299c932077e665a618bfa5e2d5e39343745/s3transfer/futures.py#L406-L411) to ProcessPoolExecutor, but it did not work. (I tested this about a year ago.)
@kimsehwan96, thanks for the reply. Do you have a minimal reproducible example?
@adev-code I changed EXECUTOR_CLS to futures.ProcessPoolExecutor in s3transfer/futures.py to test it in a local environment (I installed boto3 and modified the local s3transfer dependency code), but it doesn't work. (https://github.com/boto/s3transfer/blob/f4341ad2ce17f4253de60d4678b97aabfd9d2b81/s3transfer/futures.py#L425-L426)
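In other words, the local experiment amounted to this one-line swap (not a supported change, just what I tried):

```python
# s3transfer/futures.py, local experiment only: swap the executor class.
from concurrent import futures

class BoundedExecutor:
    EXECUTOR_CLS = futures.ProcessPoolExecutor  # originally futures.ThreadPoolExecutor
```

My guess (an assumption on my part, not something I've verified in the source) is that this naive swap fails because ProcessPoolExecutor requires submitted callables and their arguments to be picklable, while s3transfer's task objects carry clients, locks, and coordinators that are not.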
I think s3transfer/processpool.py (https://github.com/boto/s3transfer/blob/develop/s3transfer/processpool.py) is already implemented, but it is not used in the boto3 S3 code paths. So if someone wants to use multiple processes to speed up S3 downloads, it would be better to provide a flag to enable that (a multi-process S3 download feature).
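To illustrate, the opt-in could look like an S3 config entry (this setting does not exist today; the name is purely hypothetical):

```
[default]
s3 =
  # hypothetical, not an existing setting
  use_process_pool = true
```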