
Support S3 transfers using ProcessPoolExecutor with s3transfer

Open · kimsehwan96 opened this issue 1 year ago · 4 comments

Describe the feature

As far as I know, aws-cli uses s3transfer for S3 operations, and s3transfer uses a ThreadPoolExecutor, as seen here: https://github.com/boto/s3transfer/blob/da68b50bb5a6b0c342ad0d87f9b1f80ab81dffce/s3transfer/futures.py#L402-L403

In some environments, with plenty of available network bandwidth, plenty of CPU cores, and many files to download, using a ProcessPoolExecutor would be better. And s3transfer already implements an interface that uses a ProcessPoolExecutor:

https://github.com/boto/s3transfer/blob/develop/s3transfer/processpool.py

So I think it would be better to add a feature flag that lets the user choose between threads and processes for S3 transfers.
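For reference, a minimal sketch of how that existing interface can be driven directly today, using the ProcessPoolDownloader and ProcessTransferConfig classes from the s3transfer/processpool.py module linked above (the bucket, key, and filename are hypothetical placeholders):

```python
# Minimal sketch: downloading via s3transfer's process-pool interface.
# Bucket, key, and filename below are hypothetical placeholders.
from s3transfer.processpool import ProcessPoolDownloader, ProcessTransferConfig

def main():
    config = ProcessTransferConfig(
        max_request_processes=16,  # number of worker processes
    )
    # The downloader spawns worker processes and fans GetObject
    # requests out across them.
    with ProcessPoolDownloader(config=config) as downloader:
        future = downloader.download_file(
            'my-bucket', 'data/large-object', 'large-object'
        )
        future.result()  # block until the download completes

if __name__ == '__main__':
    # Worker processes are spawned, so keep the entry point guarded.
    main()
```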

Use Case

If we have to download many files and the environment running aws-cli has plenty of resources (CPU, memory, network bandwidth), we could choose to use more CPU cores to boost S3 throughput.

Proposed Solution

No response

Other Information

No response

Acknowledgements

  • [ ] I may be able to implement this feature request
  • [ ] This feature might incur a breaking change

CLI version used

2.15.30

Environment details (OS name and version, etc.)

Amazon Linux 2023

kimsehwan96 avatar Aug 07 '24 03:08 kimsehwan96

Thanks for the feature request, we can review with the team. In the meantime can you provide any more details on your use case and the results you're seeing? Have you tried setting any of the S3 configurations documented here to optimize downloads: https://awscli.amazonaws.com/v2/documentation/api/latest/topic/s3-config.html ?
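(For reference, those settings live under the `s3` section of `~/.aws/config`; the values below are illustrative, not recommendations:)

```ini
# ~/.aws/config -- illustrative values, tune for your workload
[default]
s3 =
  max_concurrent_requests = 64
  max_queue_size = 10000
  multipart_threshold = 64MB
  multipart_chunksize = 16MB
```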

tim-finnigan avatar Aug 14 '24 20:08 tim-finnigan

@tim-finnigan

I reviewed the documentation and tried increasing the max_concurrent_requests to improve performance.

For example, I tested this on a c7g.16xlarge instance, which has a network interface capable of 30Gbps bandwidth. I set max_concurrent_requests to 64, matching the number of vCPUs, but the download speed didn’t improve as much as I expected.

Since s3transfer uses ThreadPoolExecutor by default, it might be helpful to give users the option to use ProcessPoolExecutor. This way, users with more CPU resources available could potentially speed up their downloads.

In my tests, using ProcessPoolExecutor for parallel downloads from S3 with boto3, I was almost able to fully use the 30Gbps bandwidth—something that wasn’t possible with ThreadPoolExecutor.
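Roughly, my test was shaped like the sketch below: one boto3 client per worker process, with keys fanned out across processes. The bucket name, prefix, paths, and worker count here are placeholders, not my exact benchmark setup:

```python
# Rough sketch of the benchmark: one boto3 client per worker process.
# Bucket name, prefix, and local paths are placeholders.
import os
from concurrent.futures import ProcessPoolExecutor

import boto3

BUCKET = 'my-test-bucket'  # placeholder
PREFIX = 'data/'           # placeholder

def download_one(key):
    # Each worker process builds its own client; clients must not be
    # shared across processes.
    s3 = boto3.client('s3')
    dest = os.path.join('/tmp', os.path.basename(key))
    s3.download_file(BUCKET, key, dest)
    return key

if __name__ == '__main__':
    s3 = boto3.client('s3')
    keys = [
        obj['Key']
        for page in s3.get_paginator('list_objects_v2').paginate(
            Bucket=BUCKET, Prefix=PREFIX
        )
        for obj in page.get('Contents', [])
        if not obj['Key'].endswith('/')
    ]
    # 64 workers to match the vCPU count on the c7g.16xlarge.
    with ProcessPoolExecutor(max_workers=64) as pool:
        for done in pool.map(download_one, keys):
            print('downloaded', done)
```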

I think adding an option for ProcessPoolExecutor could help achieve download speeds similar to tools like s5cmd.

kimsehwan96 avatar Aug 15 '24 08:08 kimsehwan96

Hello @kimsehwan96, thanks for reaching out. For more information: how are you running the test? Did you make any modifications, such as modifying the SDK? Do you have any test results, logs, debug logs, and steps for replication? Thanks

adev-code avatar Jun 13 '25 20:06 adev-code

@adev-code Hello, I tested it without modifying the boto3 SDK code; I just wrote some Python code using boto3 and ProcessPoolExecutor. (Here is my article about it, written in Korean: https://www.kimsehwan96.com/s3-donwload-sync-async-thread-process-perfomance-comparision/)

There may be something I tested incorrectly, though.

Also, I tried changing BoundedExecutor's EXECUTOR_CLS (https://github.com/boto/s3transfer/blob/9a168299c932077e665a618bfa5e2d5e39343745/s3transfer/futures.py#L406-L411) to ProcessPoolExecutor, but it did not work. (I tested it about a year ago.)

kimsehwan96 avatar Jun 14 '25 12:06 kimsehwan96

@kimsehwan96, thanks for the reply. Would you have a minimal reproducible example?

adev-code avatar Jun 24 '25 20:06 adev-code

@adev-code I changed EXECUTOR_CLS to futures.ProcessPoolExecutor in s3transfer/futures.py to test it in a local environment (I installed boto3 and edited the local s3transfer dependency code), but it doesn't work. (https://github.com/boto/s3transfer/blob/f4341ad2ce17f4253de60d4678b97aabfd9d2b81/s3transfer/futures.py#L425-L426)
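Roughly, the change was equivalent to the patch below (a reconstruction, not my exact diff). My understanding is that it fails because the task objects s3transfer submits carry the client, locks, and coordinator state, none of which can be pickled across process boundaries:

```python
# Reconstruction of the experiment: forcing s3transfer's BoundedExecutor
# to use processes instead of threads. EXECUTOR_CLS is the real class
# attribute (ThreadPoolExecutor by default).
from concurrent import futures

import s3transfer.futures

s3transfer.futures.BoundedExecutor.EXECUTOR_CLS = futures.ProcessPoolExecutor

# Any transfer after this point tries to submit s3transfer's task objects
# to worker processes; pickling them fails, so downloads error out.
```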

I think s3transfer/processpool.py (https://github.com/boto/s3transfer/blob/develop/s3transfer/processpool.py) is already implemented, but it is not used by the boto3 S3 code paths. So if someone wants to use multiple processes to speed up S3 downloads, I think it would be better to offer a flag that enables this multi-process download feature.

kimsehwan96 avatar Jun 25 '25 06:06 kimsehwan96