S3 timing out on CI (tests using LocalStack)
Describe the bug
AWS SDK 2.21.12+ causes our Gitlab CI runner (an AWS t3.xlarge instance) to time out on S3 operations. While the tests usually take 6-10 min, here they take up to 43 minutes before failing the build.
Importantly, everything works perfectly well on a local machine (the entire build takes about 3.5min). The issues only occur on CI. Furthermore, the issue does not occur on version 2.20.63. On 2.20.63, all S3 operations take milliseconds, as expected (even on CI).
Expected Behavior
S3 operations should not time out during CI builds.
Current Behavior
S3 operations time out during CI builds.
software.amazon.awssdk.core.exception.SdkClientException: Unable to execute HTTP request: Read timed out
at software.amazon.awssdk.core.exception.SdkClientException$BuilderImpl.build(SdkClientException.java:111)
at software.amazon.awssdk.core.exception.SdkClientException.create(SdkClientException.java:47)
at software.amazon.awssdk.core.internal.http.pipeline.stages.utils.RetryableStageHelper.setLastException(RetryableStageHelper.java:223)
at software.amazon.awssdk.core.internal.http.pipeline.stages.RetryableStage.execute(RetryableStage.java:83)
at software.amazon.awssdk.core.internal.http.pipeline.stages.RetryableStage.execute(RetryableStage.java:36)
at software.amazon.awssdk.core.internal.http.pipeline.RequestPipelineBuilder$ComposingRequestPipelineStage.execute(RequestPipelineBuilder.java:206)
at software.amazon.awssdk.core.internal.http.StreamManagingStage.execute(StreamManagingStage.java:56)
at software.amazon.awssdk.core.internal.http.StreamManagingStage.execute(StreamManagingStage.java:36)
at software.amazon.awssdk.core.internal.http.pipeline.stages.ApiCallTimeoutTrackingStage.executeWithTimer(ApiCallTimeoutTrackingStage.java:80)
at software.amazon.awssdk.core.internal.http.pipeline.stages.ApiCallTimeoutTrackingStage.execute(ApiCallTimeoutTrackingStage.java:60)
at software.amazon.awssdk.core.internal.http.pipeline.stages.ApiCallTimeoutTrackingStage.execute(ApiCallTimeoutTrackingStage.java:42)
at software.amazon.awssdk.core.internal.http.pipeline.stages.ApiCallMetricCollectionStage.execute(ApiCallMetricCollectionStage.java:50)
at software.amazon.awssdk.core.internal.http.pipeline.stages.ApiCallMetricCollectionStage.execute(ApiCallMetricCollectionStage.java:32)
at software.amazon.awssdk.core.internal.http.pipeline.RequestPipelineBuilder$ComposingRequestPipelineStage.execute(RequestPipelineBuilder.java:206)
at software.amazon.awssdk.core.internal.http.pipeline.RequestPipelineBuilder$ComposingRequestPipelineStage.execute(RequestPipelineBuilder.java:206)
at software.amazon.awssdk.core.internal.http.pipeline.stages.ExecutionFailureExceptionReportingStage.execute(ExecutionFailureExceptionReportingStage.java:37)
at software.amazon.awssdk.core.internal.http.pipeline.stages.ExecutionFailureExceptionReportingStage.execute(ExecutionFailureExceptionReportingStage.java:26)
at software.amazon.awssdk.core.internal.http.AmazonSyncHttpClient$RequestExecutionBuilderImpl.execute(AmazonSyncHttpClient.java:224)
at software.amazon.awssdk.core.internal.handler.BaseSyncClientHandler.invoke(BaseSyncClientHandler.java:103)
at software.amazon.awssdk.core.internal.handler.BaseSyncClientHandler.doExecute(BaseSyncClientHandler.java:173)
at software.amazon.awssdk.core.internal.handler.BaseSyncClientHandler.lambda$execute$1(BaseSyncClientHandler.java:80)
at software.amazon.awssdk.core.internal.handler.BaseSyncClientHandler.measureApiCallSuccess(BaseSyncClientHandler.java:182)
at software.amazon.awssdk.core.internal.handler.BaseSyncClientHandler.execute(BaseSyncClientHandler.java:74)
at software.amazon.awssdk.core.client.handler.SdkSyncClientHandler.execute(SdkSyncClientHandler.java:45)
at software.amazon.awssdk.awscore.client.handler.AwsSyncClientHandler.execute(AwsSyncClientHandler.java:53)
at software.amazon.awssdk.services.s3.DefaultS3Client.createBucket(DefaultS3Client.java:1156)
at software.amazon.awssdk.services.s3.S3Client.createBucket(S3Client.java:1708)
at [...]
at java.base/java.util.ArrayList.forEach(ArrayList.java:1596)
at java.base/java.util.ArrayList.forEach(ArrayList.java:1596)
Suppressed: software.amazon.awssdk.core.exception.SdkClientException: Request attempt 1 failure: Unable to execute HTTP request: Read timed out
Suppressed: software.amazon.awssdk.core.exception.SdkClientException: Request attempt 2 failure: Unable to execute HTTP request: Read timed out
Suppressed: software.amazon.awssdk.core.exception.SdkClientException: Request attempt 3 failure: Unable to execute HTTP request: Read timed out
Suppressed: java.lang.NullPointerException: Cannot invoke "java.util.Set.forEach(java.util.function.Consumer)" because the return value of "org.junit.jupiter.api.extension.ExtensionContext$Store.remove(Object, java.lang.Class)" is null
at org.mockito.junit.jupiter.MockitoExtension.afterEach(MockitoExtension.java:190)
Reproduction Steps
We are running the tests using Maven and creating a LocalStack instance using TestContainers (1.19.7). The relevant code is:
s3Client = S3Client.builder()
.endpointOverride(LOCAL_STACK_CONTAINER.getEndpointOverride(LocalStackContainer.Service.S3))
.credentialsProvider(StaticCredentialsProvider.create(AwsBasicCredentials.create(
LOCAL_STACK_CONTAINER.getAccessKey(), LOCAL_STACK_CONTAINER.getSecretKey())))
.region(Region.of(LOCAL_STACK_CONTAINER.getRegion()))
.build();
// ...
// It times out here (@BeforeEach):
s3Client.createBucket(request -> request.bucket(bucket));
// ...And here (@AfterEach):
s3Client.deleteBucket(request -> request.bucket(bucket));
Below is a simplified version of our .gitlab-ci.yml:
variables:
MAVEN_CLI_OPTS: >-
-gs ./ci_settings.xml
-Dmaven.repo.local=${CI_PROJECT_DIR}/.m2/repository
-Djavax.net.ssl.trustStore=./cacerts
--batch-mode
--errors
image: [...]/maven:3.9.4-amazoncorretto-21
code_quality_analysis:
before_script:
- >
export DOCKER_AUTH_CONFIG=[...]
script:
- > # Run tests
mvn clean verify $MAVEN_CLI_OPTS
-T 2C
# ...
I don't think our ci_settings.xml proxies are to blame, but just in case here's a simplified version of those:
<settings xmlns="http://maven.apache.org/SETTINGS/1.2.0"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://maven.apache.org/SETTINGS/1.2.0 https://maven.apache.org/xsd/settings-1.2.0.xsd">
<proxies>
<proxy>
<id>[...]-http</id>
<active>true</active>
<protocol>http</protocol>
<port>8080</port>
<host>[...]</host>
<nonProxyHosts>localhost|127.0.0.1|[...]</nonProxyHosts>
</proxy>
<proxy>
<id>[...]-https</id>
<active>true</active>
<protocol>https</protocol>
<port>8080</port>
<host>[...]</host>
<nonProxyHosts>localhost|127.0.0.1|[...]</nonProxyHosts>
</proxy>
</proxies>
</settings>
Possible Solution
it seems the issue was introduced in AWS SDK somewhere between 2.20.63 and 2.21.12. I would assume some code introduced between those version is the reason for this issue.
Additional Information/Context
I noticed the bug when upgrading spring-cloud-aws-dependencies from 3.0.1 to 3.0.3+.
spring-cloud-aws-dependencies:3.0.1 depends on AWS SDK 2.20.63, while spring-cloud-aws-dependencies:3.0.3 depends on 2.21.12.
I noticed the bug appears on spring-cloud-aws-dependencies:3.0.3, but disappears when I manually downgrade AWS SDK to 2.21.12. I tried upgrading AWS SDK to 2.25.17, but the problem persists. Therefore, it seems the issue was introduced in AWS SDK somewhere between 2.20.63 and 2.21.12.
The runner instance shows only mild memory and CPU consumption, so it doesn't seem like lack of resources is the problem here.
AWS Java SDK version used
2.21.12
JDK version used
maven:3.9.4-amazoncorretto-21 (docker image)
Operating System and version
Alpine Linux 3.16
After some more research, I determined that the issue starts with version 2.21.0.
On version 2.20.162, everything works correctly, while as soon as I update to 2.21.0, the CI build (and only it) starts timing out.
Could the internal refactor of authentication be to blame here?
For even more information, it seems that the S3 requests are not reaching LocalStack:
https://github.com/localstack/localstack/issues/10565#issuecomment-2026836909
Yes, 2.21.0 included a major authentication refactor. Localstack, being a third-party solution, is not guaranteed to be supported by changes made in the AWS SDK.
You did the recommended action and reached out to Localstack team. Let's keep track of their investigation - https://github.com/localstack/localstack/issues/10565
It looks like this issue has not been active for more than five days. In the absence of more information, we will be closing this issue soon. If you find that this is still a problem, please add a comment to prevent automatic closure, or if the issue is already closed please feel free to reopen it.
Unfortunately, the issue is still a problem. I'm planning to do some more digging soon and will hopefully be able to come up with some additional information.
It looks like this issue has not been active for more than five days. In the absence of more information, we will be closing this issue soon. If you find that this is still a problem, please add a comment to prevent automatic closure, or if the issue is already closed please feel free to reopen it.
This is still a problem and the investigation is ongoing.
It looks like this issue has not been active for more than five days. In the absence of more information, we will be closing this issue soon. If you find that this is still a problem, please add a comment to prevent automatic closure, or if the issue is already closed please feel free to reopen it.
Investigation still ongoing.
The issue turned out to be AWS SDK proxies not supporting IP ranges: https://github.com/aws/aws-sdk-java-v2/issues/5399
This issue is now closed. Comments on closed issues are hard for our team to see. If you need more assistance, please open a new issue that references this one.