
S3 timing out on CI (tests using LocalStack)

daniel-frak opened this issue 1 year ago · 9 comments

Describe the bug

AWS SDK 2.21.12+ causes S3 operations to time out on our GitLab CI runner (an AWS t3.xlarge instance). While the tests usually take 6-10 minutes, with the newer SDK they take up to 43 minutes before failing the build.

Importantly, everything works perfectly well on a local machine (the entire build takes about 3.5 minutes); the issue only occurs on CI. Furthermore, it does not occur on version 2.20.63, where all S3 operations take milliseconds, as expected (even on CI).

Expected Behavior

S3 operations should not time out during CI builds.

Current Behavior

S3 operations time out during CI builds.

software.amazon.awssdk.core.exception.SdkClientException: Unable to execute HTTP request: Read timed out
	at software.amazon.awssdk.core.exception.SdkClientException$BuilderImpl.build(SdkClientException.java:111)
	at software.amazon.awssdk.core.exception.SdkClientException.create(SdkClientException.java:47)
	at software.amazon.awssdk.core.internal.http.pipeline.stages.utils.RetryableStageHelper.setLastException(RetryableStageHelper.java:223)
	at software.amazon.awssdk.core.internal.http.pipeline.stages.RetryableStage.execute(RetryableStage.java:83)
	at software.amazon.awssdk.core.internal.http.pipeline.stages.RetryableStage.execute(RetryableStage.java:36)
	at software.amazon.awssdk.core.internal.http.pipeline.RequestPipelineBuilder$ComposingRequestPipelineStage.execute(RequestPipelineBuilder.java:206)
	at software.amazon.awssdk.core.internal.http.StreamManagingStage.execute(StreamManagingStage.java:56)
	at software.amazon.awssdk.core.internal.http.StreamManagingStage.execute(StreamManagingStage.java:36)
	at software.amazon.awssdk.core.internal.http.pipeline.stages.ApiCallTimeoutTrackingStage.executeWithTimer(ApiCallTimeoutTrackingStage.java:80)
	at software.amazon.awssdk.core.internal.http.pipeline.stages.ApiCallTimeoutTrackingStage.execute(ApiCallTimeoutTrackingStage.java:60)
	at software.amazon.awssdk.core.internal.http.pipeline.stages.ApiCallTimeoutTrackingStage.execute(ApiCallTimeoutTrackingStage.java:42)
	at software.amazon.awssdk.core.internal.http.pipeline.stages.ApiCallMetricCollectionStage.execute(ApiCallMetricCollectionStage.java:50)
	at software.amazon.awssdk.core.internal.http.pipeline.stages.ApiCallMetricCollectionStage.execute(ApiCallMetricCollectionStage.java:32)
	at software.amazon.awssdk.core.internal.http.pipeline.RequestPipelineBuilder$ComposingRequestPipelineStage.execute(RequestPipelineBuilder.java:206)
	at software.amazon.awssdk.core.internal.http.pipeline.RequestPipelineBuilder$ComposingRequestPipelineStage.execute(RequestPipelineBuilder.java:206)
	at software.amazon.awssdk.core.internal.http.pipeline.stages.ExecutionFailureExceptionReportingStage.execute(ExecutionFailureExceptionReportingStage.java:37)
	at software.amazon.awssdk.core.internal.http.pipeline.stages.ExecutionFailureExceptionReportingStage.execute(ExecutionFailureExceptionReportingStage.java:26)
	at software.amazon.awssdk.core.internal.http.AmazonSyncHttpClient$RequestExecutionBuilderImpl.execute(AmazonSyncHttpClient.java:224)
	at software.amazon.awssdk.core.internal.handler.BaseSyncClientHandler.invoke(BaseSyncClientHandler.java:103)
	at software.amazon.awssdk.core.internal.handler.BaseSyncClientHandler.doExecute(BaseSyncClientHandler.java:173)
	at software.amazon.awssdk.core.internal.handler.BaseSyncClientHandler.lambda$execute$1(BaseSyncClientHandler.java:80)
	at software.amazon.awssdk.core.internal.handler.BaseSyncClientHandler.measureApiCallSuccess(BaseSyncClientHandler.java:182)
	at software.amazon.awssdk.core.internal.handler.BaseSyncClientHandler.execute(BaseSyncClientHandler.java:74)
	at software.amazon.awssdk.core.client.handler.SdkSyncClientHandler.execute(SdkSyncClientHandler.java:45)
	at software.amazon.awssdk.awscore.client.handler.AwsSyncClientHandler.execute(AwsSyncClientHandler.java:53)
	at software.amazon.awssdk.services.s3.DefaultS3Client.createBucket(DefaultS3Client.java:1156)
	at software.amazon.awssdk.services.s3.S3Client.createBucket(S3Client.java:1708)
	at [...]
	at java.base/java.util.ArrayList.forEach(ArrayList.java:1596)
	at java.base/java.util.ArrayList.forEach(ArrayList.java:1596)
	Suppressed: software.amazon.awssdk.core.exception.SdkClientException: Request attempt 1 failure: Unable to execute HTTP request: Read timed out
	Suppressed: software.amazon.awssdk.core.exception.SdkClientException: Request attempt 2 failure: Unable to execute HTTP request: Read timed out
	Suppressed: software.amazon.awssdk.core.exception.SdkClientException: Request attempt 3 failure: Unable to execute HTTP request: Read timed out
	Suppressed: java.lang.NullPointerException: Cannot invoke "java.util.Set.forEach(java.util.function.Consumer)" because the return value of "org.junit.jupiter.api.extension.ExtensionContext$Store.remove(Object, java.lang.Class)" is null
		at org.mockito.junit.jupiter.MockitoExtension.afterEach(MockitoExtension.java:190)

Reproduction Steps

We are running the tests with Maven and creating a LocalStack instance using Testcontainers (1.19.7). The relevant code is:

import org.testcontainers.containers.localstack.LocalStackContainer;

import software.amazon.awssdk.auth.credentials.AwsBasicCredentials;
import software.amazon.awssdk.auth.credentials.StaticCredentialsProvider;
import software.amazon.awssdk.regions.Region;
import software.amazon.awssdk.services.s3.S3Client;

// Client pointed at the LocalStack container started by Testcontainers:
s3Client = S3Client.builder()
  .endpointOverride(LOCAL_STACK_CONTAINER.getEndpointOverride(LocalStackContainer.Service.S3))
  .credentialsProvider(StaticCredentialsProvider.create(AwsBasicCredentials.create(
    LOCAL_STACK_CONTAINER.getAccessKey(), LOCAL_STACK_CONTAINER.getSecretKey())))
  .region(Region.of(LOCAL_STACK_CONTAINER.getRegion()))
  .build();

// ...

// It times out here (@BeforeEach):
s3Client.createBucket(request -> request.bucket(bucket));

// ...and here (@AfterEach):
s3Client.deleteBucket(request -> request.bucket(bucket));
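
For anyone reproducing this: with the default configuration each failed request blocks for a long time across retries, which is why the build runs for up to 43 minutes before failing. Tightening the client timeouts should make the failure surface in seconds instead. A minimal sketch, assuming the default Apache HTTP client (the timeout values are arbitrary; this is a debugging aid, not a fix):

import java.time.Duration;

import software.amazon.awssdk.core.client.config.ClientOverrideConfiguration;
import software.amazon.awssdk.http.apache.ApacheHttpClient;
import software.amazon.awssdk.services.s3.S3Client;

S3Client debugClient = S3Client.builder()
  // ...endpointOverride, credentialsProvider and region as above...
  .overrideConfiguration(ClientOverrideConfiguration.builder()
    .apiCallAttemptTimeout(Duration.ofSeconds(5)) // ceiling per request attempt
    .apiCallTimeout(Duration.ofSeconds(15))       // total ceiling across all retries
    .build())
  .httpClientBuilder(ApacheHttpClient.builder()
    .connectionTimeout(Duration.ofSeconds(2))
    .socketTimeout(Duration.ofSeconds(5)))
  .build();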

Below is a simplified version of our .gitlab-ci.yml:

variables:
  MAVEN_CLI_OPTS: >-
    -gs ./ci_settings.xml 
    -Dmaven.repo.local=${CI_PROJECT_DIR}/.m2/repository 
    -Djavax.net.ssl.trustStore=./cacerts
    --batch-mode
    --errors

image: [...]/maven:3.9.4-amazoncorretto-21

code_quality_analysis:
  before_script:
    - >
      export DOCKER_AUTH_CONFIG=[...]
  script:
    - > # Run tests
      mvn clean verify $MAVEN_CLI_OPTS
      -T 2C
# ...

I don't think our ci_settings.xml proxies are to blame, but just in case, here's a simplified version of them:

<settings xmlns="http://maven.apache.org/SETTINGS/1.2.0"
          xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
          xsi:schemaLocation="http://maven.apache.org/SETTINGS/1.2.0 https://maven.apache.org/xsd/settings-1.2.0.xsd">
  
  <proxies>
    <proxy>
      <id>[...]-http</id>
      <active>true</active>
      <protocol>http</protocol>
      <port>8080</port>
      <host>[...]</host>
      <nonProxyHosts>localhost|127.0.0.1|[...]</nonProxyHosts>
    </proxy>
    <proxy>
      <id>[...]-https</id>
      <active>true</active>
      <protocol>https</protocol>
      <port>8080</port>
      <host>[...]</host>
      <nonProxyHosts>localhost|127.0.0.1|[...]</nonProxyHosts>
    </proxy>
  </proxies>
</settings>
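
Note that Maven proxies from settings.xml apply to Maven's own artifact downloads and are not automatically propagated as system properties to the forked test JVM. A quick throwaway snippet (mine, not part of our suite) to check what proxy configuration the tests actually see:

// Prints the proxy-related settings visible to the test JVM.
System.out.println("http.proxyHost     = " + System.getProperty("http.proxyHost"));
System.out.println("https.proxyHost    = " + System.getProperty("https.proxyHost"));
System.out.println("http.nonProxyHosts = " + System.getProperty("http.nonProxyHosts"));
System.out.println("HTTP_PROXY  = " + System.getenv("HTTP_PROXY"));
System.out.println("HTTPS_PROXY = " + System.getenv("HTTPS_PROXY"));
System.out.println("NO_PROXY    = " + System.getenv("NO_PROXY"));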

Possible Solution

It seems the issue was introduced in the AWS SDK somewhere between versions 2.20.63 and 2.21.12, so I would assume some code introduced between those versions is the cause.

Additional Information/Context

I noticed the bug when upgrading spring-cloud-aws-dependencies from 3.0.1 to 3.0.3+. spring-cloud-aws-dependencies:3.0.1 depends on AWS SDK 2.20.63, while spring-cloud-aws-dependencies:3.0.3 depends on 2.21.12.

I noticed the bug appears on spring-cloud-aws-dependencies:3.0.3 but disappears when I manually downgrade the AWS SDK to 2.20.63. I tried upgrading the AWS SDK to 2.25.17, but the problem persists. Therefore, it seems the issue was introduced somewhere between 2.20.63 and 2.21.12.

The runner instance shows only mild memory and CPU consumption, so a lack of resources doesn't seem to be the problem here.

AWS Java SDK version used

2.21.12

JDK version used

maven:3.9.4-amazoncorretto-21 (Docker image)

Operating System and version

Alpine Linux 3.16

daniel-frak avatar Mar 26 '24 13:03 daniel-frak

After some more research, I determined that the issue starts with version 2.21.0.

On version 2.20.162 everything works correctly, but as soon as I update to 2.21.0, the CI build (and only the CI build) starts timing out.

Could the internal refactor of authentication be to blame here?

daniel-frak avatar Mar 26 '24 14:03 daniel-frak

To add more information: it seems the S3 requests are never reaching LocalStack:

https://github.com/localstack/localstack/issues/10565#issuecomment-2026836909
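
For anyone who wants to verify this on their side, an ExecutionInterceptor that logs every outgoing endpoint makes it easy to see what the client is actually targeting (a sketch; endpointLogger is my own name for it):

import software.amazon.awssdk.core.interceptor.Context;
import software.amazon.awssdk.core.interceptor.ExecutionAttributes;
import software.amazon.awssdk.core.interceptor.ExecutionInterceptor;

// Logs the resolved URI of every request just before it is sent over the wire.
ExecutionInterceptor endpointLogger = new ExecutionInterceptor() {
  @Override
  public void beforeTransmission(Context.BeforeTransmission context,
                                 ExecutionAttributes executionAttributes) {
    System.out.println("Sending request to: " + context.httpRequest().getUri());
  }
};

// Attach it to the client under test:
// s3ClientBuilder.overrideConfiguration(o -> o.addExecutionInterceptor(endpointLogger));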

daniel-frak avatar Mar 29 '24 08:03 daniel-frak

Yes, 2.21.0 included a major authentication refactor. LocalStack, being a third-party solution, is not guaranteed to be compatible with changes made in the AWS SDK.

You took the recommended action and reached out to the LocalStack team. Let's keep track of their investigation - https://github.com/localstack/localstack/issues/10565

debora-ito avatar Apr 03 '24 18:04 debora-ito

It looks like this issue has not been active for more than five days. In the absence of more information, we will be closing this issue soon. If you find that this is still a problem, please add a comment to prevent automatic closure, or if the issue is already closed please feel free to reopen it.

github-actions[bot] avatar Apr 13 '24 21:04 github-actions[bot]

Unfortunately, the issue is still a problem. I'm planning to do some more digging soon and will hopefully be able to come up with some additional information.

daniel-frak avatar Apr 16 '24 05:04 daniel-frak

It looks like this issue has not been active for more than five days. In the absence of more information, we will be closing this issue soon. If you find that this is still a problem, please add a comment to prevent automatic closure, or if the issue is already closed please feel free to reopen it.

github-actions[bot] avatar Apr 26 '24 06:04 github-actions[bot]

This is still a problem and the investigation is ongoing.

daniel-frak avatar Apr 26 '24 06:04 daniel-frak

It looks like this issue has not been active for more than five days. In the absence of more information, we will be closing this issue soon. If you find that this is still a problem, please add a comment to prevent automatic closure, or if the issue is already closed please feel free to reopen it.

github-actions[bot] avatar May 06 '24 09:05 github-actions[bot]

Investigation still ongoing.

daniel-frak avatar May 06 '24 12:05 daniel-frak

The issue turned out to be the AWS SDK's proxy configuration not supporting IP ranges: https://github.com/aws/aws-sdk-java-v2/issues/5399
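
Until that is resolved, one possible workaround for tests is to stop the client from picking up the runner's proxy settings at all. A sketch, assuming the default Apache HTTP client (method names from the SDK's apache-client module):

import java.util.Set;

import software.amazon.awssdk.http.apache.ApacheHttpClient;
import software.amazon.awssdk.http.apache.ProxyConfiguration;
import software.amazon.awssdk.services.s3.S3Client;

S3Client s3Client = S3Client.builder()
  .httpClientBuilder(ApacheHttpClient.builder()
    .proxyConfiguration(ProxyConfiguration.builder()
      .useSystemPropertyValues(false) // ignore http(s).proxyHost and friends
      .nonProxyHosts(Set.of("localhost", "127.0.0.1")) // never proxy the container
      .build()))
  // ...endpointOverride, credentialsProvider and region as before...
  .build();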

daniel-frak avatar Jul 24 '24 06:07 daniel-frak

This issue is now closed. Comments on closed issues are hard for our team to see. If you need more assistance, please open a new issue that references this one.

github-actions[bot] avatar Jul 24 '24 06:07 github-actions[bot]