aws-sdk-java-v2 icon indicating copy to clipboard operation
aws-sdk-java-v2 copied to clipboard

Add ServiceCallDuration to ApiCallAttempt metrics even for Http timeouts

Open swaranga opened this issue 4 years ago • 6 comments

Today, if a call encounters Http timeout, the metric does not include how much time was spent in that particular attempt. Given that SDK metrics are really to debug such issues, it will be very helpful if the ApiCallAttempt metric collection included the ServiceCallDuration metric.

I have reproduced this issue the following way:

A simple program to make an SDK call with a very small timeout guaranteed to throw before the call can finish:

KinesisClient kc = ...;
ks.listStreams(
  ListStreamsRequest.builder()
    .overrideConfiguration(c -> c..apiCallAttemptTimeout(Duration.ofMillis(5))) // almost guaranteed to timeout
    .build()
);

Running this code logs the following metrics:

MetricCollection(
  name=ApiCall, 
  metrics=[
    MetricRecord(metric=MarshallingDuration, value=PT0.04227928S), 
    MetricRecord(metric=RetryCount, value=3), 
    MetricRecord(metric=ApiCallSuccessful, value=false), 
    MetricRecord(metric=OperationName, value=ListStreams), 
    MetricRecord(metric=ApiCallDuration, value=PT0.754417876S), 
    MetricRecord(metric=CredentialsFetchDuration, value=PT1.542674972S), 
    MetricRecord(metric=ServiceId, value=Kinesis)
  ], 
  children=[
    MetricCollection(
      name=ApiCallAttempt, 
      metrics=[
        MetricRecord(metric=BackoffDelayDuration, value=PT0S), 
        MetricRecord(metric=SigningDuration, value=PT0.019085065S)
      ], 
      children=[
        MetricCollection(
          name=HttpClient, 
          metrics=[
            MetricRecord(metric=HttpClientName, value=Apache)
          ], 
          children=[]
        )
      ]
    ), 
    MetricCollection(
      name=ApiCallAttempt, 
      metrics=[
        MetricRecord(metric=BackoffDelayDuration, value=PT0.091S), 
        MetricRecord(metric=SigningDuration, value=PT0.001483931S)
      ], 
      children=[
        MetricCollection(
          name=HttpClient, 
          metrics=[
            MetricRecord(metric=HttpClientName, value=Apache)
          ], 
          children=[]
        )
      ]
    ), 
    MetricCollection(
      name=ApiCallAttempt, 
      metrics=[
        MetricRecord(metric=BackoffDelayDuration, value=PT0.172S), 
        MetricRecord(metric=SigningDuration, value=PT0.001785582S)
      ], 
      children=[
        MetricCollection(
          name=HttpClient, 
          metrics=[
            MetricRecord(metric=HttpClientName, value=Apache)
          ], 
          children=[]
        )
      ]
    ), 
    MetricCollection(
      name=ApiCallAttempt, 
      metrics=[
        MetricRecord(metric=BackoffDelayDuration, value=PT0.205S), 
        MetricRecord(metric=SigningDuration, value=PT0.001690817S)
      ], 
      children=[
        MetricCollection(
          name=HttpClient, 
          metrics=[
            MetricRecord(metric=HttpClientName, value=Apache)
          ], 
          children=[]
        )
      ]
    )
  ]
)

We can see that the ApiCallAttempt metrics does not include the ServiceCallDuration.

swaranga avatar May 19 '21 00:05 swaranga

Hi @swaranga this may be a bug, I'll investigate.

debora-ito avatar May 20 '21 16:05 debora-ito

Ping. Any updates?

swaranga avatar Jul 14 '21 00:07 swaranga

@swaranga I ran some local tests and noticed that the ServiceCallDuration metric does show up in some cases. For example, these were generated by one of my calls:

 
MetricCollection(
    name=ApiCall, 
    metrics=[
        MetricRecord(metric=MarshallingDuration, value=PT0.018306514S), 
        MetricRecord(metric=RetryCount, value=3), 
        MetricRecord(metric=ApiCallSuccessful, value=false), 
        MetricRecord(metric=OperationName, value=GetObject), 
        MetricRecord(metric=ApiCallDuration, value=PT2.114357895S), 
        MetricRecord(metric=CredentialsFetchDuration, value=PT0.036746435S), 
        MetricRecord(metric=ServiceId, value=S3)],

    children=[
        MetricCollection(
            name=ApiCallAttempt, 
            metrics=[
                MetricRecord(metric=BackoffDelayDuration, value=PT0S), 
                MetricRecord(metric=SigningDuration, value=PT0.04171945S)], 
            children=[]
        ), 
            
        MetricCollection(
            name=ApiCallAttempt, 
            metrics=[
                MetricRecord(metric=BackoffDelayDuration, value=PT0.08S), 
                MetricRecord(metric=SigningDuration, value=PT0.001068345S), 
                MetricRecord(metric=ServiceCallDuration, value=PT1.657009986S)], 
            children=[
                MetricCollection(
                    name=HttpClient, 
                    metrics=[
                        MetricRecord(metric=MaxConcurrency, value=50), 
                        MetricRecord(metric=AvailableConcurrency, value=0), 
                        MetricRecord(metric=LeasedConcurrency, value=1), 
                        MetricRecord(metric=PendingConcurrencyAcquires, value=0), 
                        MetricRecord(metric=HttpClientName, value=Apache)], 
                    children=[])]
        ),
 
        MetricCollection(
            name=ApiCallAttempt, 
            metrics=[
                MetricRecord(metric=BackoffDelayDuration, value=PT0.047S), 
                MetricRecord(metric=SigningDuration, value=PT0.00076821S)], 
            children=[
                MetricCollection(
                    name=HttpClient, 
                    metrics=[MetricRecord(metric=HttpClientName, value=Apache)], children=[])]
        ), 

        MetricCollection(
            name=ApiCallAttempt, 
            metrics=[
                MetricRecord(metric=BackoffDelayDuration, value=PT0.19S), 
                MetricRecord(metric=SigningDuration, value=PT0.002128439S)], 
            children=[
                MetricCollection(
                    name=HttpClient, 
                    metrics=[MetricRecord(metric=HttpClientName, value=Apache)], children=[])]
        )
    ]
)

Depending on when the timeout occurred in the call attempt, the ServiceCallDuration metric gets generated or not.

Let us know if this makes sense.

debora-ito avatar Jul 15 '21 00:07 debora-ito

@debora-ito - not sure I would say it makes sense. I can consistently reproduce it with the following program:

import java.time.Duration;

import software.amazon.awssdk.auth.credentials.AwsCredentials;
import software.amazon.awssdk.auth.credentials.AwsCredentialsProvider;
import software.amazon.awssdk.metrics.MetricCollection;
import software.amazon.awssdk.metrics.MetricPublisher;
import software.amazon.awssdk.regions.Region;
import software.amazon.awssdk.services.kinesis.KinesisClient;
import software.amazon.awssdk.services.kinesis.model.ListStreamsRequest;

public final class App {
    public static void main(String[] args) {
        KinesisClient kinesisClient = KinesisClient.builder()
                                        .credentialsProvider(new AwsCredentialsProvider(){
                                            @Override
                                            public AwsCredentials resolveCredentials() {
                                                return new AwsCredentials(){

                                                    @Override
                                                    public String accessKeyId() {
                                                        return "asdf"; // does not matter
                                                    }

                                                    @Override
                                                    public String secretAccessKey() {
                                                        return "asdf"; // does not matter
                                                    }
                                                };
                                            }
                                            
                                        }).region(Region.US_EAST_1)
                                        .overrideConfiguration(c -> c.addMetricPublisher(new MetricPublisher() {

                                            @Override
                                            public void publish(MetricCollection metricCollection) {
                                                System.out.println(metricCollection);
                                            }

                                            @Override
                                            public void close() { }
                                            
                                        }))
                                        .build();


        try {
            kinesisClient.listStreams(
            ListStreamsRequest.builder()
                .overrideConfiguration(c -> c.apiCallAttemptTimeout(Duration.ofMillis(25))) // almost guaranteed to timeout
                .build()
            );
        } catch (Exception ex) {
            System.out.println(ex);
        }        
    }
}

Even in your example, I can see the service call duration was not added for each ApiCallAttempt metric collection - I can only see it in the second attempt.

Obviously there seems to be a code path under which the ServiceCallDuration metrics is not collected. I am not familiar enough with the SDK codebase to be able to easily say where the bug lies which is why I have opened this issue.

swaranga avatar Jul 17 '21 23:07 swaranga

~~I also request that the "closing-soon" label be removed until we get to the bottom of this.~~

swaranga avatar Jul 17 '21 23:07 swaranga

Ping. Any updates on this? Actually, do you take pull requests?

swaranga avatar Oct 12 '21 18:10 swaranga