Add ServiceCallDuration to ApiCallAttempt metrics even for Http timeouts
Today, if a call encounters Http timeout, the metric does not include how much time was spent in that particular attempt. Given that SDK metrics are really to debug such issues, it will be very helpful if the ApiCallAttempt metric collection included the ServiceCallDuration metric.
I have reproduced this issue the following way:
A simple program to make an SDK call with a very small timeout guaranteed to throw before the call can finish:
KinesisClient kc = ...;
ks.listStreams(
ListStreamsRequest.builder()
.overrideConfiguration(c -> c..apiCallAttemptTimeout(Duration.ofMillis(5))) // almost guaranteed to timeout
.build()
);
Running this code logs the following metrics:
MetricCollection(
name=ApiCall,
metrics=[
MetricRecord(metric=MarshallingDuration, value=PT0.04227928S),
MetricRecord(metric=RetryCount, value=3),
MetricRecord(metric=ApiCallSuccessful, value=false),
MetricRecord(metric=OperationName, value=ListStreams),
MetricRecord(metric=ApiCallDuration, value=PT0.754417876S),
MetricRecord(metric=CredentialsFetchDuration, value=PT1.542674972S),
MetricRecord(metric=ServiceId, value=Kinesis)
],
children=[
MetricCollection(
name=ApiCallAttempt,
metrics=[
MetricRecord(metric=BackoffDelayDuration, value=PT0S),
MetricRecord(metric=SigningDuration, value=PT0.019085065S)
],
children=[
MetricCollection(
name=HttpClient,
metrics=[
MetricRecord(metric=HttpClientName, value=Apache)
],
children=[]
)
]
),
MetricCollection(
name=ApiCallAttempt,
metrics=[
MetricRecord(metric=BackoffDelayDuration, value=PT0.091S),
MetricRecord(metric=SigningDuration, value=PT0.001483931S)
],
children=[
MetricCollection(
name=HttpClient,
metrics=[
MetricRecord(metric=HttpClientName, value=Apache)
],
children=[]
)
]
),
MetricCollection(
name=ApiCallAttempt,
metrics=[
MetricRecord(metric=BackoffDelayDuration, value=PT0.172S),
MetricRecord(metric=SigningDuration, value=PT0.001785582S)
],
children=[
MetricCollection(
name=HttpClient,
metrics=[
MetricRecord(metric=HttpClientName, value=Apache)
],
children=[]
)
]
),
MetricCollection(
name=ApiCallAttempt,
metrics=[
MetricRecord(metric=BackoffDelayDuration, value=PT0.205S),
MetricRecord(metric=SigningDuration, value=PT0.001690817S)
],
children=[
MetricCollection(
name=HttpClient,
metrics=[
MetricRecord(metric=HttpClientName, value=Apache)
],
children=[]
)
]
)
]
)
We can see that the ApiCallAttempt metrics does not include the ServiceCallDuration.
Hi @swaranga this may be a bug, I'll investigate.
Ping. Any updates?
@swaranga I ran some local tests and noticed that the ServiceCallDuration metric does show up in some cases. For example, these were generated by one of my calls:
MetricCollection(
name=ApiCall,
metrics=[
MetricRecord(metric=MarshallingDuration, value=PT0.018306514S),
MetricRecord(metric=RetryCount, value=3),
MetricRecord(metric=ApiCallSuccessful, value=false),
MetricRecord(metric=OperationName, value=GetObject),
MetricRecord(metric=ApiCallDuration, value=PT2.114357895S),
MetricRecord(metric=CredentialsFetchDuration, value=PT0.036746435S),
MetricRecord(metric=ServiceId, value=S3)],
children=[
MetricCollection(
name=ApiCallAttempt,
metrics=[
MetricRecord(metric=BackoffDelayDuration, value=PT0S),
MetricRecord(metric=SigningDuration, value=PT0.04171945S)],
children=[]
),
MetricCollection(
name=ApiCallAttempt,
metrics=[
MetricRecord(metric=BackoffDelayDuration, value=PT0.08S),
MetricRecord(metric=SigningDuration, value=PT0.001068345S),
MetricRecord(metric=ServiceCallDuration, value=PT1.657009986S)],
children=[
MetricCollection(
name=HttpClient,
metrics=[
MetricRecord(metric=MaxConcurrency, value=50),
MetricRecord(metric=AvailableConcurrency, value=0),
MetricRecord(metric=LeasedConcurrency, value=1),
MetricRecord(metric=PendingConcurrencyAcquires, value=0),
MetricRecord(metric=HttpClientName, value=Apache)],
children=[])]
),
MetricCollection(
name=ApiCallAttempt,
metrics=[
MetricRecord(metric=BackoffDelayDuration, value=PT0.047S),
MetricRecord(metric=SigningDuration, value=PT0.00076821S)],
children=[
MetricCollection(
name=HttpClient,
metrics=[MetricRecord(metric=HttpClientName, value=Apache)], children=[])]
),
MetricCollection(
name=ApiCallAttempt,
metrics=[
MetricRecord(metric=BackoffDelayDuration, value=PT0.19S),
MetricRecord(metric=SigningDuration, value=PT0.002128439S)],
children=[
MetricCollection(
name=HttpClient,
metrics=[MetricRecord(metric=HttpClientName, value=Apache)], children=[])]
)
]
)
Depending on when the timeout occurred in the call attempt, the ServiceCallDuration metric gets generated or not.
Let us know if this makes sense.
@debora-ito - not sure I would say it makes sense. I can consistently reproduce it with the following program:
import java.time.Duration;
import software.amazon.awssdk.auth.credentials.AwsCredentials;
import software.amazon.awssdk.auth.credentials.AwsCredentialsProvider;
import software.amazon.awssdk.metrics.MetricCollection;
import software.amazon.awssdk.metrics.MetricPublisher;
import software.amazon.awssdk.regions.Region;
import software.amazon.awssdk.services.kinesis.KinesisClient;
import software.amazon.awssdk.services.kinesis.model.ListStreamsRequest;
public final class App {
public static void main(String[] args) {
KinesisClient kinesisClient = KinesisClient.builder()
.credentialsProvider(new AwsCredentialsProvider(){
@Override
public AwsCredentials resolveCredentials() {
return new AwsCredentials(){
@Override
public String accessKeyId() {
return "asdf"; // does not matter
}
@Override
public String secretAccessKey() {
return "asdf"; // does not matter
}
};
}
}).region(Region.US_EAST_1)
.overrideConfiguration(c -> c.addMetricPublisher(new MetricPublisher() {
@Override
public void publish(MetricCollection metricCollection) {
System.out.println(metricCollection);
}
@Override
public void close() { }
}))
.build();
try {
kinesisClient.listStreams(
ListStreamsRequest.builder()
.overrideConfiguration(c -> c.apiCallAttemptTimeout(Duration.ofMillis(25))) // almost guaranteed to timeout
.build()
);
} catch (Exception ex) {
System.out.println(ex);
}
}
}
Even in your example, I can see the service call duration was not added for each ApiCallAttempt metric collection - I can only see it in the second attempt.
Obviously there seems to be a code path under which the ServiceCallDuration metrics is not collected. I am not familiar enough with the SDK codebase to be able to easily say where the bug lies which is why I have opened this issue.
~~I also request that the "closing-soon" label be removed until we get to the bottom of this.~~
Ping. Any updates on this? Actually, do you take pull requests?