
Incorrect StartTimeUnixNano for ExponentialHistogram with Delta Temporality after gRPC connection drop

Open alexchowle opened this issue 1 year ago • 10 comments

Describe your environment

OS: CentOS 7
Python version: 3.11.1
SDK version: 1.25.0
API version: 1.25.0

What happened?

While investigating https://github.com/open-telemetry/opentelemetry-python/issues/3971, I found that the loss and subsequent re-establishment of a gRPC connection to a Collector caused the next metric export to carry a very old StartTimeUnixNano value.

Steps to Reproduce

import time

from opentelemetry import metrics
from opentelemetry.exporter.otlp.proto.grpc.metric_exporter import OTLPMetricExporter
from opentelemetry.sdk.metrics import Histogram, MeterProvider
from opentelemetry.sdk.metrics.export import AggregationTemporality, PeriodicExportingMetricReader
from opentelemetry.sdk.metrics.view import ExponentialBucketHistogramAggregation, View

metrics_temporality = {Histogram: AggregationTemporality.DELTA}
otlp_metric_exporter = OTLPMetricExporter(
    endpoint=grpc_collector_address,  # address of the Collector's gRPC receiver
    insecure=True,
    preferred_temporality=metrics_temporality,
    timeout=30_000,
)
metric_reader = PeriodicExportingMetricReader(
    otlp_metric_exporter, export_interval_millis=60_000
)
meter_provider = MeterProvider(
    metric_readers=[metric_reader],
    resource=_create_resources(),  # helper that builds the Resource
    views=[
        View(
            instrument_type=Histogram,
            aggregation=ExponentialBucketHistogramAggregation(),
        )
    ],
)
metrics.set_meter_provider(meter_provider)

meter = metrics.get_meter(__name__)

h = meter.create_histogram(name='my_histogram', unit='s', description='just my histogram')

h.record(1.0)
time.sleep(360)
h.record(2.0)

Have the OTel Collector drop the client's gRPC connection by setting the max_connection_age to 5m (as documented at https://github.com/open-telemetry/opentelemetry-collector/blob/main/config/configgrpc/README.md#server-configuration)
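For reference, a minimal Collector config sketch that forces the 5-minute connection drop. This assumes the standard OTLP gRPC receiver; the `keepalive.server_parameters` field names follow the configgrpc README linked above, and the endpoint and grace period are illustrative:

```yaml
# Sketch only -- adjust the endpoint and your pipelines to suit.
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
        keepalive:
          server_parameters:
            max_connection_age: 5m        # force-close client connections after 5 minutes
            max_connection_age_grace: 10s # allow in-flight RPCs to finish
```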

So the first recording happens; during the 6-minute sleep the Collector drops the client's gRPC connection (within 5 minutes); then the client records a second observation after the 6 minutes have elapsed.

Expected Result

StartTimeUnixNano has a value that is within 60 seconds of the 2nd recording. This is how the Golang SDK works.

Actual Result

StartTimeUnixNano has the TimeUnixNano of observation #1, i.e. about 6 minutes in the past.

Observation # 1
-----------------
Metric #0                                                                                                
Descriptor:                                                                                              
     -> Name: http.server.request.duration                                                               
     -> Description: Duration of HTTP server requests                                                    
     -> Unit: s                                                                                          
     -> DataType: ExponentialHistogram                                                                   
     -> AggregationTemporality: Delta                                                                    
ExponentialHistogramDataPoints #0                                                                        
Data point attributes:
     -> http.request.method: Str(POST)
     -> http.route: Str(/sources/raw)
     -> http.response.status_code: Int(202)
     -> url.scheme: Str(http)
StartTimestamp: 2024-06-14 06:25:40.301137322 +0000 UTC
Timestamp: 2024-06-14 06:25:53.980223433 +0000 UTC
Count: 1
Sum: 0.021235
Min: 0.021235
Max: 0.021235
Bucket [-1.000001, -1.000000), Count: 0
Bucket (0.021235, 0.021235], Count: 1


Observation #2
-----------------
Metric #0
Descriptor:
     -> Name: http.server.request.duration
     -> Description: Duration of HTTP server requests
     -> Unit: s
     -> DataType: ExponentialHistogram
     -> AggregationTemporality: Delta
ExponentialHistogramDataPoints #0
Data point attributes:
     -> http.request.method: Str(POST)
     -> http.route: Str(/sources/raw)
     -> http.response.status_code: Int(202)
     -> url.scheme: Str(http)
StartTimestamp: 2024-06-14 06:25:53.980223433 +0000 UTC
Timestamp: 2024-06-14 06:31:53.98886037 +0000 UTC
Count: 1
Sum: 0.018851
Min: 0.018851
Max: 0.018851
Bucket [-1.000001, -1.000000), Count: 0
Bucket (0.018851, 0.018851], Count: 1

Additional context

No response

Would you like to implement a fix?

None

alexchowle avatar Jun 15 '24 09:06 alexchowle

I've performed more testing and it seems the gRPC disconnection is irrelevant; the aged StartTimeUnixNano is used regardless.

alexchowle avatar Jun 18 '24 06:06 alexchowle

@alexchowle care to give #3978 a run?

xrmx avatar Jun 26 '24 08:06 xrmx

@alexchowle care to give #3978 a run?

I've run the reproduction scenario and am still seeing the same issue.

Observation # 1
-----------------
Metric #0                                                                                                
Descriptor:                                                                                              
     -> Name: http.server.request.duration                                                               
     -> Description: Duration of HTTP server requests                                                    
     -> Unit: s                                                                                          
     -> DataType: ExponentialHistogram                                                                   
     -> AggregationTemporality: Delta                                                                    
ExponentialHistogramDataPoints #0                                                                        
Data point attributes:
     -> http.request.method: Str(POST)
     -> http.route: Str(/sources/raw)
     -> http.response.status_code: Int(202)
     -> url.scheme: Str(http)
StartTimestamp: 2024-06-26 18:50:51.262250586 +0000 UTC
Timestamp: 2024-06-26 18:51:01.262261958 +0000 UTC
Count: 1
Sum: 1.000000
Min: 1.000000
Max: 1.000000
Bucket [-1.000001, -1.000000), Count: 0
Bucket (0.999999, 1.000000], Count: 1


Observation #2
-----------------
Metric #0
Descriptor:
     -> Name: http.server.request.duration
     -> Description: Duration of HTTP server requests
     -> Unit: s
     -> DataType: ExponentialHistogram
     -> AggregationTemporality: Delta
ExponentialHistogramDataPoints #0
Data point attributes:
     -> http.request.method: Str(POST)
     -> http.route: Str(/sources/raw)
     -> http.response.status_code: Int(202)
     -> url.scheme: Str(http)
StartTimestamp: 2024-06-26 18:51:01.226261958 +0000 UTC
Timestamp: 2024-06-26 18:56:51.263023031 +0000 UTC
Count: 1
Sum: 2.000000
Min: 2.000000
Max: 2.000000
Bucket [-1.000001, -1.000000), Count: 0
Bucket (1.999999, 2.000000], Count: 1

I'm pretty confident I'm using the correct whl files.

I'm picking up the new code.

I'm thinking I'm doing something wrong with my builds/imports of the wheels.

alexchowle avatar Jun 26 '24 19:06 alexchowle


From the data above you are making 2 observations, with the following timestamps:

> Observation # 1
> -----------------
> ...
> StartTimestamp: 2024-06-26 18:50:51.262250586 +0000 UTC
> Timestamp: 2024-06-26 18:51:01.262261958 +0000 UTC
> ...
> Observation # 2
> -----------------
> ...
> StartTimestamp: 2024-06-26 18:51:01.226261958 +0000 UTC
> Timestamp: 2024-06-26 18:56:51.263023031 +0000 UTC
> ...

Observation #1's Timestamp matches Observation #2's StartTimestamp: Timestamp 2024-06-26 18:51:01.262261958 +0000 UTC == StartTimestamp 2024-06-26 18:51:01.226261958 +0000 UTC.

This is expected, the start time of a DELTA temporality collected exponential histogram data point is the value of the collection time of the previous DELTA temporality collected exponential histogram data point.

For this behavior to change to match the Golang behavior, the entire _ExponentialBucketHistogramAggregation would need to be reinstantiated after the first point is collected and before the second point is collected. I wonder if that is what Golang is doing; I imagine something like that happens when the gRPC connection drops, but I am not sure if that behavior is defined somewhere in the spec :shrug:

I'll ask @jmacd for guidance here.

ocelotl avatar Jun 27 '24 18:06 ocelotl

I only noticed this issue because my vendor backend uses StartTimeUnixNano as the observation timestamp. My service has a very low throughput.

alexchowle avatar Jun 27 '24 18:06 alexchowle

Sorry, please disregard my previous answer. I need to do more thinking about this issue and consider the export interval of the periodic exporter; I will comment here when I have a better answer :v:

ocelotl avatar Jun 27 '24 18:06 ocelotl

@alexchowle I have created a branch that I think should fix your issue: #4002

I am just adding a commit so that the previous collection time gets updated with every collection, regardless of whether aggregate is called between collection cycles.
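For readers following along, the bookkeeping change described above can be sketched roughly like this. This is a hypothetical simplification, not the actual SDK or PR code; the class and field names are invented for illustration:

```python
import time

# Simplified sketch of delta-temporality timestamp bookkeeping.
# The key point: the previous collection time must advance on EVERY
# collection, even when no measurements arrived in the interval,
# so each point's start time is the previous collection's end time.
class DeltaHistogramSketch:
    def __init__(self) -> None:
        self._previous_collection_ns = time.time_ns()
        self._count = 0

    def record(self, value: float) -> None:
        self._count += 1

    def collect(self) -> dict:
        now_ns = time.time_ns()
        point = {
            "start_time_unix_nano": self._previous_collection_ns,
            "time_unix_nano": now_ns,
            "count": self._count,
        }
        # Advance regardless of whether record() was called.
        self._previous_collection_ns = now_ns
        self._count = 0
        return point

h = DeltaHistogramSketch()
h.record(1.0)
p1 = h.collect()  # first interval, count 1
p2 = h.collect()  # empty interval, count 0 -- start time still advances
h.record(2.0)
p3 = h.collect()  # third interval, count 1

assert p2["start_time_unix_nano"] == p1["time_unix_nano"]
assert p3["start_time_unix_nano"] == p2["time_unix_nano"]
```

With the pre-fix behavior, an empty interval would leave the previous collection time untouched, producing the stale StartTimeUnixNano reported in this issue.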

Please give it a try :v:

ocelotl avatar Jun 27 '24 19:06 ocelotl

> This is expected, the start time of a DELTA temporality collected exponential histogram data point is the value of the collection time of the previous DELTA temporality collected exponential histogram data point.

I disagree. The supplementary guidelines have a delta example which I believe covers this situation. Notice how the start / end time of the 3rd export is (T2, T3] despite there being no measurements in the previous interval. If your interpretation was correct, we would see the 3rd export have a start / end time of (T1, T3].

jack-berg avatar Jun 28 '24 22:06 jack-berg

> > This is expected, the start time of a DELTA temporality collected exponential histogram data point is the value of the collection time of the previous DELTA temporality collected exponential histogram data point.
>
> I disagree. The supplementary guidelines have a delta example which I believe covers this situation. Notice how the start / end time of the 3rd export is (T2, T3] despite there being no measurements in the previous interval. If your interpretation was correct, we would see the 3rd export have a start / end time of (T1, T3].

Please take a look here

ocelotl avatar Jun 28 '24 23:06 ocelotl

> @alexchowle I have created a branch that I think should fix your issue: #4002
>
> I am just adding a commit so that the previous collection time gets updated with every collection regardless of aggregate being called between collection cycles or not.
>
> Please give it a try ✌️

I've just tried it. This looks very promising:

Observation # 1

2024-06-29T10:12:33.955Z        info    ResourceMetrics #0
Resource SchemaURL: 
Resource attributes:
     -> telemetry.sdk.language: Str(python)
     -> telemetry.sdk.name: Str(opentelemetry)
     -> telemetry.sdk.version: Str(1.26.0.dev0)
     -> service.name: Str(otelworkbench)
     -> service.version: Str(0.1.0)
ScopeMetrics #0
ScopeMetrics SchemaURL: 
InstrumentationScope __main__ 
Metric #0
Descriptor:
     -> Name: my_histogram
     -> Description: just a histogram
     -> Unit: s
     -> DataType: ExponentialHistogram
     -> AggregationTemporality: Delta
ExponentialHistogramDataPoints #0
StartTimestamp: 2024-06-29 10:11:33.902642238 +0000 UTC
Timestamp: 2024-06-29 10:12:33.897396523 +0000 UTC
Count: 1
Sum: 1.000000
Min: 1.000000
Max: 1.000000
Bucket [-1.000001, -1.000000), Count: 0
Bucket (0.999999, 1.000000], Count: 1
        {"kind": "exporter", "data_type": "metrics", "name": "debug"}

Observation # 2

2024-06-29T10:16:34.014Z        info    MetricsExporter {"kind": "exporter", "data_type": "metrics", "name": "debug", "resource metrics": 1, "metrics": 1, "data points": 1}
2024-06-29T10:16:34.014Z        info    ResourceMetrics #0
Resource SchemaURL: 
Resource attributes:
     -> telemetry.sdk.language: Str(python)
     -> telemetry.sdk.name: Str(opentelemetry)
     -> telemetry.sdk.version: Str(1.26.0.dev0)
     -> service.name: Str(otelworkbench)
     -> service.version: Str(0.1.0)
ScopeMetrics #0
ScopeMetrics SchemaURL: 
InstrumentationScope __main__ 
Metric #0
Descriptor:
     -> Name: my_histogram
     -> Description: just a histogram
     -> Unit: s
     -> DataType: ExponentialHistogram
     -> AggregationTemporality: Delta
ExponentialHistogramDataPoints #0
StartTimestamp: 2024-06-29 10:16:33.904149745 +0000 UTC
Timestamp: 2024-06-29 10:16:34.00260972 +0000 UTC
Count: 1
Sum: 2.000000
Min: 2.000000
Max: 2.000000
Bucket [-1.000001, -1.000000), Count: 0
Bucket (1.999999, 2.000000], Count: 1
        {"kind": "exporter", "data_type": "metrics", "name": "debug"}

alexchowle avatar Jun 29 '24 10:06 alexchowle