Multiple system metrics return negative values when using DELTA temporality.
Describe your environment
OS: CentOS
Python version: 3.9.16
Package version: OTEL 1.25.0
What happened?
The following metrics are reported with negative values when the exporter is configured to use DELTA temporality.
- system.network.connections
- process.runtime.cpython.gc_count
- process.runtime.cpython.memory
- process.runtime.cpython.thread_count
Steps to Reproduce
from typing import Dict

from opentelemetry.instrumentation.system_metrics import SystemMetricsInstrumentor
from opentelemetry.metrics import (
    Counter, Histogram, ObservableCounter, ObservableGauge,
    ObservableUpDownCounter, UpDownCounter, set_meter_provider,
)
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.metrics.export import (
    AggregationTemporality, ConsoleMetricExporter, PeriodicExportingMetricReader,
)

# Request DELTA temporality for every instrument type.
temporality_dict: Dict[type, AggregationTemporality] = {}
for typ in [Counter, UpDownCounter, ObservableCounter,
            ObservableUpDownCounter, ObservableGauge, Histogram]:
    temporality_dict[typ] = AggregationTemporality.DELTA

set_meter_provider(
    MeterProvider(
        metric_readers=[PeriodicExportingMetricReader(
            ConsoleMetricExporter(preferred_temporality=temporality_dict))],
        resource=resource,  # resource is defined elsewhere
    )
)
SystemMetricsInstrumentor().instrument()
Expected Result
Metrics are generated with non-negative values.
Actual Result
Metrics are generated with negative values.
Additional context
Sep 11 03:35:10: Descriptor:
Sep 11 03:35:10: -> Name: process.runtime.cpython.memory
Sep 11 03:35:10: -> Description: Runtime cpython memory
Sep 11 03:35:10: -> Unit: bytes
Sep 11 03:35:10: -> DataType: Sum
Sep 11 03:35:10: -> IsMonotonic: false
Sep 11 03:35:10: -> AggregationTemporality: Delta
Sep 11 03:35:10: NumberDataPoints #0
Sep 11 03:35:10: Data point attributes:
Sep 11 03:35:10: -> type: Str(rss)
Sep 11 03:35:10: StartTimestamp: 2024-09-11 03:34:09.999252322 +0000 UTC
Sep 11 03:35:10: Timestamp: 2024-09-11 03:35:10.010880332 +0000 UTC
Sep 11 03:35:10: Value: -1081344
Sep 11 03:35:10: NumberDataPoints #1
Sep 11 03:35:10: Data point attributes:
Sep 11 03:35:10: -> type: Str(vms)
Sep 11 03:35:10: StartTimestamp: 2024-09-11 03:34:09.999252322 +0000 UTC
Sep 11 03:35:10: Timestamp: 2024-09-11 03:35:10.010880332 +0000 UTC
Sep 11 03:35:10: Value: -1048576
Would you like to implement a fix?
Yes
It seems some system metrics are not created with the proper instrument type. For example, process.runtime.cpython.gc_count uses an observable_counter, which is monotonic and cumulative, but according to the Python documentation an observable_gauge should be the right instrument to use:
gc.get_count() — Return the current collection counts as a tuple of (count0, count1, count2).
I'm working on a PR to fix this issue. If the above assumption is not right, please let me know.
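As a quick sanity check on why this reading is non-monotonic, here is a minimal sketch: the per-generation counts from gc.get_count() drop back toward zero whenever a collection runs, so successive observations can decrease, and a DELTA export of those observations can go negative.

```python
import gc

before = gc.get_count()  # pending-object counts per generation, e.g. (121, 5, 2)
gc.collect()             # a full collection resets the generation counters
after = gc.get_count()

# Right after a full collection, the gen-1 and gen-2 counters are back at
# zero, so the observed tuple can be lower than the previous observation.
print(before, "->", after)
```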
Update from https://github.com/open-telemetry/opentelemetry-python-contrib/pull/2865#issuecomment-2379723941: this is working as intended, but we should write down the full resolution.
Summary
After discussions with the maintainers, it was decided to retain the current implementation using UpDownCounter for recording the specified metrics, unless a concrete example demonstrates that its use is inappropriate.
The rationale behind using UpDownCounter is that the metrics are additive.
Asynchronous UpDownCounter is an asynchronous Instrument which reports additive value(s) (e.g. the process heap size - it makes sense to report the heap size from multiple processes and sum them up, so we get the total heap usage) when the instrument is being observed.
One example is that on Kubernetes, kube-apiserver can collect metrics from nodes and sum them up to get the aggregate number of network connections.
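To make the "working as intended" behavior concrete, here is a minimal sketch of how DELTA temporality derives the exported value from successive observations of a non-monotonic stream. The observation values are hypothetical, picked so the delta matches the -1081344 in the log output above:

```python
# Two successive observations from an ObservableUpDownCounter callback,
# e.g. process RSS shrinking between two collection intervals
# (hypothetical values chosen to reproduce the exported -1081344 above).
observations = [3_250_000, 2_168_656]

# With DELTA temporality the SDK exports the difference between consecutive
# observations, not the observed value itself.
deltas = [curr - prev for prev, curr in zip(observations, observations[1:])]
print(deltas)  # [-1081344] -- a shrinking value legitimately yields a negative delta
```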
Solution
To report correct metric values to a metrics system that only accepts DELTA metrics (e.g. CloudWatch Metrics), instead of changing the metric instrument types, I was advised to add views using LastValueAggregation to work around this issue, and it works for me. From the result's perspective, LastValueAggregation effectively turns the metric stream into a gauge carrying the observed value.
The following code snippet is what I did for the "problematic" metrics:
from opentelemetry.sdk.metrics.view import LastValueAggregation, View

system_metrics_scope_name = "opentelemetry.instrumentation.system_metrics"

# Re-aggregate the non-monotonic instruments with LastValueAggregation so
# the exported stream carries the observed value instead of a delta.
views = [
    View(
        instrument_name=name,
        meter_name=system_metrics_scope_name,
        aggregation=LastValueAggregation(),
    )
    for name in (
        "system.network.connections",
        "process.open_file_descriptor.count",
        "process.runtime.*.memory",
        "process.runtime.*.gc_count",
        "process.runtime.*.thread_count",
    )
]
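For completeness, a minimal sketch of wiring such views into the provider alongside the DELTA-preferring reader. The view shown is one of the ones above; the rest of the setup mirrors the reproduction snippet and is a configuration sketch, not the exact code from the PR discussion:

```python
from opentelemetry.metrics import set_meter_provider
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.metrics.export import (
    ConsoleMetricExporter, PeriodicExportingMetricReader,
)
from opentelemetry.sdk.metrics.view import LastValueAggregation, View

views = [
    View(
        instrument_name="process.runtime.*.memory",
        meter_name="opentelemetry.instrumentation.system_metrics",
        aggregation=LastValueAggregation(),
    ),
    # ... the remaining views from the snippet above
]

reader = PeriodicExportingMetricReader(ConsoleMetricExporter())
# Views are passed to the provider, which applies them to matching instruments.
set_meter_provider(MeterProvider(metric_readers=[reader], views=views))
```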
Closing this issue.