opentelemetry-python-contrib

Multiple system metrics return negative values when using DELTA temporality.

Open bjrara opened this issue 1 year ago • 2 comments

Describe your environment

OS: CentOS
Python version: 3.9.16
Package version: OTEL 1.25.0

What happened?

The following metrics are reported with negative values when the exporter is configured to use DELTA temporality.

  • system.network.connections
  • process.runtime.cpython.gc_count
  • process.runtime.cpython.memory
  • process.runtime.cpython.thread_count

Steps to Reproduce

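from typing import Dict

from opentelemetry.instrumentation.system_metrics import SystemMetricsInstrumentor
from opentelemetry.metrics import set_meter_provider
from opentelemetry.sdk.metrics import (
    Counter, Histogram, MeterProvider, ObservableCounter,
    ObservableGauge, ObservableUpDownCounter, UpDownCounter,
)
from opentelemetry.sdk.metrics.export import (
    AggregationTemporality, ConsoleMetricExporter, PeriodicExportingMetricReader,
)
from opentelemetry.sdk.resources import Resource

# Request DELTA temporality for every instrument type.
temporality_dict: Dict[type, AggregationTemporality] = {}
for typ in [
    Counter,
    UpDownCounter,
    ObservableCounter,
    ObservableUpDownCounter,
    ObservableGauge,
    Histogram,
]:
    temporality_dict[typ] = AggregationTemporality.DELTA

resource = Resource.create()  # placeholder: the original resource attributes were not shown
set_meter_provider(MeterProvider([PeriodicExportingMetricReader(ConsoleMetricExporter(preferred_temporality=temporality_dict))], resource=resource))
SystemMetricsInstrumentor().instrument()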

Expected Result

Metrics are generated with non-negative values.

Actual Result

Metrics are generated with negative values.

Additional context

Sep 11 03:35:10: Descriptor:
Sep 11 03:35:10:      -> Name: process.runtime.cpython.memory
Sep 11 03:35:10:      -> Description: Runtime cpython memory
Sep 11 03:35:10:      -> Unit: bytes
Sep 11 03:35:10:      -> DataType: Sum
Sep 11 03:35:10:      -> IsMonotonic: false
Sep 11 03:35:10:      -> AggregationTemporality: Delta
Sep 11 03:35:10: NumberDataPoints #0
Sep 11 03:35:10: Data point attributes:
Sep 11 03:35:10:      -> type: Str(rss)
Sep 11 03:35:10: StartTimestamp: 2024-09-11 03:34:09.999252322 +0000 UTC
Sep 11 03:35:10: Timestamp: 2024-09-11 03:35:10.010880332 +0000 UTC
Sep 11 03:35:10: Value: -1081344
Sep 11 03:35:10: NumberDataPoints #1
Sep 11 03:35:10: Data point attributes:
Sep 11 03:35:10:      -> type: Str(vms)
Sep 11 03:35:10: StartTimestamp: 2024-09-11 03:34:09.999252322 +0000 UTC
Sep 11 03:35:10: Timestamp: 2024-09-11 03:35:10.010880332 +0000 UTC
Sep 11 03:35:10: Value: -1048576

Would you like to implement a fix?

Yes

bjrara, Sep 11 '24 04:09

It seems some system metrics are not created with the proper instrument. For example, process.runtime.gc_count uses observable_counter, which is monotonic and cumulative, but according to the Python documentation, observable_gauge should be the right one to use.

gc.get_count() Return the current collection counts as a tuple of (count0, count1, count2).
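
To illustrate why a monotonic counter may not fit this value, here is a minimal sketch using only the standard library. It assumes CPython, and the variable names are hypothetical. It samples the first element of gc.get_count() after allocating tracked objects and again after a collection, where the count is expected to drop.

```python
import gc

gc.disable()  # avoid automatic collections while sampling
gc.collect()  # start from a freshly reset generation-0 count

# Allocate container objects that the collector tracks.
tracked = [[] for _ in range(5000)]
count_after_alloc = gc.get_count()[0]

gc.collect()  # a full collection resets the counts
count_after_collect = gc.get_count()[0]

gc.enable()
print(count_after_alloc, count_after_collect)
```

Because the value can go down between observations, reporting it through a monotonic instrument would be inconsistent with how the value behaves.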

bjrara, Sep 11 '24 17:09

I'm working on a PR to fix this issue. If the previous assumption is not right, please let me know.

bjrara, Sep 11 '24 17:09

Update from https://github.com/open-telemetry/opentelemetry-python-contrib/pull/2865#issuecomment-2379723941: this is working as intended, but we should write down the full resolution.

aabmass, Sep 30 '24 14:09

Summary

After discussions with the maintainers, it was decided to retain the current implementation using UpDownCounter for recording the specified metrics, unless a concrete example demonstrates that its use is inappropriate.

The rationale behind using UpDownCounter is that the metrics are additive.

Asynchronous UpDownCounter is an asynchronous Instrument which reports additive value(s) (e.g. the process heap size - it makes sense to report the heap size from multiple processes and sum them up, so we get the total heap usage) when the instrument is being observed.

One example is that on Kubernetes, kube-apiserver can collect metrics from the nodes and sum them up to get the aggregated number of network connections.
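
The negative values themselves follow from how a non-monotonic sum is expressed with DELTA temporality: each exported point is the difference between two successive observations. The sketch below is plain Python, not the SDK's code. The helper name and the sample values are hypothetical, and treating the first observation as a delta from zero is an assumption of this sketch.

```python
def to_deltas(observations):
    """Convert successive observed values into per-interval deltas."""
    deltas = []
    previous = 0  # assumption: the first observation is reported relative to zero
    for value in observations:
        deltas.append(value - previous)
        previous = value
    return deltas


# Hypothetical RSS readings in bytes: memory grows, then shrinks again.
rss_observations = [52_000_000, 53_081_344, 52_000_000]
rss_deltas = to_deltas(rss_observations)
print(rss_deltas)
```

The last delta is negative because the observed value decreased. The deltas still sum to the latest observation, which is why the behavior is consistent for an additive instrument even though a backend that expects non-negative deltas may not handle it.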

Solution

Some metric systems, e.g. CloudWatch Metrics, accept only DELTA metrics. To report correct values to such a system without changing the metric instrument types, I was advised to add views using LastValueAggregation as a workaround, and it works for me. In terms of the result, LastValueAggregation effectively turns the instrument into a gauge that reports the observed value.

The following code snippet is what I did for the "problematic" metrics:

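    from opentelemetry.sdk.metrics.view import LastValueAggregation, View

    views = []  # assumed to be passed later as MeterProvider(views=views)
    system_metrics_scope_name = "opentelemetry.instrumentation.system_metrics"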
    views.append(
        View(
            instrument_name="system.network.connections",
            meter_name=system_metrics_scope_name,
            aggregation=LastValueAggregation(),
        )
    )
    views.append(
        View(
            instrument_name="process.open_file_descriptor.count",
            meter_name=system_metrics_scope_name,
            aggregation=LastValueAggregation(),
        )
    )
    views.append(
        View(
            instrument_name="process.runtime.*.memory",
            meter_name=system_metrics_scope_name,
            aggregation=LastValueAggregation(),
        )
    )
    views.append(
        View(
            instrument_name="process.runtime.*.gc_count",
            meter_name=system_metrics_scope_name,
            aggregation=LastValueAggregation(),
        )
    )
    views.append(
        View(
            instrument_name="process.runtime.*.thread_count",
            meter_name=system_metrics_scope_name,
            aggregation=LastValueAggregation(),
        )
    )

bjrara, Oct 01 '24 18:10

Close this issue

bjrara, Oct 01 '24 18:10