
OTLP exporters don't respect timeout

Open aabmass opened this issue 3 years ago • 5 comments

Discussed further in https://github.com/open-telemetry/opentelemetry-python/issues/2402#issuecomment-1116875919

OTLP exporter as of today

  • try to export, with a ~~30~~ 10 second timeout, fail and wait 2 seconds
  • try to export, with a ~~30~~ 10 second timeout, fail and wait 4 seconds
  • try to export, with a ~~30~~ 10 second timeout, fail and wait 8 seconds
  • try to export, with a ~~30~~ 10 second timeout, fail and wait 16 seconds
  • try to export, with a ~~30~~ 10 second timeout, fail and wait 32 seconds
  • try to export, with a ~~30~~ 10 second timeout, fail and wait 64 seconds and finally give up
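Summing the list above shows how far the worst case overshoots any deadline (a back-of-the-envelope sketch, not the actual exporter code; the 10 s per-attempt timeout and backoff values are taken straight from the list):

```python
# Per the attempt list above: 6 attempts, each with a ~10 s request timeout,
# and exponential backoff sleeps of 2..64 s in between. Neither the attempts
# nor the sleeps are charged against any overall deadline.
ATTEMPT_TIMEOUT_S = 10
BACKOFFS_S = [2, 4, 8, 16, 32, 64]

worst_case_s = len(BACKOFFS_S) * ATTEMPT_TIMEOUT_S + sum(BACKOFFS_S)
print(worst_case_s)  # 186 -- far beyond a 30 s BSP export deadline
```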

Ideally, the BSP should cancel the export call after 30 s regardless of the exporter's internal retry handling. But it doesn't do that, even with a 30 s export timeout (in milliseconds) passed via env var or argument.

aabmass avatar May 05 '22 22:05 aabmass

Moving some discussion from the PR here:

    def _export(
        self,
        data: TypingSequence[SDKDataT],
        timeout_millis: Optional[float] = None,
    ) -> ExportResultT:
        if timeout_millis is not None:
            timeout_seconds = timeout_millis / 10**3
        else:
            timeout_seconds = self._timeout

Help me understand this. It is generally self._timeout, except when we want to finish work as early as possible (force_flush or shutdown), in which case we pass the timeout_millis from the processor/PeriodicExportingMetricReader? From our discussion last week, IIRC you said the timeout configured directly on the exporter (via constructor or environment variables) should apply to individual requests, potentially within a series of retries.

cc @srikanthccv

I checked the Go impl (where this is a common pattern) and it actually does something different.

  • If the exporter's config timeout (self._timeout for us) is present, it uses that.
    • Note: if that is longer than the context's timeout, it will actually use the shorter of the two
  • If not present, it just uses the parent context
  • It doesn't have a different context for individual requests vs retries (retry logic here)
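In Python terms, the Go behavior above amounts to taking the shorter of the exporter's configured timeout and whatever budget the caller's context has left (a sketch; `effective_timeout_s` and its parameters are hypothetical names, not SDK API):

```python
from typing import Optional

def effective_timeout_s(
    exporter_timeout_s: Optional[float],   # self._timeout, if configured
    remaining_context_s: Optional[float],  # seconds left on the caller's deadline
) -> Optional[float]:
    # Mirror of the Go behavior described above: prefer the exporter's
    # configured timeout, but never exceed the caller's context deadline.
    if exporter_timeout_s is None:
        return remaining_context_s
    if remaining_context_s is None:
        return exporter_timeout_s
    return min(exporter_timeout_s, remaining_context_s)

print(effective_timeout_s(10.0, 30.0))  # 10.0
print(effective_timeout_s(30.0, 10.0))  # 10.0 -- shorter context wins
print(effective_timeout_s(None, 15.0))  # 15.0 -- falls back to the context
```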

This is really confusing to me because DEADLINE_EXCEEDED is retryable, but it will never happen? I'm not really sure what the intention of the spec is. Maybe a few examples can help tease out the details:

BSP has a longer timeout

  • user sets OTEL_EXPORTER_OTLP_TIMEOUT=10000
  • they set OTEL_BSP_EXPORT_TIMEOUT=30000

Does the export abort after 10 s, or should it retry until 30 s have elapsed? What if the user calls force_flush with a 15 s timeout?

BSP has a shorter timeout

  • user sets OTEL_EXPORTER_OTLP_TIMEOUT=30000
  • they set OTEL_BSP_EXPORT_TIMEOUT=10000

Presumably, the BSP should win out here.

User passes a longer timeout than either

  • user sets OTEL_EXPORTER_OTLP_TIMEOUT=10000
  • they set OTEL_BSP_EXPORT_TIMEOUT=10000

What if the user calls force_flush with a 30 s timeout? Should we wait 30 s, or respect the shorter configured BSP/OTLP options? Maybe there is an argument for making the timeout params Optional, to indicate no deadline from the parent call.
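One way to make DEADLINE_EXCEEDED meaningful again is a deadline-aware retry loop, where every attempt and every backoff sleep is charged against a single overall budget. This is a sketch under assumed names (`send`, `TransientError`, and the return strings are hypothetical, not SDK API):

```python
import time

class TransientError(Exception):
    """Stand-in for a retryable export error (e.g. gRPC UNAVAILABLE)."""

def export_with_deadline(send, overall_timeout_s, per_request_s,
                         initial_backoff_s=1.0):
    # Charge every attempt and every sleep against one overall deadline,
    # instead of letting retries and backoff run unbounded.
    deadline = time.monotonic() + overall_timeout_s
    backoff = initial_backoff_s
    while True:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            return "FAILURE"  # overall deadline exceeded
        try:
            # Each attempt gets at most the remaining overall budget.
            send(timeout=min(per_request_s, remaining))
            return "SUCCESS"
        except TransientError:
            if time.monotonic() + backoff >= deadline:
                return "FAILURE"  # no budget left for another attempt
            time.sleep(backoff)
            backoff *= 2
```

With this shape, a force_flush deadline or OTEL_BSP_EXPORT_TIMEOUT could simply shrink overall_timeout_s, and the shorter of the configured values would naturally win.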

aabmass avatar May 06 '22 03:05 aabmass

Any update on this? Not sure if it's the same bug, but I set the timeout to 1 s and it keeps trying to reach the (nonexistent) endpoint:

WARNING - 2024-02-10 00:16:51,218 - opentelemetry.exporter.otlp.proto.grpc.exporter - exporter - Transient error StatusCode.UNAVAILABLE encountered while exporting traces to foo, retrying in 1s.
WARNING - 2024-02-10 00:16:52,227 - opentelemetry.exporter.otlp.proto.grpc.exporter - exporter - Transient error StatusCode.UNAVAILABLE encountered while exporting traces to foo, retrying in 2s.
WARNING - 2024-02-10 00:16:54,233 - opentelemetry.exporter.otlp.proto.grpc.exporter - exporter - Transient error StatusCode.UNAVAILABLE encountered while exporting traces to foo, retrying in 4s.
WARNING - 2024-02-10 00:16:58,238 - opentelemetry.exporter.otlp.proto.grpc.exporter - exporter - Transient error StatusCode.UNAVAILABLE encountered while exporting traces to foo, retrying in 8s.
WARNING - 2024-02-10 00:17:06,245 - opentelemetry.exporter.otlp.proto.grpc.exporter - exporter - Transient error StatusCode.UNAVAILABLE encountered while exporting traces to foo, retrying in 16s.
WARNING - 2024-02-10 00:17:22,255 - opentelemetry.exporter.otlp.proto.grpc.exporter - exporter - Transient error StatusCode.UNAVAILABLE encountered while exporting traces to foo, retrying in 32s.

gsakkis avatar Feb 09 '24 22:02 gsakkis

This behavior is EXTREMELY problematic when the collector has worse performance than the application, which for many is acceptable; after all, this is telemetry data.

Especially when we are talking about Lambda, with sub-second invocations and short timeouts.

The current workaround I see for autoinstrumentation is to patch the retries out completely during the cold start:

from opentelemetry.metrics._internal import get_meter_provider

meter_provider = get_meter_provider()  # returns the global MeterProvider
# Reach into private internals to disable retries on the first metric
# reader's exporter; brittle, but avoids minutes of backoff on cold start.
meter_provider._sdk_config.metric_readers[0]._exporter._retryable = lambda _: False

havrylenkok avatar Mar 19 '24 21:03 havrylenkok