OTLP exporters don't respect timeout
Discussed further in https://github.com/open-telemetry/opentelemetry-python/issues/2402#issuecomment-1116875919
OTLP exporter as of today:
- try to export, with a ~~30~~ 10 second timeout, fail and wait 2 seconds
- try to export, with a ~~30~~ 10 second timeout, fail and wait 4 seconds
- try to export, with a ~~30~~ 10 second timeout, fail and wait 8 seconds
- try to export, with a ~~30~~ 10 second timeout, fail and wait 16 seconds
- try to export, with a ~~30~~ 10 second timeout, fail and wait 32 seconds
- try to export, with a ~~30~~ 10 second timeout, fail, wait 64 seconds, and finally give up
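The schedule above can be sanity-checked with a quick calculation (the 10 second per-attempt timeout and the backoff values are taken straight from the list; nothing else is assumed):

```python
ATTEMPT_TIMEOUT_S = 10               # per-attempt timeout (previously 30 s)
BACKOFFS_S = [2, 4, 8, 16, 32, 64]   # sleep after each failed attempt

def worst_case_export_seconds():
    # Each failed attempt costs its timeout plus the backoff sleep that follows.
    return sum(ATTEMPT_TIMEOUT_S + b for b in BACKOFFS_S)

print(worst_case_export_seconds())  # 186
```

So a single failing batch can block for over three minutes before the exporter gives up.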
In an ideal scenario the BSP should cancel the export call after 30 s, irrespective of the exporter's retry handling. But it doesn't do that, even with a 30 s export timeout (in milliseconds) passed via env var/argument.
Moving some discussion from the PR here:
```python
def _export(
    self,
    data: TypingSequence[SDKDataT],
    timeout_millis: Optional[float] = None,
) -> ExportResultT:
    if timeout_millis is not None:
        timeout_seconds = timeout_millis / 10**3
    else:
        timeout_seconds = self._timeout
```

Help me understand this. It is generally `self._timeout`, except when we want to finish work as early as possible (`force_flush` or `shutdown`), in which case we pass the `timeout_millis` from the processor/`PeriodicExportingMetricReader`? From our discussion last week, IIRC you said the timeout configured directly on the exporter (via constructor or environment variables) should apply to individual requests, potentially within a series of retries.
cc @srikanthccv
I checked the Go impl (where this is a common pattern) and it actually does something different.
- If the exporter's config timeout (`self._timeout` for us) is present, it uses that.
  - Note: if that is longer than the context's timeout, it will actually use the shorter one
- If not present, it uses just the parent context's deadline
- It doesn't have a different context for individual requests vs. retries (retry logic here)
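The Go resolution described above can be sketched as a pure function (names are illustrative; `config_timeout_s` and `parent_deadline` are monotonic-clock seconds, with `None` meaning "not set"):

```python
from typing import Optional

def resolve_deadline(config_timeout_s: Optional[float],
                     parent_deadline: Optional[float],
                     now: float) -> Optional[float]:
    # Prefer the exporter's configured timeout, but never exceed the
    # parent context's deadline; with neither set, there is no deadline.
    candidates = []
    if config_timeout_s is not None:
        candidates.append(now + config_timeout_s)
    if parent_deadline is not None:
        candidates.append(parent_deadline)
    return min(candidates) if candidates else None

resolve_deadline(10.0, 100.0, now=0.0)  # -> 10.0 (config wins, it is shorter)
resolve_deadline(30.0, 15.0, now=0.0)   # -> 15.0 (context deadline wins)
```

Under this scheme every retry shares one deadline, which is why a per-attempt `DEADLINE_EXCEEDED` effectively never leaves budget for a retry.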
This is really confusing to me because DEADLINE_EXCEEDED is retryable but it will never happen? I'm not really sure what the intention of the spec is. Maybe a few examples can help tease out the details:
BSP has a longer timeout
- user sets `OTEL_EXPORTER_OTLP_TIMEOUT=10000`
- they set `OTEL_BSP_EXPORT_TIMEOUT=30000`

Does the export abort after 10 s, or should it retry until 30 s have passed? What if the user calls force flush with a 15 s timeout?
BSP has a shorter timeout
- user sets `OTEL_EXPORTER_OTLP_TIMEOUT=30000`
- they set `OTEL_BSP_EXPORT_TIMEOUT=10000`

Presumably the BSP should win out here.
User passes a longer timeout than either
- user sets `OTEL_EXPORTER_OTLP_TIMEOUT=10000`
- they set `OTEL_BSP_EXPORT_TIMEOUT=10000`

What if the user calls force flush with a 30 s timeout? Should we wait 30 s, or respect the shorter configured BSP/OTLP options? Maybe there is an argument for making the timeout params `Optional` to indicate no deadline from the parent call.
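One possible reading of the three scenarios (purely illustrative; the spec doesn't currently mandate this) is to treat every timeout that is actually set as a deadline and take the strictest one:

```python
def effective_timeout_millis(exporter_timeout, bsp_timeout, call_timeout=None):
    # Take the smallest of whatever timeouts are present;
    # None means "no deadline from that layer".
    timeouts = [t for t in (exporter_timeout, bsp_timeout, call_timeout)
                if t is not None]
    return min(timeouts) if timeouts else None

# Scenario 1: exporter 10 s, BSP 30 s   -> export aborts after 10 s
print(effective_timeout_millis(10000, 30000))         # 10000
# Scenario 2: exporter 30 s, BSP 10 s   -> BSP wins
print(effective_timeout_millis(30000, 10000))         # 10000
# Scenario 3: both 10 s, force_flush(30 s) -> shorter config still wins
print(effective_timeout_millis(10000, 10000, 30000))  # 10000
```

This matches the Go behavior for two layers, but whether a longer caller-supplied `force_flush` timeout should be able to *extend* the deadline is exactly the open question.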
Any update on this? Not sure if it's the same bug, but I set the timeout to 1 second and it keeps trying to reach the (nonexistent) endpoint:
WARNING - 2024-02-10 00:16:51,218 - opentelemetry.exporter.otlp.proto.grpc.exporter - exporter - Transient error StatusCode.UNAVAILABLE encountered while exporting traces to foo, retrying in 1s.
WARNING - 2024-02-10 00:16:52,227 - opentelemetry.exporter.otlp.proto.grpc.exporter - exporter - Transient error StatusCode.UNAVAILABLE encountered while exporting traces to foo, retrying in 2s.
WARNING - 2024-02-10 00:16:54,233 - opentelemetry.exporter.otlp.proto.grpc.exporter - exporter - Transient error StatusCode.UNAVAILABLE encountered while exporting traces to foo, retrying in 4s.
WARNING - 2024-02-10 00:16:58,238 - opentelemetry.exporter.otlp.proto.grpc.exporter - exporter - Transient error StatusCode.UNAVAILABLE encountered while exporting traces to foo, retrying in 8s.
WARNING - 2024-02-10 00:17:06,245 - opentelemetry.exporter.otlp.proto.grpc.exporter - exporter - Transient error StatusCode.UNAVAILABLE encountered while exporting traces to foo, retrying in 16s.
WARNING - 2024-02-10 00:17:22,255 - opentelemetry.exporter.otlp.proto.grpc.exporter - exporter - Transient error StatusCode.UNAVAILABLE encountered while exporting traces to foo, retrying in 32s.
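For what it's worth, the backoff sleeps alone in the log above add up to more than a minute, despite the 1 second configured timeout:

```python
# Backoff intervals observed in the log (seconds), despite timeout=1s.
observed_backoffs_s = [1, 2, 4, 8, 16, 32]
print(sum(observed_backoffs_s))  # 63 seconds of sleeping, before counting the attempts
```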
This behavior is EXTREMELY problematic when the collector performs worse than the application, which for many is acceptable (after all, this is only telemetry data).
Especially when we are talking about Lambda, with sub-second invocations and short timeouts.
The current workaround I see for autoinstrumentation is to patch the retries out completely during the cold start:

```python
from opentelemetry.metrics._internal import get_meter_provider

meter_provider = get_meter_provider()  # should return the global object
# Reach into private internals to disable retries entirely
# (fragile: depends on private, version-specific attributes):
meter_provider._sdk_config.metric_readers[0]._exporter._retryable = lambda _: False
```