OTLP exporters don't respect timeout
Discussed further in https://github.com/open-telemetry/opentelemetry-python/issues/2402#issuecomment-1116875919
OTLP exporter as of today:
- try to export, with a ~~30~~ 10 second timeout, fail and wait 2 seconds
- try to export, with a ~~30~~ 10 second timeout, fail and wait 4 seconds
- try to export, with a ~~30~~ 10 second timeout, fail and wait 8 seconds
- try to export, with a ~~30~~ 10 second timeout, fail and wait 16 seconds
- try to export, with a ~~30~~ 10 second timeout, fail and wait 32 seconds
- try to export, with a ~~30~~ 10 second timeout, fail, wait 64 seconds, and finally give up
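The schedule above can be sanity-checked with a quick calculation (the 10 second per-attempt timeout and the backoff values are taken straight from the list; nothing else is assumed):

```python
ATTEMPT_TIMEOUT_S = 10               # per-attempt timeout (previously 30 s)
BACKOFFS_S = [2, 4, 8, 16, 32, 64]   # sleep after each failed attempt

def worst_case_export_seconds():
    # Each failed attempt costs its timeout plus the backoff sleep that follows.
    return sum(ATTEMPT_TIMEOUT_S + b for b in BACKOFFS_S)

print(worst_case_export_seconds())  # 186
```

So a single failing batch can block for over three minutes before the exporter gives up.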
In an ideal scenario the BSP should cancel the export call after 30 s, irrespective of the exporter's retry handling. But it doesn't do that, even with a 30 s export timeout (in milliseconds) passed via env var/argument.
Moving some discussion from the PR here:
```python
def _export(
    self,
    data: TypingSequence[SDKDataT],
    timeout_millis: Optional[float] = None,
) -> ExportResultT:
    if timeout_millis is not None:
        timeout_seconds = timeout_millis / 10**3
    else:
        timeout_seconds = self._timeout
```

Help me understand this. It is generally `self._timeout`, except when we want to finish work as early as possible (`force_flush` or `shutdown`), in which case we pass the `timeout_millis` from the processor/`PeriodicExportingMetricReader`? From our discussion last week, IIRC you said the timeout configured directly on the exporter (via constructor or environment variables) should apply to individual requests, potentially within a series of retries.
cc @srikanthccv
I checked the Go impl (where this is a common pattern) and it actually does something different.
- If the exporter's config timeout (`self._timeout` for us) is present, it uses that.
  - Note: if that is longer than the context's timeout, it will actually use the shorter one
- If not present, it uses just the parent context's deadline
- It doesn't have a different context for individual requests vs. retries (retry logic here)
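The Go resolution described above can be sketched as a pure function (names are illustrative; `config_timeout_s` and `parent_deadline` are monotonic-clock seconds, with `None` meaning "not set"):

```python
from typing import Optional

def resolve_deadline(config_timeout_s: Optional[float],
                     parent_deadline: Optional[float],
                     now: float) -> Optional[float]:
    # Prefer the exporter's configured timeout, but never exceed the
    # parent context's deadline; with neither set, there is no deadline.
    candidates = []
    if config_timeout_s is not None:
        candidates.append(now + config_timeout_s)
    if parent_deadline is not None:
        candidates.append(parent_deadline)
    return min(candidates) if candidates else None

resolve_deadline(10.0, 100.0, now=0.0)  # -> 10.0 (config wins, it is shorter)
resolve_deadline(30.0, 15.0, now=0.0)   # -> 15.0 (context deadline wins)
```

Under this scheme every retry shares one deadline, which is why a per-attempt `DEADLINE_EXCEEDED` effectively never leaves budget for a retry.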
This is really confusing to me because DEADLINE_EXCEEDED is retryable but it will never happen? I'm not really sure what the intention of the spec is. Maybe a few examples can help tease out the details:
BSP has a longer timeout
- user sets `OTEL_EXPORTER_OTLP_TIMEOUT=10000`
- they set `OTEL_BSP_EXPORT_TIMEOUT=30000`

Does the export abort after 10 s, or should it retry until 30 s have passed? What if the user calls force flush with a 15 s timeout?
BSP has a shorter timeout
- user sets `OTEL_EXPORTER_OTLP_TIMEOUT=30000`
- they set `OTEL_BSP_EXPORT_TIMEOUT=10000`

Presumably the BSP should win out here.
User passes a longer timeout than either
- user sets `OTEL_EXPORTER_OTLP_TIMEOUT=10000`
- they set `OTEL_BSP_EXPORT_TIMEOUT=10000`

What if the user calls force flush with a 30 s timeout? Should we wait 30 s, or respect the shorter configured BSP/OTLP options? Maybe there is an argument for making the timeout params `Optional` to indicate no deadline from the parent call.
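One possible reading of the three scenarios (purely illustrative; the spec doesn't currently mandate this) is to treat every timeout that is actually set as a deadline and take the strictest one:

```python
def effective_timeout_millis(exporter_timeout, bsp_timeout, call_timeout=None):
    # Take the smallest of whatever timeouts are present;
    # None means "no deadline from that layer".
    timeouts = [t for t in (exporter_timeout, bsp_timeout, call_timeout)
                if t is not None]
    return min(timeouts) if timeouts else None

# Scenario 1: exporter 10 s, BSP 30 s   -> export aborts after 10 s
print(effective_timeout_millis(10000, 30000))         # 10000
# Scenario 2: exporter 30 s, BSP 10 s   -> BSP wins
print(effective_timeout_millis(30000, 10000))         # 10000
# Scenario 3: both 10 s, force_flush(30 s) -> shorter config still wins
print(effective_timeout_millis(10000, 10000, 30000))  # 10000
```

This matches the Go behavior for two layers, but whether a longer caller-supplied `force_flush` timeout should be able to *extend* the deadline is exactly the open question.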
Any update on this? Not sure if it's the same bug, but I set the timeout to 1 second and it keeps trying to reach the (nonexistent) endpoint:
WARNING - 2024-02-10 00:16:51,218 - opentelemetry.exporter.otlp.proto.grpc.exporter - exporter - Transient error StatusCode.UNAVAILABLE encountered while exporting traces to foo, retrying in 1s.
WARNING - 2024-02-10 00:16:52,227 - opentelemetry.exporter.otlp.proto.grpc.exporter - exporter - Transient error StatusCode.UNAVAILABLE encountered while exporting traces to foo, retrying in 2s.
WARNING - 2024-02-10 00:16:54,233 - opentelemetry.exporter.otlp.proto.grpc.exporter - exporter - Transient error StatusCode.UNAVAILABLE encountered while exporting traces to foo, retrying in 4s.
WARNING - 2024-02-10 00:16:58,238 - opentelemetry.exporter.otlp.proto.grpc.exporter - exporter - Transient error StatusCode.UNAVAILABLE encountered while exporting traces to foo, retrying in 8s.
WARNING - 2024-02-10 00:17:06,245 - opentelemetry.exporter.otlp.proto.grpc.exporter - exporter - Transient error StatusCode.UNAVAILABLE encountered while exporting traces to foo, retrying in 16s.
WARNING - 2024-02-10 00:17:22,255 - opentelemetry.exporter.otlp.proto.grpc.exporter - exporter - Transient error StatusCode.UNAVAILABLE encountered while exporting traces to foo, retrying in 32s.
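For what it's worth, the backoff sleeps alone in the log above add up to more than a minute, despite the 1 second configured timeout:

```python
# Backoff intervals observed in the log (seconds), despite timeout=1s.
observed_backoffs_s = [1, 2, 4, 8, 16, 32]
print(sum(observed_backoffs_s))  # 63 seconds of sleeping, before counting the attempts
```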
This behavior is EXTREMELY problematic when the collector performs worse than the application, which for many is acceptable (after all, this is only telemetry data).
Especially when we are talking about Lambda, with sub-second invocations and short timeouts.
The current workaround I see for autoinstrumentation is to patch the retries out completely during the cold start:

```python
from opentelemetry.metrics._internal import get_meter_provider

meter_provider = get_meter_provider()  # should return the global object
# Reach into private internals to disable retries entirely
# (fragile: depends on private, version-specific attributes):
meter_provider._sdk_config.metric_readers[0]._exporter._retryable = lambda _: False
```