fix: retry Get for 500 and 503 errors from GCE metadata server
Currently, a metadata Get is retried only on a transport error, not on a retryable status code.
The GCE metadata documentation suggests retrying on 503. In addition, the GCE metadata server also returns 500 during intermittent unavailability.
If this happens during a token refresh, the intermittent 500 or 503 error is propagated as a RefreshError. RefreshError is not retryable in the python-api-core library, so a single intermittent, retryable error from the GCE metadata server causes the GCP API call to fail.
To mitigate this, I asked for RefreshError to be retried at [1], but the team suggested adding the retry at the auth layer instead [2].
[1] https://github.com/googleapis/python-api-core/issues/312
[2] https://github.com/googleapis/python-api-core/pull/313#issuecomment-978006491
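
For illustration, the retry behavior described above could look roughly like the sketch below. This is not the actual diff in this PR: `Response`, `get_with_retry`, `fetch`, `retry_count`, and `base_delay` are hypothetical names chosen for the example, and the backoff policy is an assumption. It only shows the idea of treating 500 and 503 the same way as a transport error.

```python
import http.client
import time
from typing import Callable, NamedTuple, Optional

# Status codes treated as transient, per the description above.
RETRYABLE_STATUS_CODES = (
    http.client.INTERNAL_SERVER_ERROR,  # 500
    http.client.SERVICE_UNAVAILABLE,  # 503
)


class Response(NamedTuple):
    """Hypothetical minimal response type used only by this sketch."""

    status: int
    data: bytes


def get_with_retry(
    fetch: Callable[[str], Response],
    url: str,
    retry_count: int = 5,
    base_delay: float = 0.0,
) -> Response:
    """Call fetch(url), retrying on transport errors and on 500/503.

    Any other status (including 404) is returned to the caller unchanged.
    """
    last_error: Optional[Exception] = None
    for attempt in range(retry_count):
        try:
            response = fetch(url)
        except ConnectionError as exc:
            # Transport error: this case was already retried before.
            last_error = exc
        else:
            if response.status not in RETRYABLE_STATUS_CODES:
                return response
            # New behavior: a retryable status code is also retried.
            last_error = RuntimeError(f"retryable status {response.status}")
        if attempt < retry_count - 1:
            # Assumed exponential backoff; base_delay=0 disables sleeping.
            time.sleep(base_delay * 2 ** attempt)
    raise RuntimeError(
        f"metadata request failed after {retry_count} attempts"
    ) from last_error
```

With this shape, a single intermittent 503 during token refresh is absorbed by the loop instead of surfacing as a failed API call; only repeated failures reach the caller.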
link with issue #980
#980
@arithmetic1728 #980 is about the token endpoint, while this change addresses retries to the Metadata endpoint.
@arithmetic1728 I think we first need to address #980 and add a Retryable interface; then we can leverage it here to address Metadata retries. Most likely we will opt for passing retryable errors to the client instead of performing actual retries in the library.
@baeminbo Hi, could you please provide any stats on the Metadata service errors that you are trying to mitigate?