Handle transient errors in EF Core when using Cosmos DB
Transient errors can occur when using EF Core with Cosmos DB, and it seems that EF Core currently cannot handle such errors (for example, 503 Service Unavailable). Would it make sense to retry on the transient error codes?
As this document shows, there are several transient error codes (408, 410, 429, 449, and 503) that we could retry on to make the provider more resilient: https://docs.microsoft.com/en-us/azure/cosmos-db/sql/conceptual-resilient-sdk-applications#should-my-application-retry-on-errors
@rmt2021 see the conversation in https://github.com/dotnet/efcore/issues/8443#issuecomment-459969829. As the docs you linked to specify, the error codes you listed should be retried by the SDK, which the EF provider uses. Are you seeing a different behavior?
Thanks for sharing this conversation @roji.
The SDK only retries on these error codes when multiple regions are available. For example, if a 503 occurs and the Cosmos DB account has only one region, no retry happens, as the SDK code shows: https://github.com/Azure/azure-cosmos-dotnet-v3/blob/b713ce4cb3e482175a8a6a9b8fc7051d9c7b5e91/Microsoft.Azure.Cosmos/src/ClientRetryPolicy.cs#L373-L378
I noticed that EF Core already has a code snippet that retries on transient error codes: https://github.com/dotnet/efcore/blob/0d1d602d72fefe14f12a86410ec70394ec8151e0/src/EFCore.Cosmos/Storage/Internal/CosmosExecutionStrategy.cs#L105-L107
It retries on 503 (ServiceUnavailable) and 429 (TooManyRequests).
I believe we should also retry on 408, 410 and 449.
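To make the proposal concrete, here is a minimal sketch of the extended status code check. This is not the actual CosmosExecutionStrategy code: the class TransientStatusCodes and the method IsTransient are hypothetical names used only for illustration, and the list of codes is taken from the docs page linked above. Note that 449 has no named member in HttpStatusCode, so the sketch compares numeric values.

```csharp
using System.Net;

// Hypothetical helper for illustration only; not part of EF Core or the Cosmos SDK.
public static class TransientStatusCodes
{
    // Returns true for the status codes that the linked docs page lists as retryable.
    public static bool IsTransient(HttpStatusCode statusCode)
    {
        switch ((int)statusCode)
        {
            case 408: // Request Timeout
            case 410: // Gone
            case 429: // Too Many Requests (already retried by EF Core today)
            case 449: // Retry With (no named HttpStatusCode member)
            case 503: // Service Unavailable (already retried by EF Core today)
                return true;
            default:
                return false;
        }
    }
}
```

A check like this could presumably be applied to the StatusCode of a caught CosmosException, but exactly where it would fit in the existing strategy is something the maintainers would know better.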