StackExchange.Redis icon indicating copy to clipboard operation
StackExchange.Redis copied to clipboard

RedisTimeoutException after deploying to Azure, recovers only after rebooting client app

Open eggpa opened this issue 3 years ago • 6 comments

When we deploy to Azure we often start seeing StackExchange.Redis.RedisTimeoutException (almost every deployment at least 1 out of 12 instances is down). In very rare cases this might occur after automatic app service restart. This wouldn't be an issue if this was a temporary problem, however Redis client fails to recover and the issue can be fixed only by restarting the service.

Version/service details:

  • StackExchange.Redis 2.5.46
  • ASP.NET web app running on .NET Framework 4.7.2
  • we're using Microsoft.Extensions.Caching.Redis 2.2.0 for caching on Redis
  • web app is deployed on Azure cloud services
  • Azure Cache for Redis with a C0 Standard plan

Azure Cache for Redis metrics:

  • reads and writes way under 1 MB/s and most of the time under 100 kB/s
  • CPU usage is under 5% on average

Here's a shortened list of exceptions we have received:

2022-05-05T08:25:07.7030057Z: Timeout performing EVAL, inst: 1, mgr: Inactive, err: never, queue: 4, qu: 0, qs: 4, qc: 0, wr: 0, wq: 0, in: 0, ar: 0, clientName: redacted, serverEndpoint: Unspecified/redacted.redis.cache.windows.net:6380, keyHashSlot: 1751, IOCP: (Busy=0,Free=1000,Min=8,Max=1000), WORKER: (Busy=2,Free=32765,Min=8,Max=32767)
2022-05-05T08:25:13.9394357Z: Timeout performing EVAL, inst: 2, mgr: Inactive, err: never, queue: 13, qu: 0, qs: 13, qc: 0, wr: 0, wq: 0, in: 0, ar: 0, clientName: redacted, serverEndpoint: Unspecified/redacted.redis.cache.windows.net:6380, keyHashSlot: 1751, IOCP: (Busy=0,Free=1000,Min=8,Max=1000), WORKER: (Busy=3,Free=32764,Min=8,Max=32767)
2022-05-05T08:25:15.8770876Z: Timeout performing EVAL, inst: 1, mgr: Inactive, err: never, queue: 16, qu: 0, qs: 16, qc: 0, wr: 0, wq: 0, in: 0, ar: 0, clientName: redacted, serverEndpoint: Unspecified/redacted.redis.cache.windows.net:6380, keyHashSlot: 1751, IOCP: (Busy=0,Free=1000,Min=8,Max=1000), WORKER: (Busy=2,Free=32765,Min=8,Max=32767)
2022-05-05T08:25:23.7251413Z: Timeout performing EVAL, inst: 1, mgr: Inactive, err: never, queue: 30, qu: 0, qs: 30, qc: 0, wr: 0, wq: 0, in: 0, ar: 0, clientName: redacted, serverEndpoint: Unspecified/redacted.redis.cache.windows.net:6380, keyHashSlot: 1751, IOCP: (Busy=0,Free=1000,Min=8,Max=1000), WORKER: (Busy=4,Free=32763,Min=8,Max=32767)
2022-05-05T08:25:24.1042269Z: Timeout performing EVAL, inst: 1, mgr: Inactive, err: never, queue: 30, qu: 0, qs: 30, qc: 0, wr: 0, wq: 0, in: 0, ar: 0, clientName: redacted, serverEndpoint: Unspecified/redacted.redis.cache.windows.net:6380, keyHashSlot: 15912, IOCP: (Busy=0,Free=1000,Min=8,Max=1000), WORKER: (Busy=3,Free=32764,Min=8,Max=32767)
// same message repeated every few seconds a total of 175 times
2022-05-05T08:35:43.1108319Z: Timeout performing EVAL, inst: 1, mgr: Inactive, err: never, queue: 30, qu: 0, qs: 30, qc: 0, wr: 0, wq: 0, in: 0, ar: 0, clientName: redacted, serverEndpoint: Unspecified/redacted.redis.cache.windows.net:6380, keyHashSlot: 15912, IOCP: (Busy=0,Free=1000,Min=8,Max=1000), WORKER: (Busy=3,Free=32764,Min=8,Max=32767)
// app service was restarted manually

The guidelines provided at https://stackexchange.github.io/StackExchange.Redis/Timeouts do not seem to cover this scenario. Is there a way to fix this problem or perhaps there's something we could do to investigate this issue further?

eggpa avatar May 05 '22 13:05 eggpa

I am also getting TimeoutException after trying to send ~20 GET commands immadiately after service start. After putting 5 sec delay on the service start, it works fine ;o

incloudss avatar May 10 '22 19:05 incloudss

The most curious thing here is why it gets into this state - do you have the first exception thrown when things break down? And the config? The server manager shows as inactive like the thread pool there is crashing...somehow. If you're not using the default thread pool config, that's normal to be inactive, but if you are: that's very bad and explains why no work is happening...but we need to understand how it ate it.

NickCraver avatar May 28 '22 19:05 NickCraver

And the config?

We're calling services.AddDistributedRedisCache(o => o.Configuration = connectionString); where connectionString is redacted.redis.cache.windows.net:6380,password=redacted,ssl=True,abortConnect=False which in turn seems to just call ConnectionMultiplexer.ConnectAsync(connectionString). I guess that means it is using default config.

do you have the first exception thrown when things break down?

We a got a few of these starting at 2022-05-30T11:50:40.5164383Z

StackExchange.Redis.RedisTimeoutException: Timeout performing EVAL, inst: 1, mgr: Inactive, err: never, queue: 4, qu: 0, qs: 4, qc: 0, wr: 0, wq: 0, in: 0, ar: 0, clientName: redacted, serverEndpoint: Unspecified/redacted.redis.cache.windows.net:6380, keyHashSlot: 1751, IOCP: (Busy=0,Free=1000,Min=8,Max=1000), WORKER: (Busy=3,Free=32764,Min=8,Max=32767)

Then on 2022-05-30T11:50:57.4981297Z a more interesting exception was logged

StackExchange.Redis.RedisConnectionException: SocketFailure on EVAL ---> System.IO.IOException: Unable to read data from the transport connection: A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond. ---> System.Net.Sockets.SocketException: A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond
   at System.Net.Sockets.Socket.EndReceive(IAsyncResult asyncResult)
   at System.Net.Sockets.NetworkStream.EndRead(IAsyncResult asyncResult)
   --- End of inner exception stack trace ---
   at System.Net.Security._SslStream.EndRead(IAsyncResult asyncResult)
   at StackExchange.Redis.PhysicalConnection.EndReading(IAsyncResult result) in c:\code\StackExchange.Redis\StackExchange.Redis\StackExchange\Redis\PhysicalConnection.cs:line 858
   --- End of inner exception stack trace ---
   at Microsoft.Extensions.Caching.Redis.RedisExtensions.<HashMemberGetAsync>d__2.MoveNext()
   at Microsoft.Extensions.Caching.Redis.RedisCache.<GetAndRefreshAsync>d__20.MoveNext()
   at Microsoft.Extensions.Caching.Redis.RedisCache.<GetAsync>d__12.MoveNext()
   at Microsoft.Extensions.Caching.Distributed.DistributedCacheExtensions.<GetStringAsync>d__7.MoveNext()

Following this we get many exceptions similar to this one

StackExchange.Redis.RedisConnectionException: No connection is available to service this operation: EVAL; A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond; IOCP: (Busy=0,Free=1000,Min=8,Max=1000), WORKER: (Busy=2,Free=32765,Min=8,Max=32767), Local-CPU: n/a ---> StackExchange.Redis.RedisConnectionException: SocketFailure on redacted.redis.cache.windows.net:6380/Interactive, origin: EndReading, input-buffer: 0, outstanding: 36, last-read: 20s ago, last-write: 0s ago, unanswered-write: 18s ago, keep-alive: 60s, pending: 0, state: ConnectedEstablished, in: 0, ar: 0, last-heartbeat: 0s ago, last-mbeat: 0s ago, global: 0s ago, mgr: Inactive, err: never ---> System.IO.IOException: Unable to read data from the transport connection: A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond. ---> System.Net.Sockets.SocketException: A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond
at System.Net.Sockets.Socket.EndReceive(IAsyncResult asyncResult)
at System.Net.Sockets.NetworkStream.EndRead(IAsyncResult asyncResult)
--- End of inner exception stack trace ---
at System.Net.Security._SslStream.EndRead(IAsyncResult asyncResult)
at StackExchange.Redis.PhysicalConnection.EndReading(IAsyncResult result) in c:\code\StackExchange.Redis\StackExchange.Redis\StackExchange\Redis\PhysicalConnection.cs:line 858
--- End of inner exception stack trace ---
--- End of inner exception stack trace ---
at StackExchange.Redis.ConnectionMultiplexer.ThrowFailed[T](TaskCompletionSource`1 source, Exception unthrownException) in c:\code\StackExchange.Redis\StackExchange.Redis\StackExchange\Redis\ConnectionMultiplexer.cs:line 2000
at Microsoft.Extensions.Caching.Redis.RedisExtensions.<HashMemberGetAsync>d__2.MoveNext()
at Microsoft.Extensions.Caching.Redis.RedisCache.<GetAndRefreshAsync>d__20.MoveNext()
at Microsoft.Extensions.Caching.Redis.RedisCache.<GetAsync>d__12.MoveNext()

Then on 2022-05-30T11:53:25.3884411Z a different exception happens

StackExchange.Redis.RedisConnectionException: SocketFailure on EVAL ---> System.IO.IOException: Unable to read data from the transport connection: A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond. ---> System.Net.Sockets.SocketException: A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond
at System.Net.Sockets.Socket.EndReceive(IAsyncResult asyncResult)
at System.Net.Sockets.NetworkStream.EndRead(IAsyncResult asyncResult)
--- End of inner exception stack trace ---
at System.Net.Security._SslStream.EndRead(IAsyncResult asyncResult)
at StackExchange.Redis.PhysicalConnection.EndReading(IAsyncResult result) in c:\code\StackExchange.Redis\StackExchange.Redis\StackExchange\Redis\PhysicalConnection.cs:line 858
--- End of inner exception stack trace ---
at Microsoft.Extensions.Caching.Redis.RedisExtensions.<HashMemberGetAsync>d__2.MoveNext()
at Microsoft.Extensions.Caching.Redis.RedisCache.<GetAndRefreshAsync>d__20.MoveNext()
at Microsoft.Extensions.Caching.Redis.RedisCache.<GetAsync>d__12.MoveNext()
at Microsoft.Extensions.Caching.Distributed.DistributedCacheExtensions.<GetStringAsync>d__7.MoveNext()

After that and up until 2022-05-30T12:19:32.5937214Z it went back to "No connection" exceptions (until manual restart)

StackExchange.Redis.RedisConnectionException: No connection is available to service this operation: EVAL; A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond; IOCP: (Busy=0,Free=1000,Min=8,Max=1000), WORKER: (Busy=2,Free=32765,Min=8,Max=32767), Local-CPU: n/a ---> StackExchange.Redis.RedisConnectionException: SocketFailure on redacted.redis.cache.windows.net:6380/Interactive, origin: EndReading, input-buffer: 0, outstanding: 36, last-read: 20s ago, last-write: 0s ago, unanswered-write: 18s ago, keep-alive: 60s, pending: 0, state: ConnectedEstablished, in: 0, ar: 0, last-heartbeat: 0s ago, last-mbeat: 0s ago, global: 0s ago, mgr: Inactive, err: never ---> System.IO.IOException: Unable to read data from the transport connection: A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond. ---> System.Net.Sockets.SocketException: A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond
at System.Net.Sockets.Socket.EndReceive(IAsyncResult asyncResult)
at System.Net.Sockets.NetworkStream.EndRead(IAsyncResult asyncResult)
--- End of inner exception stack trace ---
at System.Net.Security._SslStream.EndRead(IAsyncResult asyncResult)
at StackExchange.Redis.PhysicalConnection.EndReading(IAsyncResult result) in c:\code\StackExchange.Redis\StackExchange.Redis\StackExchange\Redis\PhysicalConnection.cs:line 858
--- End of inner exception stack trace ---
--- End of inner exception stack trace ---
at StackExchange.Redis.ConnectionMultiplexer.ExecuteSyncImpl[T](Message message, ResultProcessor`1 processor, ServerEndPoint server) in c:\code\StackExchange.Redis\StackExchange.Redis\StackExchange\Redis\ConnectionMultiplexer.cs:line 2047
at StackExchange.Redis.RedisBase.ExecuteSync[T](Message message, ResultProcessor`1 processor, ServerEndPoint server) in c:\code\StackExchange.Redis\StackExchange.Redis\StackExchange\Redis\RedisBase.cs:line 81
at StackExchange.Redis.RedisDatabase.ScriptEvaluate(String script, RedisKey[] keys, RedisValue[] values, CommandFlags flags) in c:\code\StackExchange.Redis\StackExchange.Redis\StackExchange\Redis\RedisDatabase.cs:line 1052
at Microsoft.Extensions.Caching.Redis.RedisExtensions.HashMemberGet(IDatabase cache, String key, String[] members)
at Microsoft.Extensions.Caching.Redis.RedisCache.GetAndRefresh(String key, Boolean getData)
at Microsoft.Extensions.Caching.Distributed.DistributedCacheExtensions.GetString(IDistributedCache cache, String key)

This particular instance did not have many timeout exceptions, but some others had many timeout exceptions mixed in with aforementioned connection exceptions.

ghost avatar May 31 '22 09:05 ghost

I may have found the problem. We are using Microsoft.Extensions.Caching.Redis which depends on StackExchange.Redis.StrongName and that was deprecated like 5 years ago. We're going to update to Microsoft.Extensions.Caching.StackExchangeRedis which is using StackExchange.Redis and then we'll see whether these problems are still happening.

ghost avatar Jun 02 '22 07:06 ghost

@eggpa were you able to get this resolved? We have a very similar issue, although a different dev environment:

  • azure app service node app

we have to manually restart to fix once this occurs for us, and the recycle process after an environment var change causes this for us most often. We are using azure redis service as well here.

chazix avatar Aug 15 '22 08:08 chazix

@chazix yes and no.

First of all, we have to accept that timeouts and connection failures will happen. Even when service starts successfully it sometimes experiences issues with Redis causing random errors for the users. So we added decorator on IDistributedCache which catches exceptions and then fallback to MemoryCache.

Secondly, we are using Microsoft's dependency injection with HttpClientFactory and were calling Redis to configure HttpClient.

services.AddHttpClient<IClient, Client>()
  .ConfigureHttpClient((provider, client) => client.BaseAddress = provider.GetService<IDistributedCache>().Get("url"));

We are aware blocking async call is not good for performance, but HttpClients are cached so it's not a major issue. However, the problem with this is that Microsoft's implementation of HttpClientFactory caches not only HttpClient, but also any exceptions which occur during configuration. As the result single Redis exception during startup was able to take down the entire service.

Due to reasons we haven't yet released these changes so I'm not sure whether this will resole the original problem, but I believe exception caching was the source of our issues.

ghost avatar Aug 22 '22 05:08 ghost

Checking in here because some fixes that came later likely improve a few components in play here - any chance you've tried with the latest version of the StackExchange.Redis library and see if it more gracefully handles the interruptions? I've made a few improvements here which I hope at least improve what's happening in your case.

NickCraver avatar Oct 16 '22 15:10 NickCraver

Thank you for letting me know. We will update but I don't know when it will reach production.

Currently we are using StackExchange.Redis 2.5.61. Failure rate went down significantly after we have added workaround to handle issues with HttpClientFactory. We are seeing failure rate of 360 out of 41,000,000 (0.001%). There were 7 RedisConnectionException, all of which happened within 3 milliseconds so it's a single failure that resulted in multiple errors, and 353 RedisTimeoutException exceptions, which also happened in a few batches so I guess the problem has occurred only a few times and took a bit longer to recover.

In my opinion this issue can be closed as main issue was not with Redis client but due to HttpClientFactory caching timeout exception.

ghost avatar Oct 26 '22 08:10 ghost

Roger, closing out to tidy up :)

NickCraver avatar Nov 23 '22 19:11 NickCraver