StackExchange.Redis

[BUG] Migrate command for certain keys blocking Primary, and causing Failover.

Open javedsha opened this issue 1 year ago • 4 comments

Describe the bug

During a re-shard operation, we run the MIGRATE command for all keys in the affected hash slots. For some particular keys, the MIGRATE command times out. Because MIGRATE is a blocking command, it blocks the primary; Redis then concludes the primary is down and starts a failover.

This is causing drops in cache availability, and we don't want Redis to fail over.

We tried reducing the timeout to 3 seconds, but we still get the same behaviour.

Code:

await db.KeyMigrateAsync(key, endpoint, timeoutMilliseconds: 3000);

Even though we explicitly set the timeout to 3 seconds, the 4-second timeout from the ConnectionMultiplexer is applied.

StackExchange.Redis error logs:

TimeStamp: 2024-03-08T16:53:32.534223Z Timeout awaiting response (outbound=0KiB, inbound=0KiB, 4100ms elapsed, timeout is 4000ms), command=MIGRATE, next: some_random_key, inst: 0, qu: 0, qs: 0, aw: False, bw: SpinningDown, rs: ReadAsync, ws: Idle, in: 0, last-in: 2, cur-in: 0, sync-ops: 0, async-ops: 2193844, serverEndpoint: 172.20.0.6:6380, conn-sec: 670.01, aoc: 0, mc: 1/1/0, mgr: 10 of 10 available, clientName: mtcache000002(SE.Redis-v2.6.116.40240), PerfCounterHelperkeyHashSlot: 9271, IOCP: (Busy=0,Free=1000,Min=1,Max=1000), WORKER: (Busy=1,Free=32766,Min=8,Max=32767), POOL: (Threads=6,QueuedItems=0,CompletedItems=6635139), v: 2.6.116.40240 (Please take a look at this article for some common client-side issues that can cause timeouts: https://stackexchange.github.io/StackExchange.Redis/Timeouts),

We get two such errors before the Redis failover. In Redis, node_timeout is set to 5 seconds.

Redis Logs:

16:53:35.041 * FAIL message received from 1a52537ed371931ec4436e02afdaae61fd061c17 about 42b37d2039622543514545a6cba3807e4db0b776
16:53:35.133 # Start of election delayed for 805 milliseconds (rank #0, offset 1269560724769).

The MIGRATE timeouts and Redis's decision to fail over happen within seconds of each other, which indicates that the MIGRATE timeout is causing the failover.

Update:

The first blocking key is 300 MB in size and holds 2.3 million entries. It is a hashset.

So what is the recommended way to migrate a hashset?

javedsha avatar Mar 08 '24 22:03 javedsha

On this part:

Even though we are explicitly setting the timeout to 3 seconds, it takes the 4 seconds timeout from ConnectionMultiplexer.

Timeouts are evaluated on the heartbeat, which by default runs once per second (adjustable in configuration); that's why you may see up to 4 seconds here.
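A minimal sketch of where that heartbeat setting lives, assuming the `ConfigurationOptions.HeartbeatInterval` property found in recent 2.6.x releases. The endpoint is the one from the error log above, and the 250 ms interval is purely illustrative, not a recommendation.

```csharp
using System;
using StackExchange.Redis;

// Endpoint taken from the error log in this issue.
var options = ConfigurationOptions.Parse("172.20.0.6:6380");

// Client-side wait for async replies, in milliseconds (4000 matches the log).
options.AsyncTimeout = 4000;

// Assumption: timeouts are checked once per heartbeat, so a shorter interval
// narrows the gap between the configured timeout and the reported elapsed time.
options.HeartbeatInterval = TimeSpan.FromMilliseconds(250); // illustrative value

using var mux = await ConnectionMultiplexer.ConnectAsync(options);
Console.WriteLine($"Connected: {mux.IsConnected}");
```

A shorter heartbeat means the client checks for timeouts more often, so it likely trades some extra client-side work for tighter timeout reporting.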

On the advice: it really depends what the topology is for keys of that size. They take as long to move as Redis takes (not a client issue - nothing we can change there). Do you have any idea what the hardware of these servers and the bandwidth/latency in-between the endpoints looks like?

If the problem is purely on the client and you're not seeing server-wide stalls from this, then you may simply want to issue the command on another multiplexer just for the migration so it's only blocking/waiting on itself. To be super clear: the client isn't migrating anything, we're only issuing a small command telling Redis server to do so, nothing flows through the client in this scenario.
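The separate-multiplexer idea could look roughly like the sketch below. `MigrationSketch`, `MigrateOnDedicatedConnectionAsync`, the client name, and the 30-second `AsyncTimeout` are hypothetical choices for illustration. It also assumes, as the errors in this thread suggest, that the client-side wait is governed by the multiplexer's timeout rather than by the `timeoutMilliseconds` argument.

```csharp
using System;
using System.Net;
using System.Threading.Tasks;
using StackExchange.Redis;

public static class MigrationSketch
{
    // Hypothetical helper: opens a second multiplexer used only for MIGRATE,
    // so a slow migration waits on its own connection.
    public static async Task MigrateOnDedicatedConnectionAsync(
        string sourceConfiguration, EndPoint target, RedisKey key)
    {
        var options = ConfigurationOptions.Parse(sourceConfiguration);
        options.ClientName = "migration-only"; // hypothetical name, eases log reading
        options.AsyncTimeout = 30_000;         // illustrative client-side wait, in ms

        using var migrationMux = await ConnectionMultiplexer.ConnectAsync(options);
        IDatabase db = migrationMux.GetDatabase();

        // Same call as in the report; the server still does the actual transfer.
        await db.KeyMigrateAsync(key, target, timeoutMilliseconds: 3000);
    }
}
```

This only isolates the client-side waiting. If the server itself stalls during MIGRATE, a second connection does not change that.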

NickCraver avatar Mar 09 '24 15:03 NickCraver

Out of curiosity, I'd also be interested in seeing what SLOWLOG says here; as Nick says, the actual migration is done by the server, not the client. I wonder whether SLOWLOG reflects this.
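For reference, a rough sketch of reading SLOWLOG from the client through `IServer.SlowlogGetAsync`. The endpoint is the one from the error log above. Setting `AllowAdmin` is an assumption that the client treats SLOWLOG as an admin command, and the entry count of 10 is arbitrary.

```csharp
using System;
using StackExchange.Redis;

var options = ConfigurationOptions.Parse("172.20.0.6:6380"); // endpoint from the log above
options.AllowAdmin = true; // assumption: SLOWLOG is gated as an admin command

using var mux = await ConnectionMultiplexer.ConnectAsync(options);
IServer server = mux.GetServer("172.20.0.6:6380");

// Fetch the 10 most recent slow log entries and print the MIGRATE ones.
CommandTrace[] entries = await server.SlowlogGetAsync(10);
foreach (CommandTrace entry in entries)
{
    bool isMigrate = entry.Arguments.Length > 0 &&
        string.Equals(entry.Arguments[0].ToString(), "MIGRATE",
                      StringComparison.OrdinalIgnoreCase);
    if (isMigrate)
    {
        Console.WriteLine($"{entry.Time:O} took {entry.Duration.TotalMilliseconds} ms");
    }
}
```

If the MIGRATE entries show durations of several seconds, that would support the view that the time is spent on the server rather than in the client.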

mgravell avatar Mar 09 '24 21:03 mgravell

I agree that the issue is on the server side (confirmed with SLOWLOG). I tried migrating the same key directly on the server using the MIGRATE command, and I see the same issue: primary failover. I have logged a ticket with the Redis team to check why MIGRATE is blocking either primary for such a long time.

@NickCraver - I tried running the migration again with the migrate timeout set to 1 second and the ConnectionMultiplexer timeout set to 10 seconds. It still fails, and it ignores the timeout set on the migrate function.

await db.KeyMigrateAsync(big_hash_key, targetEndpoint, timeoutMilliseconds: 1000, migrateOptions: MigrateOptions.Replace);

Error (as you can see, the ConnectionMultiplexer timeout is the one being applied):

Exception: Timeout awaiting response (outbound=0KiB, inbound=0KiB, 10040ms elapsed, timeout is 10000ms), command=MIGRATE, next: MIGRATE big_hash, inst: 0, qu: 0, qs: 1, aw: False, bw: Inactive, rs: ReadAsync, ws: Idle, in: 0, in-pipe: 0, out-pipe: 0, last-in: 0, cur-in: 0, sync-ops: 0, async-ops: 42095, serverEndpoint: 172.20.1.24:6380, conn-sec: 252.33, aoc: 0, mc: 1/1/0, mgr: 10 of 10 available, clientName: 4a84b29464a5(SE.Redis-v2.6.116.40240), PerfCounterHelperkeyHashSlot: 15796, IOCP: (Busy=0,Free=1000,Min=1,Max=1000), WORKER: (Busy=1,Free=32766,Min=20,Max=32767), POOL: (Threads=10,QueuedItems=0,CompletedItems=185309), v: 2.6.116.40240 (Please take a look at this article for some common client-side issues that can cause timeouts: https://stackexchange.github.io/StackExchange.Redis/Timeouts) (StackExchange.Redis.RedisTimeoutException)

The timeout passed in the KeyMigrateAsync command seems to be ignored.

I can run MIGRATE directly on the Redis server with the migrate timeout set to 1 second and it works, i.e. sometimes the migration finishes and sometimes it fails, but the main thing is that it doesn't cause a primary failover. How can I achieve the same with StackExchange.Redis?

javedsha avatar Mar 10 '24 23:03 javedsha

I notice in your command exception there's a list of migrates running. Are you sure it's this one that's throwing the exception? (It defaults to the multiplexer's timeout if not specified.) I'm not sure why we'd ever ignore the argument here, unless we just didn't even get to this migration: for example, because the one before it, issued without a timeout, errored, causing the one in the error not to execute, or not to execute in time.
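One way to check which migration is actually timing out is to issue them strictly one at a time and record the key on failure. The sketch below is only an illustration: `MigrateDiagnostics` and `MigrateSequentiallyAsync` are hypothetical names, and the timeout and `MigrateOptions.Replace` values simply mirror the call shown earlier in the thread.

```csharp
using System;
using System.Collections.Generic;
using System.Net;
using System.Threading.Tasks;
using StackExchange.Redis;

public static class MigrateDiagnostics
{
    // Hypothetical helper: migrates keys one at a time so that a timeout can be
    // attributed to a specific key rather than to an earlier queued MIGRATE.
    public static async Task<List<RedisKey>> MigrateSequentiallyAsync(
        IDatabase db, EndPoint target, IEnumerable<RedisKey> keys)
    {
        var timedOut = new List<RedisKey>();
        foreach (RedisKey key in keys)
        {
            try
            {
                await db.KeyMigrateAsync(key, target,
                    timeoutMilliseconds: 1000,
                    migrateOptions: MigrateOptions.Replace);
            }
            catch (RedisTimeoutException ex)
            {
                // The client gave up waiting; the server may still be busy.
                Console.WriteLine($"MIGRATE timed out for key '{key}': {ex.Message}");
                timedOut.Add(key);
            }
        }
        return timedOut;
    }
}
```

Note that a client-side timeout does not cancel the command on the server, so a MIGRATE issued right after a timed-out one may still queue behind it.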

NickCraver avatar Mar 11 '24 13:03 NickCraver