StackExchange.Redis icon indicating copy to clipboard operation
StackExchange.Redis copied to clipboard

Unexpected response to UNWATCH: SimpleString: QUEUED when using LockTakeAsync

Open rachlenko opened this issue 9 months ago • 15 comments

Description

I'm encountering an intermittent error when using LockTakeAsync to implement a check-and-set pattern. The method is being used to check if a key exists and create it if not. Error message:Unexpected response to UNWATCH: SimpleString: QUEUED

Environment

Using Microsoft.Extensions.Caching.StackExchangeRedis Operation: Simple key existence check with LockTakeAsync

Steps to Reproduce

Attempt to check if a key exists using LockTakeAsync If it doesn't exist, create the key The error occurs intermittently during this operation

rachlenko avatar Mar 23 '25 13:03 rachlenko

Environment questions:

  • library version?
  • what is the server here? Redis? Valkey? Garnet? Something else? And version?
  • are you using an intermediary proxy?
  • is this hosted, for example AWS/Azure/etc? The host may be adding a proxy on the edge, being the point

mgravell avatar Mar 23 '25 14:03 mgravell

Also: version of Microsoft.Extensions.Caching.StackExchangeRedis? Pretty sure the current code doesn't use the lock API

mgravell avatar Mar 23 '25 14:03 mgravell

  1. Server redis,
  2. no proxies,
  3. hosted in the linux vm running on vmware,
  4. I'm using the latest version of StackExchange.Redis, 2.7.27 This lib Microsoft.Extensions.Caching.StackExchangeRedis uses StackExchange.Redis The TakeLockAsync function is from StackExchange.Redis

rachlenko avatar Mar 26 '25 20:03 rachlenko

Also: version of Microsoft.Extensions.Caching.StackExchangeRedis? Pretty sure the current code doesn't use the lock API

I can send you the test which reproduce the problem

rachlenko avatar Mar 26 '25 20:03 rachlenko

A failing test would be perfect, but I'd still like to know the version of M.E.C.SER that you're using - it is highly relevant

mgravell avatar Mar 27 '25 08:03 mgravell

I don't use any methods from Microsoft.Extensions.Caching.StackExchangeRedis library. I only use StackExchange.Redis library.

rachlenko avatar Mar 30 '25 13:03 rachlenko

Using Microsoft.Extensions.Caching.StackExchangeRedis ... I don't use any methods from Microsoft.Extensions.Caching.StackExchangeRedis library.

Well that slowed me down for no reason, then!

I'm using the latest version of StackExchange.Redis, 2.7.27

2.7.27 is over a year old; latest is 2.8.31

Anyway! Checking the code, LockTakeAsync is implemented with StringSetAsync, and has been for 11 years - it doesn't use transactions (MULTI/EXEC/UNWATCH etc), because it doesn't have to - SETNX has existed since Redis 1.0, and SET EX NX has existed since Redis 2.6.12, so: 12 years (literally to the date).

If you want me to investigate, I'm going to need a repro, because the method you cite doesn't use the feature that is reporting a problem.

mgravell avatar Mar 31 '25 09:03 mgravell

Maybe you can suggest me an alternative method, because as I proved by test, the problem arrives when I use LockTakeAsync.

rachlenko avatar Mar 31 '25 15:03 rachlenko

@rachlenko I'm trying to understand if you're being intentionally dense here. This code does not call what you're asserting. Marc has asked for a repro multiple times. You have proven nothing at all. If you're unwilling to provide a repro (so we can show you were you're misunderstanding), then that's fine - godspeed, and we'll close this issue out.

NickCraver avatar Mar 31 '25 15:03 NickCraver

@NickCraver Thank you. I'll prepare the repo with test, and send the link. Many thanks for your help.

rachlenko avatar Mar 31 '25 18:03 rachlenko

I see the test provided, thanks. Looking at that, I suspect the problem is coming in from LockReleaseAsync - which does use a transaction, but which wasn't mentioned so I didn't look. I'll see if I can repro - thanks.

mgravell avatar Apr 01 '25 14:04 mgravell

I have your repo cloned locally (I won't paste any of it here); however... it isn't runnable nor is it clear what I'm meant to do with it; I try not to connect to random endpoints as a matter of routine, but I also note that the configured endpoints are not routable, so I couldn't if I tried. That's probably a good thing. Overall... I'm not sure what you want me to do to see the magic happen.

I do see a Delete method that would call into LockReleaseAsync, but: nothing calls that Delete method. I see code that calls LockTakeAsync via TrySet, but as previously discussed: this doesn't actually use transactions. It shouldn't cause this problem, but I note with interest that locker.TrySet isn't awaited in /lock/{key}, which means it returns 200 OK without any knowledge of what happened. This might lead to people thinking they have a lock when they don't, but as I emphasize: it shouldn't cause this problem.

Do you have anything directly runnable that shows the problem you're reporting? Ideally with minimal additional things like UI frameworks. For example, here's a console that runs lock/unlock over a bunch of workers (including one rogue worker who tries to unlock things that it doesn't hold):

using StackExchange.Redis;

using var muxer = ConnectionMultiplexer.Connect("127.0.0.1:6379");
var db = muxer.GetDatabase();

RedisKey key = "mykey";
var workers = new Task<int>[10];
for (int i = 0; i < workers.Length; i++)
{
    workers[i] = RunWorker(key, db, i);
}
Console.WriteLine("Awaiting completion...");
int rogue = await Task.Run(() => RunRogueWorker(key, db));
int total = 0;
foreach (var worker in workers)
{
    total += await worker;
}
Console.WriteLine($"Total retries: {total}; rogue: {rogue} releases");


async static Task<int> RunWorker(RedisKey key, IDatabase db, int id)
{
    var name = $"Worker {id}";
    Console.WriteLine($"Starting {name}...");
    var rand = new Random(Seed: id);
    int retries = 0;
    for (int i = 0; i < 50; i++)
    {
        while (!await db.LockTakeAsync(key, name, TimeSpan.FromSeconds(5)))
        {
            retries++;
            await Task.Delay(100);
        }
        // fake work
        await Task.Delay(rand.Next(10, 100));

        // unlock
        var unlock = await db.LockReleaseAsync(key, name);
        Console.WriteLine($"Work... {name} ({i}; {unlock})");
    }
    Console.WriteLine($"Completed {name} with {retries} retries");
    return retries;
}

async static Task<int> RunRogueWorker(RedisKey key, IDatabase db)
{
    // randomly attempt to release locks that I don't have...
    int fail = 0;
    for (int i = 0; i < 1000; i++)
    {
        var result = await db.LockReleaseAsync(key, Guid.NewGuid().ToString());
        if (result) fail++;
        await Task.Delay(TimeSpan.FromMilliseconds(10));
    }
    Console.WriteLine($"Rogue complete; {fail} releases");
    return fail;
}

It runs to completion without anything like you're reporting, with final output for example:

Total retries: 1917; rogue: 0 releases

mgravell avatar Apr 01 '25 15:04 mgravell

Observation from Locker.cs:

return await db.LockTakeAsync(key, key, TimeSpan.FromSeconds(10));
...
await db.LockReleaseAsync(key, key);

Once again, this shouldn't cause the error reported, but: using the key as the value is a really bad idea - the purpose of the value here is, in part, to prevent a worker from releasing someone else's lock, i.e.

  • worker A successfully takes a lock with expiration X
  • worker A starts the work
  • X time passes, the lock is effectively now released automatically
  • worker B successfully takes a lock with expiration X, because it is available
  • worker A comes back in and releases the lock (because the value matched), despite not actually holding it any longer

Lock expiration is always undesirable and usually indicates an error scenario, but by reusing the key as the value, you've caused propagating chaos - B thinks it has the lock, but A releases it; for all we know, C now acquires it, and B then releases it; and so on and so on - the lock becomes completely unreliable as soon as the value is shared between separate callers. Worse: once this spiral has started, we don't need to wait for repeated expiration between these incorrect outcomes.

Side note: if A is taking a non-trivial amount of time, there is a LockExtendAsync method that attempts to refresh the expiration without releasing the lock, also indicating whether this was successful. This should be used periodically by long running operations to prevent lock expiration. Importantly: if you try to extend a lock and get a false response, you should abort your operation, because you let the lock expire in the background: you no longer hold the lock.

mgravell avatar Apr 01 '25 15:04 mgravell

I'd like to thank you again for your explanation of the problem. Based on my current understanding, I need to find a workaround. Alternatively, I could use an alternative method, which should be atomic like LockExtendAsync. What would you recommend as the best approach?

rachlenko avatar Apr 04 '25 07:04 rachlenko

None of my additional comments should have caused (or indeed not caused) the problem you reported re UNWATCH, so: they're unrelated. The "workaround" to you not currently using LockExtendAsync is... to use LockExtendAsync. If I'm misunderstanding you, please be very explicit what exactly you're trying to do (and what your constraints are) - the details are really important.

mgravell avatar Apr 04 '25 10:04 mgravell