
Upgrade 1.2.6 => 2.6.7 caused GET operations to become 2-4 times slower?

tlutlu opened this issue 2 years ago (17 comments)

I created two simple console apps with the same code: set a single cache entry, get it 10 times on a single thread, compiled in release mode. The test was performed on the same client VM, pointing to the same Redis cluster (v6) using the same connection string.

Left: StackExchange.Redis 1.2.6 + .NET Framework 4.5.2
Right: StackExchange.Redis 2.6.7 + .NET Framework 4.8

GET with the new driver is 2-4 times slower. I'm wondering whether there is a TURBO button I haven't pressed in the new driver. Did I miss anything?

[screenshot: side-by-side GET timings for the two versions]

```csharp
using System;
using System.Diagnostics;
using System.Threading.Tasks;
using StackExchange.Redis;

// _redisConnectionString and version (the label printed in the output) are defined elsewhere in the test app.

static async Task Main(string[] args)
{
    // Key size in KB and iteration count can be passed as command-line arguments.
    var sizeOfKeyInKB = args.Length < 1 || string.IsNullOrWhiteSpace(args[0]) ? 10000 : int.Parse(args[0]);
    var iteration = args.Length < 2 || string.IsNullOrWhiteSpace(args[1]) ? 10 : int.Parse(args[1]);

    ConnectionMultiplexer redis = ConnectionMultiplexer.Connect(_redisConnectionString);
    IDatabase db = redis.GetDatabase();
    TestSync(sizeOfKeyInKB, iteration, db);
    await TestAsync(sizeOfKeyInKB, iteration, db);
}

private static void TestSync(int sizeOfKeyInKB, int iteration, IDatabase db)
{
    var key = ($"{version}_Sync_" + Guid.NewGuid().ToString()).ToLower();

    var timer = new Stopwatch();
    timer.Start();
    db.StringSet(key, new string('*', sizeOfKeyInKB * 1000));
    timer.Stop();
    Console.WriteLine($"{DateTime.Now} PST - {version}_Set key {key} - {sizeOfKeyInKB / 1000} MB - {timer.Elapsed.TotalSeconds:0.00}s");

    for (int i = 1; i <= iteration; i++)
    {
        timer.Restart();
        try
        {
            var cache = db.StringGet(key);
            if (!cache.HasValue)
                Console.WriteLine($"{DateTime.Now} PST - Failed get cache {key}");
        }
        catch (Exception ex)
        {
            Console.WriteLine($"{DateTime.Now} PST - {version}_GET Exception {ex}");
        }
        timer.Stop();
        Console.WriteLine($"{DateTime.Now} PST - {version}_GET Iteration {i} - {sizeOfKeyInKB / 1000} MB - {timer.Elapsed.TotalSeconds:0.00}s");
    }

    timer.Restart();
    db.KeyDelete(key);
    timer.Stop();
    Console.WriteLine($"{DateTime.Now} PST - {version}_Delete key {key} - {sizeOfKeyInKB / 1000} MB - {timer.Elapsed.TotalSeconds:0.00}s");
}

private static async Task TestAsync(int sizeOfKeyInKB, int iteration, IDatabase db)
{
    var key = ($"{version}_Async_" + Guid.NewGuid().ToString()).ToLower();

    var timer = new Stopwatch();
    timer.Start();
    await db.StringSetAsync(key, new string('*', sizeOfKeyInKB * 1000));
    timer.Stop();
    Console.WriteLine($"{DateTime.Now} PST - {version}_SetAsync key {key} - {sizeOfKeyInKB / 1000} MB - {timer.Elapsed.TotalSeconds:0.00}s");

    for (int i = 1; i <= iteration; i++)
    {
        timer.Restart();
        try
        {
            var cache = await db.StringGetAsync(key);
            if (!cache.HasValue)
                Console.WriteLine($"{DateTime.Now} PST - Failed get cache {key}");
        }
        catch (Exception ex)
        {
            Console.WriteLine($"{DateTime.Now} PST - {version}_GetAsync Exception {ex}");
        }
        timer.Stop();
        Console.WriteLine($"{DateTime.Now} PST - {version}_GetAsync Iteration {i} - {sizeOfKeyInKB / 1000} MB - {timer.Elapsed.TotalSeconds:0.00}s");
    }

    timer.Restart();
    await db.KeyDeleteAsync(key);
    timer.Stop();
    Console.WriteLine($"{DateTime.Now} PST - {version}_DeleteAsync key {key} - {sizeOfKeyInKB / 1000} MB - {timer.Elapsed.TotalSeconds:0.00}s");
}
```

tlutlu, Feb 15 '23

Hello! I have the same issue:

Upgrading from 2.2.6 to 2.6.81 (tested multiple versions, including the latest)

Test Configuration:

.NET Framework 4.7.2 (no, we can't use .NET 6 at the moment!)

1 test instance, 300 test threads, 200 min IO threads, 10 Redis multiplexers

Test duration: 30 minutes. Reporting threshold: 4 seconds.

Redis 2.2.4 (old): no problems whatsoever.

Redis 2.6.81 (latest): at some point it went into timeouts (exceptions), then multiple responses with a total time of more than 5 seconds:

```
1280582,5431,0 redis.loadgenerator.exe Warning: 0 : Error retrieving object from RedisCache :- Key = sdsserviceskey_nikLocal, Type = String,
Error=StackExchange.Redis.RedisTimeoutException: Timeout performing GET (5000ms), next: SETEX sdsserviceskey_nikLocal,
inst: 0, qu: 0, qs: 0, aw: False, bw: SpinningDown, rs: ReadAsync, ws: Idle, in: 0, last-in: 2, cur-in: 0,
sync-ops: 1317593, async-ops: 2, serverEndpoint: 20.165.71.142:15000, conn-sec: 882.04, mc: 1/1/0, mgr: 0 of 10 available,
clientName: A4X76UL4C000000(SE.Redis-v2.6.96.30123), PerfCounterHelperkeyHashSlot: 1892,
IOCP: (Busy=0,Free=1000,Min=200,Max=1000), WORKER: (Busy=0,Free=2047,Min=200,Max=2047), v: 2.6.96.30123
(Please take a look at this article for some common client-side issues that can cause timeouts:
https://stackexchange.github.io/StackExchange.Redis/Timeouts)
   at StackExchange.Redis.ConnectionMultiplexer.ExecuteSyncImpl[T](Message message, ResultProcessor`1 processor, ServerEndPoint server, T defaultValue) in /_/src/StackExchange.Redis/ConnectionMultiplexer.cs:line 1941
   at StackExchange.Redis.RedisBase.ExecuteSync[T](Message message, ResultProcessor`1 processor, ServerEndPoint server, T defaultValue) in /_/src/StackExchange.Redis/RedisBase.cs:line 62
   at StackExchange.Redis.RedisDatabase.StringGet(RedisKey key, CommandFlags flags) in /_/src/StackExchange.Redis/RedisDatabase.cs:line 2996
```

==================

Worker thread code (let me know if you want the full test source):

```csharp
// ct, rnd, st, wt, _requests, _redisCache and _startTime are defined elsewhere in the test harness.
while (true)
{
    if (ct.IsCancellationRequested)
        break;

    foreach (var request in _requests.Requests())
    {
        if (ct.IsCancellationRequested)
            break;

        var randomLong = rnd.NextInt64() % 1000;

        st.Reset();
        st.Start();

        var actualKey = $"{request.Key}{randomLong}";

        var value = "";

        bool readResult = _redisCache.ReadFromCache(actualKey, ref value);
        st.Stop();

        long writeMsec = 0;
        if (!readResult)
        {
            wt.Reset();
            wt.Start();
            var r = _redisCache.AddToCache(actualKey, TestConstants.Value, TimeSpan.FromSeconds(300), false);
            wt.Stop();
            writeMsec = wt.ElapsedMilliseconds;
        }

        st.Stop();
        if (st.ElapsedMilliseconds > 4000)
        {
            var timeOffset = DateTimeOffset.UtcNow.ToUnixTimeMilliseconds() - _startTime;
            Console.WriteLine($"{timeOffset},{st.ElapsedMilliseconds},{writeMsec}"); // {ThreadId}");
        }
    }
}
```

. . . . .

```csharp
public bool ReadFromCache<T>(string key, string region, ref T value)
{
    try
    {
        // A Redis cluster requires multi-key operations to be done in the same shard.
        // Prefix all keys within a region by the region name.
        // Note: region name uses a Redis hashtag {}.
        if (!string.IsNullOrEmpty(region))
            key = $"{region}_{key}";

        byte[] bytes = (byte[])GetRedisDatabaseConnection().HashGet(key: region, hashField: key);
        if (null == bytes)
            return false;

        using (var stream = new MemoryStream(bytes))
        {
            if (_useBinaryFormatter)
            {
                value = (T)new BinaryFormatter().Deserialize(stream);
            }
            else
            {
                string json = (new StreamReader(stream)).ReadToEnd();
                Type typeOfT = typeof(T);
                if (typeOfT == typeof(string))
                    value = (T)Convert.ChangeType(json, typeOfT);
                else
                    value = JsonConvert.DeserializeObject<T>(json);
            }
        }
        return true; // value was read and deserialized successfully
    }
    catch (Exception e)
    {
        TraceException(e, string.Format("Error retrieving object from RedisCache Hash:- Key = {0}, Region = {1}, Type = {2}", key, region, typeof(T).Name));
        return false;
    }
}

IDatabase GetRedisDatabaseConnection()
{
    return GetRedisDatabaseConnectionMultiplexer().GetDatabase();
}

ConnectionMultiplexer GetRedisDatabaseConnectionMultiplexer()
{
    // Round-robin across the configured multiplexers.
    int n = (_redisConnections.Length <= 1) ? 0 : GetNextConnectionMultiplexerNumber();
    return _redisConnections[n].Value;
}

int _lastNumber;

int GetNextConnectionMultiplexerNumber()
{
    // Lock-free increment with wrap-around via compare-and-swap.
    int initial, computed;
    do
    {
        initial = _lastNumber;
        computed = initial + 1;
        computed = computed > (_redisConnections.Length - 1) ? 0 : computed;
    }
    while (Interlocked.CompareExchange(ref _lastNumber, computed, initial) != initial);
    return computed;
}
```

.........

```csharp
public RedisCache(AppConfigManager mgr)
{
    var redisConnectionString = mgr.GetSetting("RedisCacheConnectionString");
    var useBinaryFormatter = false;

    // Set thread pool growth (for the Redis cache), see
    // https://github.com/StackExchange/StackExchange.Redis/tree/master/Docs/Timeouts.md
    int minThreads = mgr.GetSettingAsInt("NumberOfMinIoThreads", 200);
    ThreadPool.SetMinThreads(minThreads, minThreads);

    _useBinaryFormatter = useBinaryFormatter;
    var numberOfMultiplexers = mgr.GetSettingAsInt(Constants.NumberOfRedisConnectionMultiplexers, 10);
    if (numberOfMultiplexers < 1)
        throw new Exception($"{Constants.NumberOfRedisConnectionMultiplexers} configuration value should be greater than 0");
    this._enableTrace = mgr.IsEnabled(Constants.EnableCacheTraceConfigSettingName);
    _redisEndpointName = redisConnectionString.Substring(0, redisConnectionString.IndexOf(':'));
    _redisConnections = new Lazy<ConnectionMultiplexer>[numberOfMultiplexers];
    for (int i = 0; i < numberOfMultiplexers; i++)
    {
        _redisConnections[i] = new Lazy<ConnectionMultiplexer>(() =>
        {
            var redisConfig = ConfigurationOptions.Parse(redisConnectionString);
            return ConnectionMultiplexer.Connect(redisConfig);
        });
    }

    // Default is Absolute, so that the cache is self-healing.
    CacheExpirationType = CacheExpirationPolicyType.Absolute;
}
```

smalgin-esri, Mar 16 '23

Dear devs, we had to scramble to update our deployments because an innocuous Redis upgrade triggered a wholesale slowdown across the board!

Please take note & fix this.

I'll test the async Redis API and we'll see if that will do as a workaround.

Thank you.

smalgin-esri, Mar 16 '23

@smalgin-esri I'm very curious whether you saw this behavior in 2.6.90 as well, i.e. whether it's new to 2.6.96, where some locking behavior did shift slightly to avoid a deadlock scenario. I see that your socket manager pool is exhausted when this happens, so I'm curious if you're hitting a scenario we didn't factor in there.

NickCraver, Mar 17 '23

@NickCraver - the scenario is very simple: we read (and write when it expires) a single string value. 300 read threads, 20 multiplexers. We saw this behavior as soon as I upgraded from 2.4 to 2.6.86.

I tested 2.4, 2.6.86 and the latest (which is 2.6.96) - the problem began as early as 2.6.86. Also important: the test client should use .NET Framework!

Edit: the Redis upgrade also pulled in a newer Pipelines.Sockets.Unofficial. I haven't done the tests yet to figure out whether that's related.

smalgin-esri, Mar 17 '23

@smalgin-esri I think we may have different definitions of simple :) By default, for example, those 20 multiplexers are all using the same 10-thread SocketManager thread pool. What's the purpose of 20 multiplexers in this case? We might need to go to a memory dump here to see why you're locking up - the scenario you're in is definitely not a normal use case, nor a simple one.
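For illustration, a minimal sketch of pointing each multiplexer at its own SocketManager (or at the regular .NET thread pool) instead of the shared dedicated pool; whether that helps this workload is untested, and the factory class below is purely hypothetical:

```csharp
using StackExchange.Redis;

static class RedisConnectionFactory
{
    // Sketch: one multiplexer per slot, each with its own SocketManager so they
    // don't all contend for the single shared dedicated pool.
    public static ConnectionMultiplexer ConnectWithOwnSocketManager(string connectionString, int index)
    {
        var options = ConfigurationOptions.Parse(connectionString);

        // Option A: a dedicated SocketManager per multiplexer.
        options.SocketManager = new SocketManager($"mux-{index}");

        // Option B (alternative): route socket I/O onto the regular .NET thread pool.
        // options.SocketManager = SocketManager.ThreadPool;

        return ConnectionMultiplexer.Connect(options);
    }
}
```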

NickCraver, Mar 18 '23

@NickCraver - we use multiple (10) multiplexers to get optimal performance across parallel channels, keeping in mind the batching and pipelining that StackExchange.Redis performs on our behalf. We also currently use the synchronous methods.

What sticks out to you as unusual in our usage scenario? Is it the concurrent load (300 threads), the configuration (min IO threads, number of multiplexers), or the synchronous usage pattern (Get and HashGet)? Thanks.

nshampur, Mar 20 '23

@NickCraver I'll collect a dump for you. I'll also re-test the same scenario running on .NET 6, just to be sure it's platform-specific.

smalgin-esri, Mar 20 '23

@NickCraver I collected a dump, here

smalgin-esri, Mar 22 '23

Confirmed: identical code running on .NET 6 has no issues.

smalgin-esri, Mar 23 '23

@smalgin-esri @nshampur we've been swamped in other issues this week, so trying to catch up a bit:

  1. It's unusual to have that many multiplexers, especially sharing a socket manager - normally we'd see a migration to async code and a lot more efficiency there. High volume for the sync case isn't a primary goal, because it's tremendously inefficient overall; we don't intentionally gimp it in any way, but switching threads and contention is expensive.
  2. When you say "identical code running on NET6 has no issues" what does that mean? You see the same performance across version changes in the library? Or .NET 6 with latest SE.Redis shows about the same performance as 1.2.6 on .NET Full Framework?

Overall, processors don't have 300 cores (well, not yet anyway), so you're pausing and context-switching workloads in ways that aren't always predictable or optimizable. Async is a lot more predictable here and, with those kinds of numbers, absolutely what I'd recommend, but I also know it's viral and not always easy. If .NET 6+ is on the table for you (guessing so, because of the test above), you're going to get tons of threading and throughput benefits, because these core concepts got immense love in .NET Core and onward. But I'm not sure what is or isn't an option based on the brief descriptions above, so: asking :)
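For illustration, a rough sketch of what the read-through path could look like on the async API; GetRedisDatabaseConnection() is the helper from the code posted above, while the method name, key handling, and TTL here are just placeholders:

```csharp
// Sketch of an async read-through: try the cache, fall back to the value factory,
// then write the result back with a TTL. Serialization and error handling are
// trimmed for brevity.
public async Task<string> GetOrAddAsync(string key, Func<string> valueFactory, TimeSpan ttl)
{
    IDatabase db = GetRedisDatabaseConnection();

    RedisValue cached = await db.StringGetAsync(key);
    if (cached.HasValue)
        return cached; // implicit RedisValue -> string conversion

    string value = valueFactory();
    await db.StringSetAsync(key, value, ttl);
    return value;
}
```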

NickCraver, Mar 26 '23

@NickCraver I meant exactly this: the exact same test code, ported to a .NET 6 project AND referencing the same version of the Redis library, had no performance 'hiccups'.

Yes, we migrate to .NET Core wherever we can, but parts of our code are still on .NET Framework because of third-party dependencies.

At the moment we don't have any other workaround for this except using an older version of the Redis NuGet package.

Edit: I'll test async, thanks! That's our next step.

smalgin-esri, Apr 25 '23

@smalgin-esri did you find anything?

natl-set, May 23 '23

I've been having similar problems after updating from v1.2.6 to v2.6.96 within a .NET 6 web application. The only change we're making is updating StackExchange.Redis between those versions, and we go from no timeouts to fairly consistent timeouts. All calls to StackExchange.Redis use the async API.

Here's an example of one of the timeout exceptions:

```
Message=Timeout awaiting response (outbound=0KiB, inbound=0KiB, 59ms elapsed, timeout is 50ms), command=GET,
next: GET [redacted], inst: 0, qu: 0, qs: 1, aw: False, bw: Inactive, rs: ReadAsync, ws: Idle, in: 0, in-pipe: 0, out-pipe: 0,
last-in: 1, cur-in: 0, sync-ops: 0, async-ops: 99, serverEndpoint: [redacted].use1.cache.amazonaws.com:6379,
conn-sec: 452.35, mc: 1/1/0, mgr: 10 of 10 available, clientName: HUD010769(SE.Redis-v2.6.96.30123),
IOCP: (Busy=0,Free=1000,Min=1,Max=1000), WORKER: (Busy=1,Free=32766,Min=10,Max=32767),
POOL: (Threads=9,QueuedItems=0,CompletedItems=9771), v: 2.6.96.30123
(Please take a look at this article for some common client-side issues that can cause timeouts:
https://stackexchange.github.io/StackExchange.Redis/Timeouts)
```

jamiegs, Aug 01 '23

@jamiegs What you're likely seeing is that 2.x actually enforces timeouts for async operations, unlike 1.x. The message you have there looks correct to me, but 50ms is a very low timeout for a cloud service in general (as a ceiling). If you time your commands with profiling in 1.2.6, you'll see they're very likely spiking above your 50ms threshold; 1.2.6 just wasn't actually enforcing this for async paths.

NickCraver, Aug 01 '23

@NickCraver thanks for the response, that makes sense. What would you recommend as a reasonable production timeout value? I see the default is 5000ms, which seems high to me.

jamiegs, Aug 01 '23

It really depends on your use case and environment. What is the threshold at which you want to alert and no longer wait? For most apps, a slow response is better than a timeout/failure, so the default is higher. But under massive load, where 5 seconds lets far too many requests pile up (spiral of death), it should be lowered.
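For reference, both timeouts can be set on ConfigurationOptions or in the connection string; the host and the 2000ms values in this sketch are placeholders, not a recommendation:

```csharp
// Both timeouts are in milliseconds; the values here only show where the knobs live.
var options = ConfigurationOptions.Parse("myserver.example:6379"); // placeholder host
options.SyncTimeout = 2000;    // applies to sync calls such as StringGet
options.AsyncTimeout = 2000;   // applies to async calls such as StringGetAsync
options.ConnectTimeout = 5000; // time allowed for the initial connect/handshake

// Equivalent connection-string form:
// "myserver.example:6379,syncTimeout=2000,asyncTimeout=2000,connectTimeout=5000"
var mux = ConnectionMultiplexer.Connect(options);
```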

NickCraver, Aug 01 '23

> @smalgin-esri did you find anything?

I don't have the resources to fork StackExchange.Redis and dig in, so no: we just don't upgrade Redis.

The plan is to switch to async calls and then run the tests again.

smalgin-esri, Aug 02 '23