Receiving other thread HGETALL request result on high memory pressure and multiple threads
Hello,
While testing the StackExchange.Redis library (on HGETALL api) under high memory pressure to induce frequent OutOfMemory exceptions, I encountered cases where received data is valid but not expected and appears to be the query result from an other thread.
Tested with this program: https://github.com/jcaspes/StackExchange.Redis/tree/MemoryTester/toys/MemoryTester
In this program, each thread only performs infinite HGETALL operations on a key with its thread ID, so it should only retrieve data that it has written itself.
Here a sample output; we can see that thread 218 (Th218hT) retrieved valid data that was queried by thread 58 (Th58hT)
218 : Error in key, expected:Th218hT, actual:Th58hT, value is a valid but from an other query: True
Hello,
I reproduced deterministically the issue in this test program. For this corruption to happen, we need at least:
- 2 threads (if we use only synchrone commands)
- 2 exceptions thrown at specific location in the library StackExchange.Redis
Explanation step by step
- A Redis command is sent (eg. "GET key1") and proceeds to
PhysicalBridge.WriteMessageToServerInsideWriteLock. This method is responsible to first enqueue the command toPhysicalConnection._writtenAwaitingResponsethen to send the command to the Redis server. However, a first exception is thrown between these two operations, and the command is enqueued inPhysicalConnection._writtenAwaitingResponsebut not sent to the Redis server. - The catch block in
PhysicalBridge.WriteMessageToServerInsideWriteLockshould kill the connection, which should lead to a newPhysicalConnectionbeing created, with a new (clean) queue. However, a second exception occurs in this catch block, preventing this cleanup. - The catch block in
PhysicalBridge.WriteMessageTakingWriteLockSynctriggers and mark the command in error. - The command having a
WriteResultin error, it does not wait inConnectionMultiplexer.ExecuteSyncImpland return immediately in error. - Another command is sent (eg. "GET key2") and performs all the usual operations. Notably, it enqueues itself to
PhysicalConnection._writtenAwaitingResponseand is sent to the Redis server. Then, it waits the response inConnectionMultiplexer.ExecuteSyncImpl. - The Redis server answer for the command "GET key2", and the method
PhysicalConnection.MatchResultis performed. In this method, the first command ofPhysicalConnection._writtenAwaitingResponseis dequeued (here it is "GET key1" instead of "GET key2") and is signaled as being completed. In this case, the command "GET key1" is already finished so the signal is lost. However, the command "GET key2" is still waiting its signal. - Before the timeout of "GET key2", a third command is sent (eg. "GET key3") and performs all the usual operations. It enqueues to
PhysicalConnection._writtenAwaitingResponseand is sent to the Redis server. Then, it waits the response inConnectionMultiplexer.ExecuteSyncImpl. - The Redis server answer for the command "GET key3", dequeues the first command ("GET key2") and signal its completion.
- The command "GET key2" returns, with the value of "key3" instead of "key2".
- The command "GET key3" ends in timeout.
Fix
This particular problem is fixed by using the "HighIntegrity" configuration. Indeed, with this mode, each command has a unique value identifying it, and an echo is sent after each command with this specific value. When the Redis response is read at the step 6, it is with the identifier of the command "GET key2", whereas the first command in the queue is "GET key1", so the mismatch is detected. In the same way, the mismatch between the second response for "GET key3" is not sent to the second command "GET key2", and the corruption is prevented.
To do
In the program MemoryTester, we observed corruptions even with the flag HighIntegrity, so there should be another bug somewhere. For the moment, I will continue to dig to try to find the cause