Game hangs randomly
Alright, I'm gonna try to compile all I have in a single GitHub issue so this becomes easier to investigate. If anyone has an idea, feel free to post it here.
The game hangs randomly
IW4x has a glitch that makes the game completely hang: no crash, no stop error. This bug has existed since at least the 0.6.1 official release, but might be older. I could not reproduce it on the 0.6.0 release version.
Here are a few key points:
- This bug never happens on the host (whether they're dedicated or listen)
- When this bug happens, it happens for one or more clients, but not necessarily on every client at once
- This bug is very easily reproducible with BotWarfare, maybe due to the high activity from bots
- This bug happens on vanilla maps (so far Rust, Overgrown, and Terminal, I have not tried reproducing on other maps)
- This bug has a random chance of occurring at any time, but I've had high success reproducing it at round end/start (more details below)
- The game freezes and is unrecoverable without a debugger / memory editor.
Other characteristics that I'm a bit more uncertain about:
- This bug seemingly does not occur on vanilla MW2 nor on other clients afaik (I've failed to reproduce it on aiwnet)
- It occurs very rarely, but once it has happened, it has a higher chance of recurring unless the host restarts
- ~~So far, this bug can not be reproduced ever on an AMD CPU.~~
- This is probably unrelated to Anticheat, as I've managed to reproduce this bug with DISABLE_ANTICHEAT defined every single time
Information so far
I've reproduced this bug on Terminal and Overgrown and made several dumps using Visual Studio. When the game hangs, the state is usually the following:
- The main thread is stuck inside the StopSound subroutine (0x430C55), waiting on mss32.dll to stop all sounds.
- Another worker thread (which I believe is dedicated to streaming sounds from disk and nothing else) is stuck inside the SND_ExecuteStreamRead subroutine (0x64BAFE), waiting on mss32.dll to give a handle to play the sound.
- Every other thread is either sleeping, waiting for commands, or in some other state that does not look suspicious to me.
The reason these two threads (main thread & the stream worker thread) are waiting on mss32.dll is what makes the game freeze. Looking further, we can see that both of them are stuck in the same part of mss32: waiting on a mutex for resource availability. Normally, one thread would lock the mutex first, use the resource, then release the mutex, and the other thread arriving late to the party would wait for the lock, take the resource once it's available, and release the mutex again.
But here's what happens: for reasons I do not understand, at the end of a round, StopSound and SND_ExecuteStreamRead manage to fire at the exact same time and both appear to lock the mutex simultaneously, then hang forever waiting on each other to free the resource. I do not know why this happens, nor what underlying effect causes it, but this is where the freeze occurs. This could be the result of an earlier invalid operation or state, but at this time I have no idea which.
The precise subroutine in mss32 that hangs is AIL_lock_mutex at address 0x21101090.
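For reference, the normal hand-off described above (one thread locks, the other waits, then takes its turn) can be sketched in portable C++. This is only an illustration: std::mutex stands in for whatever mss32 uses internally, and every name here (resource_mutex, use_resource, run_two_threads) is hypothetical, not taken from iw4x or mss32. The counters record how many threads were ever inside the critical section at once, which with a working mutex should never exceed one.

```cpp
#include <atomic>
#include <chrono>
#include <mutex>
#include <thread>

// Hypothetical stand-in for the lock guarding the sound resource.
std::mutex resource_mutex;
std::atomic<int> holders{0};      // threads currently inside the critical section
std::atomic<int> max_holders{0};  // highest value of `holders` ever observed
std::atomic<int> completed{0};    // threads that finished their work

// Hypothetical worker: lock, use the resource, unlock.
void use_resource() {
    // Blocks here while another thread holds the mutex.
    std::lock_guard<std::mutex> lock(resource_mutex);
    int now = ++holders;
    int seen = max_holders.load();
    // Record the peak number of simultaneous holders.
    while (now > seen && !max_holders.compare_exchange_weak(seen, now)) {}
    std::this_thread::sleep_for(std::chrono::milliseconds(10));  // simulate work
    --holders;
    ++completed;
}

// Two contending threads, standing in for the main thread and the stream worker.
void run_two_threads() {
    std::thread a(use_resource);
    std::thread b(use_resource);
    a.join();
    b.join();
}
```

With a healthy mutex both threads finish and the peak holder count stays at one. The dumps suggest the real game ends up in a state where neither thread ever gets past the lock.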
Why does this happen
I have no idea why or how this could be happening, as I have little experience with multithreading on the low-level side of things. It seems to me that it shouldn't be possible to cheat WaitForSingleObject the way it seems to happen unless there's some serious memory corruption, but again, I'm inexperienced in that matter.
Wild theories:
- The handle for the mutex that mss32 uses gets overwritten or cleared at some point, or its writing is delayed, creating a window of opportunity for two threads to lock the mutex at the same time (improbable, the mss32 address range is never referenced in iw4x code)
- This is a hardware bug that will happen no matter what depending on the kind of CPU you have (improbable as it seems specific to iw4x, but maybe bad hooks or too many of them create the right state to trigger that bug)
- iw4x alters data in soundaliases or msssounds in a way that makes Miles Sound System unable to handle them properly (I have yet to find where and how this could be happening)
- The data iw4x ships with is corrupted in some manner (highly improbable, the bug can be triggered on basemaps and iw4x's data is supposedly identical to steam data)
- Some zone forward-compatibility patch triggers it (improbable, they all check for zone version number and therefore should not execute on vanilla zones)
- ???
Solving this issue
Solving an issue whose cause hasn't been determined, and that can hardly be reproduced, is a very complicated task. In theory, maybe the mss mutex could be wrapped in a mutex of our own to ensure correct thread safety, but that is not guaranteed to work and would not solve the underlying issue. Replacing the mutex lock code from mss32 with ours is also not an option as far as we've tried, as this distorts the sounds and creates audible glitches due to the mss-managed handle having an unexpected value.
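To make the "mutex of our own" idea concrete, here is a minimal sketch of what such a wrapper might look like. It is an assumption-laden illustration, not iw4x code: g_sound_guard, with_sound_guard and the fake_* functions are made-up names, and the placeholder bodies only bump a counter instead of calling into mss32. The lock is recursive so that a guarded call could invoke another guarded call on the same thread without blocking itself.

```cpp
#include <mutex>
#include <thread>
#include <utility>

// Hypothetical outer lock owned by our code, not by mss32.
std::recursive_mutex g_sound_guard;

// Runs `fn` while holding the outer lock, so only one thread at a time
// can reach the sound library through this wrapper.
template <typename Fn>
auto with_sound_guard(Fn&& fn) -> decltype(std::forward<Fn>(fn)()) {
    std::lock_guard<std::recursive_mutex> lock(g_sound_guard);
    return std::forward<Fn>(fn)();
}

// Placeholders for the real engine routines; the bodies are invented.
int g_fake_sound_state = 0;  // only touched while g_sound_guard is held
void fake_stop_sound() { ++g_fake_sound_state; }
void fake_execute_stream_read() { ++g_fake_sound_state; }

// Two threads hammer the guarded calls, like the main thread and the
// stream worker would; returns the final counter value.
int run_guarded_demo() {
    std::thread main_like([] {
        for (int i = 0; i < 10000; ++i) with_sound_guard(fake_stop_sound);
    });
    std::thread stream_like([] {
        for (int i = 0; i < 10000; ++i) with_sound_guard(fake_execute_stream_read);
    });
    main_like.join();
    stream_like.join();
    return g_fake_sound_state;
}
```

A wrapper like this only serializes callers that actually go through it. Any path that reaches mss32 without taking the outer lock would be unaffected, and as said above, it would at best hide the underlying cause rather than fix it.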
Regardless, feel free to post any theory or idea about why this might be happening :)
Dumps
Due to the sheer size of the dumps, I cannot attach any here, but I will upload them if you ask me for them on iw4x's Discord.
* So far, this bug can not be reproduced ever on an AMD CPU.
Not true, managed to reproduce it for the first time.
I joined a full terminal server that had a lot of stuff happening at the same time, players were using killstreaks constantly and it was a very hectic match in general.
I activated the built-in lagometer to see if that indicated anything, and it showed frequent network-related lag. However, this was not packet loss but the network thread completely cutting out for a fraction of a second (indicated by the green graph disappearing). The lagometer showed no CPU/FPS-related lag, and that graph did not cut out along with the network thread either. After a few minutes the game completely locked up without me doing anything to provoke it.
What's the status of this issue @Rackover ?
Same as before, no evolution / no change / no testing / no investigation was conducted since then