PauseForCollect may cause the GC to freeze.
void PauseForCollect()
{
if (sgIsCollecting)
CriticalGCError("Bad Allocation while collecting - from finalizer?");
#ifndef HXCPP_SINGLE_THREADED_APP
volatile int dummy = 1;
mBottomOfStack = (int *)&dummy;
CAPTURE_REGS;
#ifdef VerifyStackRead
VerifyStackRead(mBottomOfStack, mTopOfStack)
#endif
mReadyForCollect.Set();
** THE COLLECT PROCESS STARTS NOW, AND IF IT FINISHES BEFORE IT REACHES THE NEXT LINE THE PROCESS IS STUCK **
mCollectDone.Wait();
#endif
}
My solution. Remove mCollectDone semaphore. Remove ReleaseFromSafe. Replace PauseFor Collect with this code:
void PauseForCollect()
{
if (sgIsCollecting)
CriticalGCError("Bad Allocation while collecting - from finalizer?");
#ifndef HXCPP_SINGLE_THREADED_APP
EnterGCFreeZone();
gThreadStateChangeLock->Lock();
ExitGCFreeZoneLocked();
gThreadStateChangeLock->Unlock();
#endif
}
I have a similar issue in GlobalAllocator::Collect. The semaphore is never released.
@sibest is there any pull request with your proposal?
It does seem neater if I can do away with the extra semaphore logic. Acquiring the gThreadStateChangeLock rather than waiting on the mCollectDone before resetting mReadyForCollect looks good I think
I had some trouble reproducing this one. Any hints? I tried putting a 'sleep' in the code, but it did not help. Can you reproduce this one on windows or mac? One thing I noticed that if PauseForCollect is called from a GCFreeZone, then mCollectDone will not be set. It is an error to call pauseForCollect from a gc-free-zone, so this might be the error. I should put in a software check for this.
I am able to reproduce it polling socket accept on a websocket server and creating new connections from browser with a cadence of one new connection opened each 100 ms. It hangs in accept before reaching 50 opened connections.
I had a look at the code and I found that in _hx_std_socket_accept the native accept() function is indeed called inside a GC free zone.
The problem is actually not with the GC itself:
- A thread wants to start a collection and awaits for the other threads to be in the safe zone (WaitForSafe)
- Another thread registered in the LocalAllocator pool is in a "Sleep" state, for example in a Loop awaiting for messages, but never do any allocation. The GC freezes because the Collector thread awaits indefinitely for the other thread to be in the safe zone.
This is what rarely happens with Sockets and CURL in my experience. In my opinion an easy solution would be to use WaitSeconds instead of Wait in Collect(), give it like 0.5s of time per LocalAllocator to enter the safe zone or rather ignore it.
When a thread enters the GC free zone it also sets the gThreadStateChange lock, so the implementation of _hx_std_socket_accept seems correct. Can you debug and when the program locks, do pause the debug to see where the execution is hanging ?
Anyways ovidiugabriel you may try to make the changes i suggested to remove the semaphores logic and see if it helps.
Immix.zip I've attached the modified Immix.cpp
The _hx_std_socket_accept looks ok I think - the way it is supposed to work is the thread enters the gc-free-zone and while waiting for accept, any number of collections may happen. These collections should ignore the thread waiting in the accept. Once a value comes in, the exit-gc-zone acquires to thread lock (so it can be sure it is not collecting) and then carries on. But, given there is an error, there might be some bug somewhere - possibly in the timing of the enter/exit zone calls. A stack trace showing the offending thread waiting for the CollectDone should show whether it is a zone violation.
I can try spamming enter/exit thread in the background to see if I can generate an error, otherwise I will look at the socket code.
@sibest, A thread in a "Sleep" state should enter a gc-free-zone before making a blocking call to avoid this situation. The hxcpp std lib calls should do this I think (eg, accept, sleep etc), but if an external library is used, some special attention is required - especially if the thread then makes a callback into haxe.
I was also surprised to see that an infinite wait is used (no timeout).
ovidiugabriel Can you try my attached code to see if anything changes?
@sibest thanks for the patch. Unfortunately it doesn't change the behavior. I can send you the code to reproduce the problem.
Yes please, thank you
If you have a timeout, the bug will be "I get the odd 5-second pause in my game", or "my cpu is using 10% while the code is doing nothing" or "if the collection takes longer than 2000ms, the program crashes". It is theoretically possible to have no pause and no wasted CPU, and this is what I'm aiming for. The main problem with this sort of thing is reproducing the issue, so anything you do to help in this regard would be great.
@sibest I uploaded the zip archive with the code https://we.tl/t-DrC0kZPGwN
I am building it by running ./build.sh with MinGW installed.
With this standalone example, it stops somewhere immediately after 250 opened connections.
The number 250+ is quite suspicious. This could be the limit of open file (socket) handles, so I think one of the socket calls is failing (accept?). It could be a bug in this failure case, where the standard library has a bug, or in the application where it does not handle this case well. On mac, it looks like one thread is waiting in the main Sleep, and the other in the listen sleep, which seems correct. So maybe it is a windows bug.
Nope, it's not a Windows bug. It's impossible that a failure of "accept" cause the process to hang. Plus it's getting stuck even if you .close() the socket after accepting it.
@hughsando It doesn't hang forever, I left my desk for almost 15 minutes, when I got back I found the thread running, of course it got stuck again soon after that.
What I mean by a problem with accept is that if there was a bug handling the gc-free-zone in this case, not with the accept call itself. I have a feeling the 255 limit is from the browser end - if you open a second browser process, do you get a fresh batch of connections?
@hughsando @sibest thanks
I have updated my html test (new version https://we.tl/t-pJbZjWc1c3 ) to close all the sockets and refresh the page when 250 connections are reached in the browser. For the test, the number I am watching is a counter in the haxe code that I keep incrementing for each accepted connection, it is printed in console.
I also printed the exit from accept and it seems that it is not blocking anymore. I think my issue was that I have used Sleep instead of ::Sys_obj::sleep in the main thread and this was the fix for the whole issue.
Small suggestion: I think the documentation at https://haxe.org/manual/target-cpp-ThreadsAndStacks.html should be updated a bit - or the topic should be expanded on multiple page - the GC is a big topic.
@ovidiugabriel Yes that is the point of my first comment. If you use a blocking/non GC-free call it will cause the GC to stuck. Sys.sleep is in GC-free zone.
The easiest answer here is to really just the use haxe functions. I understand that you have put together a demo so the architecture is maybe not what you would use in production, but this code would be written in haxe with no native cpp. For example, the haxe thread class is easier to use than the native one and would be my first choice. If you have some native library you are trying to hook into, you might be able to reverse the control - that is, the app is mostly haxe, but with a small call-out to a library, rather then the app being in cpp, with a call into haxe for the sockets.
@ovidiugabriel Yes that is the point of my first comment. If you use a blocking/non GC-free call it will cause the GC to stuck. Sys.sleep is in GC-free zone.
@sibest You are right, I was testing with your fix to Immix.cpp applied. But without this fix it is still working fine. My understanding is that in this case I actually have not managed to reproduce the issue you initially described. Have you managed to reproduce it using my example?
@hughsando In production, our app, is a C++ GUI app, we used haxe code linked as a DLL, for this reason it is crucial to understand how to correctly call haxe code from C++. Even if I am using haxe threads, the code that calls the haxe entry point function is executed inside a native thread anyway due to the architecture of our production app and we still need to obey the rules of using threads with GC.