entry method called from inside CkExit
The docs at https://charm.readthedocs.io/en/latest/charm++/manual.html#execution-model say "The Charm RTS ensures that no more messages are processed and no entry methods are called after a CkExit." However, there is a race condition: if CkExit is called on a pe other than pe 0, user entry methods may still be run by the CsdScheduler call nested inside CkExit (and therefore inside the user entry method that called it). Here is a simple test that demonstrates the issue (note the two CkExit messages from pe 1, with an entry method running between them):
diff --git a/tests/charm++/simplearrayhello/hello.C b/tests/charm++/simplearrayhello/hello.C
index 28506afc6..b609114b2 100644
--- a/tests/charm++/simplearrayhello/hello.C
+++ b/tests/charm++/simplearrayhello/hello.C
@@ -62,6 +62,10 @@ public:
else
//We've been around once-- we're done.
mainProxy.done();
+ if ( CkMyPe() ) {
+ CkPrintf("[%d] CkExit[%d] from element %d\n", CkMyPe(), hiNo, thisIndex);
+ CkExit();
+ }
}
};
jim@denver$./hello +p2
Charm++: standalone mode (not using charmrun)
Charm++> Running in Multicore mode: 2 threads (PEs)
Converse/Charm++ Commit ID: v7.1.0-devel-74-g2201bd839
Charm++ built without optimization.
Do not use for performance benchmarking (build with --with-production to do so).
Charm++ built with internal error checking enabled.
Do not use for performance benchmarking (build without --enable-error-checking to do so).
CharmLB> Load balancer assumes all CPUs are same.
Charm++> Running on 1 hosts (2 sockets x 6 cores x 2 PUs = 24-way SMP)
Charm++> cpu topology info is gathered in 0.002 seconds.
Running Hello on 2 processors for 5 elements
[0] Hello 0 created
[0] Hello 1 created
[0] Hello 2 created
[1] Hello 3 created
[1] Hello 4 created
[0] Hi[17] from element 0
[0] Hi[18] from element 1
[0] Hi[19] from element 2
[1] Hi[20] from element 3
[1] CkExit[20] from element 3
[1] Hi[21] from element 4
[1] CkExit[21] from element 4
[Partition 0][Node 0] End of program
There is a _discardHandler that replaces _charmHandlerIdx and _bocHandlerIdx, but the replacement doesn't happen until [Start]ExitMsg (pe 0) or ReqStatMsg is received by _exitHandler. I think the delay exists to support exit functions. What needs to happen is for CkExit to immediately install _discardHandler on the calling pe, then notify all pes via pe 0 to install _discardHandler, and wait for quiescence before starting any exit function.
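To make the ordering problem concrete, here is a toy model in plain C++ (not Charm++ code). ToyPe, exitWithoutEarlyDiscard and exitWithEarlyDiscard are hypothetical names invented for this sketch, and the `discarding` flag only stands in for "_discardHandler is installed on this pe". It assumes the nested scheduler simply drains whatever is queued on the calling pe.

```cpp
#include <cassert>
#include <queue>

// Toy model only: none of these names are Charm++ or Converse APIs.
struct ToyPe {
    bool discarding = false;   // stands in for "_discardHandler installed"
    std::queue<int> inbox;     // pending user messages (payload is irrelevant)
    int userEntriesRun = 0;    // messages delivered to user entry methods
    int discarded = 0;         // messages dropped by the discard handler

    // Drain the queue the way a nested scheduler loop would.
    void drain() {
        while (!inbox.empty()) {
            inbox.pop();
            if (discarding) {
                ++discarded;
            } else {
                ++userEntriesRun;
            }
        }
        assert(inbox.empty());
    }
};

// Ordering described in this issue: the nested scheduler runs first and
// the discard handler is only installed once the exit message arrives.
int exitWithoutEarlyDiscard(ToyPe& pe) {
    pe.drain();
    pe.discarding = true;
    return pe.userEntriesRun;
}

// Proposed ordering: install the discard handler on the calling pe
// before the nested scheduler gets a chance to run anything.
int exitWithEarlyDiscard(ToyPe& pe) {
    pe.discarding = true;
    pe.drain();
    return pe.userEntriesRun;
}
```

With two messages queued, the first function reports two user entry methods run after the exit request, while the second reports none and counts both messages as discarded. The cross-pe notification and the quiescence wait are deliberately left out of the sketch.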
There is still the possibility of a hang if CkExit() is called while holding a lock that an entry method on another pe is waiting for. The only way to handle that is to use std::lock_guard in user code and have CkExit() throw an exception rather than run a nested scheduler. It is much simpler to forbid calling CkExit() while holding a lock and to recommend CkAbort() for such cases.
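For reference, this is roughly what the exception-based alternative would rely on, as a standalone C++ sketch. ToyExitRequested, ToyLock, toyEntryMethod and lockIsFreeAfterExit are all hypothetical names for illustration; Charm++ does not provide an exception-throwing CkExit(). ToyLock just wraps std::mutex with a flag so the result can be checked without calling try_lock.

```cpp
#include <mutex>
#include <stdexcept>

// Hypothetical exception type; Charm++ does not define this.
struct ToyExitRequested : std::runtime_error {
    ToyExitRequested() : std::runtime_error("exit requested") {}
};

// Minimal lock with lock()/unlock() so std::lock_guard can manage it.
// The `held` flag exists only so the sketch can report the lock state.
struct ToyLock {
    std::mutex m;
    bool held = false;
    void lock() { m.lock(); held = true; }
    void unlock() { held = false; m.unlock(); }
};

ToyLock toyLock;

// Stand-in for a user entry method that requests exit while holding a lock.
void toyEntryMethod() {
    std::lock_guard<ToyLock> guard(toyLock);
    // Stack unwinding destroys `guard`, which releases toyLock.
    throw ToyExitRequested();
}

// Stand-in for the scheduler frame that would catch the exit request.
// Returns true if the lock has been released by the time it is caught.
bool lockIsFreeAfterExit() {
    try {
        toyEntryMethod();
    } catch (const ToyExitRequested&) {
        // A real runtime would begin shutdown here.
    }
    return !toyLock.held;
}
```

This only works if every lock in user code is held through an RAII guard, which is why forbidding CkExit() under a lock looks like the simpler rule.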