Out of Free Events Error Handling
Currently, when a simulation runs out of free event buffers, ROSS prints an error suggesting a larger --extramem= value and exits, so the user has to raise that parameter and restart from scratch.
Would it be possible to instead force an early GVT update at that point, reclaim stale events (fossil collection), and resume the simulation if that frees enough buffers?
There should probably still be a warning on stdout when this happens, so that users know why their simulation is slow if the forced GVT update fires very frequently. It could be an opt-in feature behind a command-line argument, for users who accept the risk that, if --extramem isn't set appropriately, their optimistic simulation may perform worse than a conservative one. Even so, that may be better than killing a simulation that has potentially been running for 10 hours.
There will need to be a check to see if the time since the last GVT is 0 to prevent the endless loop of "Out of events, perform GVT to recollect, still out of events, perform GVT to recollect..."
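To make the proposal concrete, here is a minimal sketch of the fallback in C. It is a toy model, not ROSS code: `pe_state`, `compute_gvt_and_collect`, and `ensure_free_event` are hypothetical names, the opt-in flag is assumed to be set from some command-line option, and "time since the last GVT is 0" is approximated by counting events processed since the last GVT.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical stand-in for per-PE simulator state (not ROSS's real structs). */
typedef struct {
    int  free_events;        /* buffers currently on the free list */
    int  reclaimable;        /* buffers a GVT pass could fossil-collect right now */
    long events_since_gvt;   /* progress made since the last GVT computation */
    bool force_gvt_enabled;  /* opt-in switch, assumed to come from the command line */
    long forced_gvt_count;   /* how many times the fallback has fired */
} pe_state;

/* Stand-in for a GVT computation followed by fossil collection. */
static void compute_gvt_and_collect(pe_state *pe)
{
    assert(pe != NULL);
    pe->free_events += pe->reclaimable;
    pe->reclaimable = 0;
    pe->events_since_gvt = 0;
}

/* Returns true if a free buffer is available (possibly after a forced GVT),
 * false if the caller should fall back to the existing fatal error. */
static bool ensure_free_event(pe_state *pe)
{
    assert(pe != NULL);
    if (pe->free_events > 0)
        return true;
    if (!pe->force_gvt_enabled)
        return false;               /* feature is opt-in: keep today's behavior */
    if (pe->events_since_gvt == 0)
        return false;               /* nothing happened since the last GVT, so
                                       another GVT cannot free anything: stop here
                                       instead of looping forever */
    pe->forced_gvt_count++;
    printf("warning: out of free events, forcing GVT (%ld forced so far); "
           "consider increasing --extramem\n", pe->forced_gvt_count);
    compute_gvt_and_collect(pe);
    return pe->free_events > 0;
}
```

The caller would try `ensure_free_event` before reporting the out-of-events error and abort only when it returns false. The running count in the warning is there so that a user can see when the fallback is firing often enough to dominate run time.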
Thanks Neil;
Yes - we could. However, what we need to implement is the Cancelback Protocol which does exactly as you say - it enables the reclaiming of event memory that has been optimistically scheduled. See the attached IEEE TPDS paper from 1997 by Das and Fujimoto :-).
The events you have to keep are those that were scheduled before the current GVT but will not be executed until after it.
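As a rough illustration of that rule only (this is not the full Cancelback protocol and not ROSS code), an event can be classified by comparing its send and receive timestamps with GVT. The `event` struct, the `ev_class` enum, and `classify_event` are hypothetical names, and treating a send time exactly equal to GVT as "must keep" is a conservative assumption.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical event record: when it was scheduled and when it executes. */
typedef struct {
    double send_ts;  /* virtual time at which the sender scheduled the event */
    double recv_ts;  /* virtual time at which the receiver executes it */
} event;

typedef enum {
    EV_FOSSIL,               /* executes before GVT: ordinary fossil collection */
    EV_MUST_KEEP,            /* scheduled at or before GVT, executes at or after it */
    EV_CANCELBACK_CANDIDATE  /* scheduled after GVT: the sender can still roll
                                back and re-send, so the buffer may be reclaimed */
} ev_class;

static ev_class classify_event(const event *ev, double gvt)
{
    assert(ev != NULL);
    assert(ev->send_ts <= ev->recv_ts);  /* events are never scheduled into the past */
    if (ev->recv_ts < gvt)
        return EV_FOSSIL;
    if (ev->send_ts <= gvt)
        return EV_MUST_KEEP;             /* sender will never roll back this far */
    return EV_CANCELBACK_CANDIDATE;
}
```

With GVT at 5.0, an event sent at 4.0 for time 7.0 has to be kept, because no rollback can ever regenerate it, while one sent at 6.0 for time 9.0 is optimistic work that could be handed back to its sender.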
I believe the LLNL folks may have implemented a form of Cancelback in their branched version of ROSS. At the very least it has a user-level event retraction capability, which might be useful for implementing Cancelback. We can touch base with them on the status of their implementation.
thanks again!!, Chris
On Fri, Jan 3, 2020 at 3:37 PM Neil McGlohon [email protected] wrote:
--
Christopher D. Carothers
Director, Center for Computational Innovations
Professor, Department of Computer Science
Rensselaer Polytechnic Institute
110 8th Street
Troy, New York 12180-3590
e-mail: [email protected]
web page: www.cs.rpi.edu/~chrisc
phone: (518) 276-2930
fax: (518) 276-4033
We (here at LLNL) are looking at lazy rollback. But we would be very interested in a Cancelback implementation if you wanted to tackle that @nmcglohon 😄
I could probably knock it out not too long after my next paper deadline. Assigning to myself.