ROSS Enhancement request: event cancellation/deferral/delay

The following use case has come up for me a few times in different contexts, so I thought I'd document the example here and discuss potential API solutions. In general, the problem I'm looking at is issuing and recalling speculative events that you may or may not want to execute, depending on factors you don't know ahead of time.

Let's say I'm doing some sort of heartbeat protocol in which, if I don't hear from some process in x units of time, I consider that process dead. So, in the heartbeat process I issue an event representing a timeout x units in the future for each process in the protocol. Now, the DES logic to handle the case where I do hear back from the server is a bit awkward: I need to process the event, hold onto the time the process responded to me and reissue timeout events for the new time, and in the timeout event check against that time and reissue if necessary. If x is not a tight bound w.r.t. the rate of contact, then the event queue fills up with a bunch of effectively dead events. A cleaner solution would involve being able to do some simple manipulations of the event queue. Namely, one (or more) of the following:

tw_event_cancel(tw_event *e); // removes previously sent event e from the queue
tw_event_reschedule(tw_event *e, tw_stime new_offset); // reissues event e for time new_offset in the future

With this change, the logic for resetting the heartbeat can reside with the logic that causes the resetting, without upping the event population significantly and having to process defunct events.
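A minimal sketch of how the heartbeat reset might look with calls of this shape. Everything here is a toy stand-in, not ROSS code: `sk_event`, `sk_event_send`, `sk_event_cancel`, `sk_event_reschedule`, and the "peer B leaves cleanly" scenario are hypothetical names and assumptions chosen only to mirror the two proposed signatures.

```c
#include <assert.h>
#include <stddef.h>

/* Toy event record and queue; hypothetical, not the ROSS tw_event. */
struct sk_event { double ts; int live; };

#define SK_MAX 16
static struct sk_event sk_queue[SK_MAX];
static size_t sk_len = 0;
static double sk_now = 0.0;   /* current simulated time */

/* Schedule a self event `offset` units in the future; the caller keeps the pointer. */
struct sk_event *sk_event_send(double offset) {
    assert(sk_len < SK_MAX);
    struct sk_event *e = &sk_queue[sk_len++];
    e->ts = sk_now + offset;
    e->live = 1;
    return e;
}

/* Analogue of the proposed tw_event_cancel: the event will never execute. */
void sk_event_cancel(struct sk_event *e) { e->live = 0; }

/* Analogue of the proposed tw_event_reschedule: move e to now + new_offset. */
void sk_event_reschedule(struct sk_event *e, double new_offset) {
    e->ts = sk_now + new_offset;
    e->live = 1;
}

/* Number of events that would still execute. */
size_t sk_pending(void) {
    size_t n = 0;
    for (size_t i = 0; i < sk_len; i++) n += (size_t)sk_queue[i].live;
    return n;
}

/* Two monitored peers with x = 10. Peer B leaves cleanly at t = 2 (cancel);
 * peer A sends heartbeats at t = 4, 8, 12 (reschedule each time).
 * Returns the time at which A's timeout would finally fire. */
double sk_demo(size_t *pending_after_cancel) {
    const double x = 10.0;
    const double beats[] = { 4.0, 8.0, 12.0 };
    sk_len = 0;
    sk_now = 0.0;
    struct sk_event *timeout_a = sk_event_send(x);
    struct sk_event *timeout_b = sk_event_send(x);

    sk_now = 2.0;
    sk_event_cancel(timeout_b);
    *pending_after_cancel = sk_pending();

    for (size_t i = 0; i < sizeof beats / sizeof beats[0]; i++) {
        sk_now = beats[i];
        sk_event_reschedule(timeout_a, x);   /* the whole reset is this one call */
    }
    return timeout_a->ts;
}
```

In this toy the heartbeat handler's reset is a single call and the number of pending timeout events never grows. Rollback under optimistic execution is not modeled here.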

May 01 '15 20:05 JohnPJenkins

Random comment: I can think of one possible workaround.

When you issue the timeout event, store (in your state struct) a variable indicating the target timestamp value when that timeout should occur. If you get a heartbeat event, then you modify the timestamp value in the state, but you do not issue another timeout event or modify the one in progress. The old one remains in flight.

When the timeout event triggers, you compare tw_now() against the timestamp value in the state. If they match, the timeout has triggered (because no heartbeat arrived to set a new target timeout value). If they don't match, then the timeout condition has not occurred yet: you reissue the timeout event to occur at the new target timestamp value.

Basically you keep exactly one timeout event in flight at all times, and when it pops you check to see if it is still valid or not. It will pop more times than strictly needed, but it is a local event, and it won't explode beyond one timeout per heartbeat server.
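The workaround above can be sketched as a small standalone program. This is a toy event loop under stated assumptions, not ROSS code: `hb_event`, `hb_schedule`, `hb_pop_earliest`, `hb_state`, and `hb_run_demo` are hypothetical names, and the `now` argument stands in for what tw_now() would report inside a real handler.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-ins for a DES kernel; none of this is ROSS API. */
enum hb_type { HB_HEARTBEAT, HB_TIMEOUT };
struct hb_event { double ts; enum hb_type type; };

#define HB_MAX 64
static struct hb_event hb_queue[HB_MAX];
static size_t hb_len = 0;

static void hb_schedule(double ts, enum hb_type type) {
    assert(hb_len < HB_MAX);
    hb_queue[hb_len++] = (struct hb_event){ ts, type };
}

/* Remove and return the earliest pending event (linear scan; fine for a toy). */
static struct hb_event hb_pop_earliest(void) {
    size_t best = 0;
    for (size_t i = 1; i < hb_len; i++)
        if (hb_queue[i].ts < hb_queue[best].ts) best = i;
    struct hb_event e = hb_queue[best];
    hb_queue[best] = hb_queue[--hb_len];
    return e;
}

struct hb_state {
    double timeout_len;   /* x: allowed silence before the peer counts as dead */
    double deadline;      /* target timestamp stored in the state */
    int    peer_dead;
    int    timeout_pops;  /* how many times the timeout handler ran */
};

/* Heartbeat: only move the deadline; the in-flight timeout is left alone. */
static void hb_on_heartbeat(struct hb_state *s, double now) {
    s->deadline = now + s->timeout_len;
}

/* Timeout: valid only if the deadline has not moved; otherwise reissue.
 * The text compares for equality; >= is used here as a defensive choice. */
static void hb_on_timeout(struct hb_state *s, double now) {
    s->timeout_pops++;
    if (now >= s->deadline) s->peer_dead = 1;
    else hb_schedule(s->deadline, HB_TIMEOUT);
}

/* x = 10, heartbeats at t = 4, 8, 12, then silence. Returns the time of death. */
double hb_run_demo(struct hb_state *s) {
    hb_len = 0;
    *s = (struct hb_state){ .timeout_len = 10.0, .deadline = 10.0 };
    hb_schedule(s->deadline, HB_TIMEOUT);
    hb_schedule(4.0, HB_HEARTBEAT);
    hb_schedule(8.0, HB_HEARTBEAT);
    hb_schedule(12.0, HB_HEARTBEAT);
    double now = 0.0;
    while (hb_len > 0 && !s->peer_dead) {
        struct hb_event e = hb_pop_earliest();
        now = e.ts;
        if (e.type == HB_HEARTBEAT) hb_on_heartbeat(s, now);
        else hb_on_timeout(s, now);
    }
    return now;
}
```

In this run the timeout handler pops at t = 10 and t = 18 only to reissue itself, then declares the peer dead at t = 22, so three pops for three heartbeats, with never more than one timeout event in flight.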

May 03 '15 14:05 carns

You are right, that could certainly be done for the timeout use-case.

May 04 '15 13:05 JohnPJenkins

For reference, OMNeT++ (another DES system) allows for canceling and rescheduling of self-scheduled events only. Would limiting the canceling/rescheduling of events to only self-scheduled events work for you? It seems pretty straightforward to implement the API functions you described... it would just require the sending LP to keep a pointer to the event.

May 04 '15 21:05 gonsie

That sounds like a reasonable restriction. I haven't yet encountered a case where we'd want to do this between two distinct LPs.

May 04 '15 21:05 JohnPJenkins

This feature has been requested (and discussed) during several meetings with people from LLNL.

We've decided that event retraction is essentially solved by a tie-breaking mechanism where event retractions are seen as a model-level event. Therefore, the implementation of event retraction will not happen in ROSS core. Instead, the plan is for a new library to be implemented which will help model developers with the bookkeeping needed for event retractions.

Obviously, there is, as of yet, no tie-breaking mechanism in ROSS. This feature will be developed soon.

May 16 '16 22:05 gonsie

Sorry, got distracted before responding to these...

Can you clarify a bit what the mechanism would look like? For, say, the timeout example, would the flow look like:

1. send self event with timeout X (timeout_val = now()+X)
2. get message from external thing invalidating timeout
3. send another self event for time timeout_val-now() (danger: may not be exactly the same number as the timeout event, how would this get handled?)
4. get a callback that says these two events tied, check that they are timeout and "anti-timeout" messages, tell ROSS not to execute them.

?

Is there a simpler alternative implementation for self events, in which there's a retract flag in the tw_event struct and a function call to set it? Then modelers just cache the generated event pointer and call retract as necessary, while ROSS just checks the flag prior to calling the (forward/reverse) event callback. Then you don't have to mess with the data structures. Of course, it's up to the model developer to understand the lifetime of the event pointer.
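A rough sketch of the retract-flag idea, assuming a toy scheduler rather than ROSS itself. The `toy_event` struct, its `retracted` field, and the `toy_send_self`, `toy_retract`, and `toy_run` functions are all hypothetical; the real tw_event struct has no such flag as far as this thread establishes. Only the forward path is modeled; reverse handlers and rollback are left out.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical event record with a retract flag. */
struct toy_event {
    double ts;
    int    retracted;  /* set by the sender, checked by the scheduler */
    int    payload;
};

#define TOY_MAX 32
static struct toy_event toy_pool[TOY_MAX];
static size_t toy_count = 0;

/* Schedule a self event and hand back a pointer the model can cache.
 * In this toy the pointer stays valid until toy_run() returns. */
struct toy_event *toy_send_self(double ts, int payload) {
    assert(toy_count < TOY_MAX);
    struct toy_event *e = &toy_pool[toy_count++];
    *e = (struct toy_event){ .ts = ts, .retracted = 0, .payload = payload };
    return e;
}

/* Mark an event as retracted; the queue itself is not touched. */
void toy_retract(struct toy_event *e) { e->retracted = 1; }

/* Visit all events in timestamp order. Retracted events are dropped without
 * calling the handler. Returns the number of handler invocations. */
int toy_run(int *payload_sum) {
    int executed = 0;
    int done[TOY_MAX] = {0};
    for (size_t n = 0; n < toy_count; n++) {
        size_t best = TOY_MAX;   /* sentinel: no candidate chosen yet */
        for (size_t i = 0; i < toy_count; i++)
            if (!done[i] && (best == TOY_MAX || toy_pool[i].ts < toy_pool[best].ts))
                best = i;
        done[best] = 1;
        if (toy_pool[best].retracted) continue;   /* flag check before the callback */
        *payload_sum += toy_pool[best].payload;   /* stands in for the forward handler */
        executed++;
    }
    toy_count = 0;
    return executed;
}
```

The retracted event still occupies a queue slot until its timestamp comes up, so this trades queue population for simplicity: no removal from the underlying data structure is needed.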

May 27 '16 15:05 JohnPJenkins