Node icon indicating copy to clipboard operation
Node copied to clipboard

Break the habit with acceptance of waiting recorders to time out

Open bertllll opened this issue 1 year ago • 0 comments

I have a reasoned assumption that there are tests in the codebase which make use of time-outs that originally were intended just for the state of emergency, when beyond this point the test would be considered hanging. These timeouts were implemented through the special auxiliary Actor called SystemKiller. This concept is probably used at also some other places, and it should be dealt with across the board, but here I'm talking mainly about the specific case when a test is set to be ended by so called StopConditions.

As you might know, the Actor system never stops unless it's asked to. It had caused a lot of issues in writing good tests because only System::current().stop() wouldn't serve us always, as some part of the planned actions in the Act mightn't have been carried out fully. Therefore the test wasn't reliable because of races of threads. The StopConditions allow specifying a message or a set of them for a recorder impersonating one of our normal Actors, these are then the conditions of stopping the system as soon as the recorder receives all the conditioned messages.

Because there is always a chance that the recorder doesn't receive them as the tested code might work differently than the developer thought it would - which points probably to a design issue - we put in also the SystemKiller, ensuring no more than 10 seconds for waiting if the messages would arrive.

The problem is we exploit this security measure sometimes. For example, I've seen a test where the requirement was to prove that a certain was NOT sent. Who has ever thought about these tests thoroughly must know it's one of the trickiest requirements possible. There isn't a good recommended approach solving all such difficulties of the person writing tests like these.

This card originally came to existence with some bigger aspiration (of setting helpers for asserting on messages that don't happen) but I later realized that a general solution would be probably an overkill, as I knew how hard designing and implementing it could be. (For example, holistically approached and with really a clear supervision over the activity, we would have to have some smart wrappers around the web3 functions, perhaps the recipients only. Which I find quite a big change.).

Main Goal: The primary issue is that timeouts are easily overlooked because they're too polite and hidden, but we actually want tests to terminate as failures. In fact, we may not even need any services from the SystemKiller - a standard panic triggered after the chosen delay would suffice. I'm convinced that whether we use the special Actor approach or not, we need to handle this with a panic. Expected Impact: Once the above suggested changes are deployed, I suspect running our full test suite won't be possible anymore. I have indications that at least two tests will fail. Next Steps: We should then review these failing tests and try to fix them in a more appropriate manner. Discussion Point: If we encounter tests that try to verify an unsent message, for now I would only recommend having a potentially fruitful discussion about this phenomenon. Finding an effective solution would provide us with an approach we've wanted many times but never managed to define for our already rich mental toolkit for writing code in Rust and Actix.

bertllll avatar Nov 12 '24 01:11 bertllll