Add timeout to RunSettings

Open al-rigazzi opened this issue 2 years ago • 3 comments

Description

A timeout should be added to the Experiment.start call to limit the maximum execution time of a manifest. The timeout should be specified in seconds, possibly as an int value of the block argument. Specifying True should result in uncapped execution time, whereas False would retain its current meaning, with block=0 as a synonym. Values < 0 should not be accepted.
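To make the proposed semantics concrete, here is a minimal sketch of how a bool-or-int block value could be mapped to a timeout. The helper name normalize_block and its return convention (None for uncapped) are assumptions made for illustration, not existing SmartSim API.

```python
from typing import Optional, Union


def normalize_block(block: Union[bool, int]) -> Optional[int]:
    """Hypothetical helper: map `block` to a timeout in seconds.

    Returns None for uncapped blocking (True), 0 for non-blocking
    (False or 0), and the value itself for a capped wait.
    """
    # bool is a subclass of int in Python, so it must be checked first;
    # otherwise True would be read as a 1-second timeout.
    if isinstance(block, bool):
        return None if block else 0
    if not isinstance(block, int):
        raise TypeError(f"block must be bool or int, got {type(block).__name__}")
    if block < 0:
        raise ValueError("block must be >= 0")
    return block
```

The bool check coming before the int check is the one subtle point: without it, block=True and block=1 would be indistinguishable.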

Justification

The blocking start call is normally used to run applications and wait for their outcome before moving to the next script line. Sometimes, though, applications can get stuck, or the workload manager could take too long to respond to a batch submission. This results in wasted compute hours for the user, or, sometimes, CI/CD runs timing out without exporting results.

Implementation Strategy

  • [ ] Make block accept both bool and int values
  • [ ] Add a test which checks that a Model is killed when time expires
  • [ ] Add a (very large but not infinite) timeout to current blocking calls which could hang in tests (such as those interacting with the WLM)
  • [ ] Document the new API
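As a rough illustration of the second and third items, a blocking wait with a deadline could look like the sketch below. The function wait_with_timeout and the is_done and kill callables are hypothetical stand-ins for whatever status check and stop mechanism the real implementation would use.

```python
import time
from typing import Callable, Optional


def wait_with_timeout(
    is_done: Callable[[], bool],
    kill: Callable[[], None],
    timeout: Optional[float],
    poll_interval: float = 0.01,
) -> bool:
    """Hypothetical sketch: poll until done, or kill once `timeout` expires.

    `timeout=None` means wait without a cap. Returns True if the work
    finished on its own, False if it was killed because time ran out.
    """
    # time.monotonic() is unaffected by system clock adjustments.
    deadline = None if timeout is None else time.monotonic() + timeout
    while not is_done():
        if deadline is not None and time.monotonic() >= deadline:
            kill()
            return False
        time.sleep(poll_interval)
    return True
```

A test for the "Model is killed when time expires" item could pass a status check that never reports completion, use a short timeout, and assert that the kill callable was invoked.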

al-rigazzi avatar Oct 09 '23 17:10 al-rigazzi

Matt E: As an alternative, maybe consider whether RunSettings should have this.
Matt D: Have block be an integer so we don't have two optional parameters

mellis13 avatar Nov 03 '23 17:11 mellis13

@al-rigazzi Rework ticket to include discussion of ideal solution.

mellis13 avatar Nov 03 '23 17:11 mellis13

@mellis13 RunSettings could in principle have a timeout parameter, but I'm afraid it might confuse users, who could read it as a synonym of walltime or time in BatchSettings-derived classes.

al-rigazzi avatar Jan 12 '24 16:01 al-rigazzi