experiments

Abstraction to help interpret job failures (OOM, exit code, etc.)

Open nqn opened this issue 7 years ago • 0 comments

As an optimizer author, I would like an abstraction that helps me interpret the status of jobs and, for failed jobs, gives a higher-level indication of why the job failed: OOM, image failure, or a non-zero exit code. The cause determines the strategy for scheduling further jobs. An image failure will most likely recur for all subsequent jobs, so the optimizer should exit. An OOM suggests that a certain parameter subspace may be infeasible to run and should be avoided. A non-zero exit code may indicate faulty code and should be reported to the user as well.
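
A minimal sketch of what such an abstraction could look like, not an existing API of this project. All names here (`JobStatus`, `FailureKind`, `Action`, `classify`, `next_action`) are hypothetical. It assumes the platform reports a reason string per failed job, and uses `OOMKilled`, `ErrImagePull` and `ImagePullBackOff` as example reason values on the assumption that jobs run as Kubernetes containers. Reporting unknown failures to the user is also an assumption, since the request above does not cover that case.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class FailureKind(Enum):
    NONE = "none"
    OOM = "oom"
    IMAGE = "image"
    EXIT_CODE = "exit_code"
    UNKNOWN = "unknown"


class Action(Enum):
    CONTINUE = "continue"              # keep scheduling jobs as usual
    AVOID_SUBSPACE = "avoid_subspace"  # skip nearby parameter settings
    REPORT_TO_USER = "report_to_user"  # surface the failure to the user
    ABORT = "abort"                    # stop the optimizer entirely


@dataclass
class JobStatus:
    """Hypothetical, simplified view of a finished job."""
    succeeded: bool
    reason: Optional[str] = None     # platform-reported reason, if any
    exit_code: Optional[int] = None  # container exit code, if any


# Assumed reason strings; adjust to whatever the platform actually reports.
OOM_REASONS = {"OOMKilled"}
IMAGE_REASONS = {"ErrImagePull", "ImagePullBackOff"}


def classify(status: JobStatus) -> FailureKind:
    """Map a raw job status to a higher-level failure kind."""
    if status.succeeded:
        return FailureKind.NONE
    if status.reason in OOM_REASONS:
        return FailureKind.OOM
    if status.reason in IMAGE_REASONS:
        return FailureKind.IMAGE
    if status.exit_code is not None and status.exit_code != 0:
        return FailureKind.EXIT_CODE
    return FailureKind.UNKNOWN


def next_action(kind: FailureKind) -> Action:
    """Strategy described in this issue, one action per failure kind."""
    if kind is FailureKind.NONE:
        return Action.CONTINUE
    if kind is FailureKind.IMAGE:
        return Action.ABORT           # likely wrong for all subsequent jobs
    if kind is FailureKind.OOM:
        return Action.AVOID_SUBSPACE  # subspace may be infeasible to run
    # Non-zero exit codes may indicate faulty code; unknown failures are
    # treated the same way here (an assumption).
    return Action.REPORT_TO_USER
```

An optimizer loop could then call `next_action(classify(status))` after each job and branch on the result, rather than inspecting platform-specific status fields directly.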

nqn, May 02 '18 20:05