
Implement GPU performance checker

Open · maximumcats opened this issue 5 years ago · 1 comment

We should have logic in the code that detects when we are likely to have poor GPU performance, and aborts the run if so. The simplest check is whether all of the GPU memory is allocated. The user would get a warning message saying they should restart the run with more GPUs (or run a smaller problem). If the user really wants to run this way, we'll implement an "expert mode" runtime option that allows them to override this constraint.
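A minimal sketch of what that check could look like, assuming the free and total device memory are already known (on CUDA they could come from `cudaMemGetInfo`). The names `GpuMemoryPolicy`, `GpuMemoryStatus`, `check_gpu_memory`, and the 2% threshold are hypothetical, chosen only for illustration; they are not existing Castro or AMReX identifiers.

```cpp
#include <cstddef>

// Hypothetical policy knobs; these are NOT existing Castro runtime parameters.
struct GpuMemoryPolicy {
    double min_free_fraction = 0.02;  // assumed threshold: below 2% free counts as "all allocated"
    bool expert_mode = false;         // assumed override flag for users who accept the slowdown
};

enum class GpuMemoryStatus { Ok, WarnOnly, Abort };

// Decide what to do given the free and total device memory in bytes.
// The caller is responsible for querying the device for these values.
inline GpuMemoryStatus check_gpu_memory(std::size_t free_bytes,
                                        std::size_t total_bytes,
                                        const GpuMemoryPolicy& policy)
{
    if (total_bytes == 0) {
        return GpuMemoryStatus::Ok;  // no device information, nothing to judge
    }

    const double free_fraction =
        static_cast<double>(free_bytes) / static_cast<double>(total_bytes);

    if (free_fraction >= policy.min_free_fraction) {
        return GpuMemoryStatus::Ok;
    }

    // Memory is effectively exhausted: warn only in expert mode, otherwise stop.
    return policy.expert_mode ? GpuMemoryStatus::WarnOnly : GpuMemoryStatus::Abort;
}
```

Keeping the decision separate from the device query means the policy can be tested without a GPU, and the caller decides how `Abort` is actually carried out.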

The only subtlety here is how to actually stop the run. Should it be a hard crash (i.e. amrex::Error()) or a graceful stop (i.e. allow the timestep to complete, and write a checkpoint)? Many of us use job scripts that chain jobs at HPC centers. How can we implement this so that those job scripts can easily detect that a run was stopped for this reason, and stop chaining?
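One possible way to make the stop detectable by a job script, sketched under assumptions: leave a sentinel file in the run directory and exit with a distinct nonzero code. The helper `write_stop_sentinel` and the value of `kGpuMemoryStopExitCode` are hypothetical and not part of Castro or AMReX.

```cpp
#include <fstream>
#include <string>

// Hypothetical exit code reserved for "stopped because GPU memory was exhausted".
// Any nonzero value the job scripts agree on would work.
constexpr int kGpuMemoryStopExitCode = 42;

// Write a one-line sentinel file explaining why the run stopped.
// Returns true if the file was written successfully.
inline bool write_stop_sentinel(const std::string& path, const std::string& reason)
{
    std::ofstream out(path);
    if (!out) {
        return false;  // could not create the file (e.g. directory missing)
    }
    out << reason << '\n';
    out.flush();
    return out.good();
}
```

After the final checkpoint is written, the code would call `write_stop_sentinel` and return `kGpuMemoryStopExitCode` from `main`. A chaining script can then test for either the file or the exit code before submitting the next job, which works for both the hard-crash and the graceful variants.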

maximumcats · Feb 29 '20 21:02

I think that we effectively have this now, since we abort if we go over memory.

zingale · Jul 22 '22 15:07