[EPIC] ErrorManager
Main Goal
The goal of this EPIC is to add a component to GEOS that centralizes and manages errors (and exceptions), provides structured error data, produces clear & comprehensive error outputs suitable for everyone (users / devs), and defines a policy regarding errors and exceptions.
Issues in this EPIC
- [ ] 1. Complete the errors unit test (which must test every type of error GEOS can encounter)
- Numeric errors, memory overflow, IO errors,
- Stacked exceptions, exception / error while catching an exception
- MPI errors
- ...
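For the "stacked exceptions" case above, a unit test could assert that context added while rethrowing actually survives the stacking. The sketch below uses standard C++ (`std::throw_with_nested` / `std::rethrow_if_nested`); the function names are hypothetical, not the GEOS API.

```cpp
#include <exception>
#include <stdexcept>
#include <string>

// Hypothetical failing step: a low-level error is caught and rethrown with
// higher-level context stacked on top of it.
void solverStep()
{
  try
  {
    throw std::runtime_error( "matrix is singular" );
  }
  catch( ... )
  {
    // Stacks the in-flight exception under a new, more contextual one.
    std::throw_with_nested( std::runtime_error( "time step 12 failed" ) );
  }
}

// Flatten a (possibly nested) exception chain into one string, outermost first.
std::string flattenMessages( std::exception const & e )
{
  std::string out = e.what();
  try
  {
    std::rethrow_if_nested( e );
  }
  catch( std::exception const & inner )
  {
    out += " | " + flattenMessages( inner );
  }
  return out;
}

std::string runAndCapture()
{
  try
  {
    solverStep();
  }
  catch( std::exception const & e )
  {
    return flattenMessages( e );
  }
  return "";
}
```

A test can then check that both the outer context and the original cause are present in the flattened message.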
- [ ] 2. Create the ErrorManager class, which:
    - Provides a centralized point to throw and manage the GEOS errors / exceptions,
    - Is based on structured error data rather than only texts,
    - Must be reliable,
    - Produces clear console outputs (not necessarily exhaustive; the level of detail depends on the user type),
    - Produces a generated error data file that contains all error data (JSON format? One per rank, grouped in a sub-folder?),
    - Has only `GEOS_HOST` methods, to ensure that only CPUs can throw / manage errors.
The error data structure can contain:
- Error message,
- timestamp,
- Location in the code,
- Group / Wrapper that sent the message, if applicable (name + xml location / path in hierarchy),
- TimeStep, convergence step and converged attribute,
- MPI rank,
- Parent exception data,
- … (don't hesitate to suggest more data)
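As a starting point for discussion, the structured record above could look like the sketch below. All field names are illustrative (not an actual GEOS API), and a real implementation would likely use a proper JSON library rather than hand-rolled serialization.

```cpp
#include <sstream>
#include <string>

// Hypothetical sketch of the structured error record listed above;
// field names are illustrative, not the actual GEOS API.
struct ErrorData
{
  std::string message;
  std::string timestamp;
  std::string codeLocation;  // e.g. "solver.cpp:42"
  std::string wrapperPath;   // XML location / path in the Group hierarchy
  int timeStep = -1;
  int mpiRank = -1;
};

// Minimal hand-rolled JSON output, just to show the shape of the
// per-rank error data file; a JSON library would do this for real.
std::string toJson( ErrorData const & e )
{
  std::ostringstream os;
  os << "{\"message\":\"" << e.message << "\","
     << "\"timestamp\":\"" << e.timestamp << "\","
     << "\"location\":\"" << e.codeLocation << "\","
     << "\"wrapperPath\":\"" << e.wrapperPath << "\","
     << "\"timeStep\":" << e.timeStep << ","
     << "\"mpiRank\":" << e.mpiRank << "}";
  return os.str();
}
```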
- [ ] 3. Factorize errors that come from multiple ranks, either synchronously or by postprocessing the generated error data files.
The goal here is to solve this classic problem: suppose GEOS ran on 2048 ranks, and rank 407 threw an error because of a local issue. Then ranks 203, 358, 1017 and 1502 threw another error because of ghosting cells, and all the other ranks sent MPI_ABORT errors. In this situation, we can only hope that everything is output in that order in the log, but it is not guaranteed.
The solution I would like to propose is to process the error data files either:
a) If possible, when a crash occurs, rank 0 collects & factorizes the error data files from the other ranks and outputs the result to stdout,
b) After the complete GEOS shutdown, by launching geos or a dedicated executable / script on the generated error data files folder.
Because of HPC considerations, method a) could be enabled by a command line parameter.
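The factorization step itself is mostly a grouping problem: collapse identical messages from many ranks into one entry with a rank list, so the 2048-rank scenario above prints a handful of lines instead of interleaved logs. A minimal host-side sketch (hypothetical names, no MPI or file IO shown):

```cpp
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical postprocessing step: group identical error messages coming
// from many ranks, so each distinct message is reported once together with
// the sorted list of ranks that emitted it (std::map keeps insertion sorted
// by message; ranks keep their input order).
std::map< std::string, std::vector< int > >
factorizeErrors( std::vector< std::pair< int, std::string > > const & perRankErrors )
{
  std::map< std::string, std::vector< int > > grouped;
  for( auto const & rankAndMessage : perRankErrors )
  {
    grouped[ rankAndMessage.second ].push_back( rankAndMessage.first );
  }
  return grouped;
}
```

In practice the input would come from parsing the per-rank error data files (method b) or from messages gathered on rank 0 (method a).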
- [ ] 4. Properly manage TPL errors,
    - those managed with the `GEOS_LAI_CHECK_ERROR()` define
    - CUDA errors (`GEOS_HYPRE_CHECK_DEVICE_ERRORS()`, `cudaGetLastError()`)
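One possible shape for these checks is to funnel every TPL return code through a single throwing helper, so the ErrorManager gets one entry point for all TPL failures. The sketch below is a plain-C++ stand-in with hypothetical names; a real version would attach the structured error data (rank, location, etc.) and the CUDA variant would wrap `cudaGetLastError()` the same way.

```cpp
#include <stdexcept>
#include <string>

// Hypothetical single entry point for TPL return-code failures; a real
// version would hand structured error data to the ErrorManager instead of
// throwing a bare runtime_error.
inline void checkTplError( int const errorCode, char const * call,
                           char const * file, int const line )
{
  if( errorCode != 0 )
  {
    throw std::runtime_error( std::string( call ) + " failed with code "
                              + std::to_string( errorCode )
                              + " (" + file + ":" + std::to_string( line ) + ")" );
  }
}

// Macro so the call site, file and line are captured automatically.
#define CHECK_TPL_ERROR( call ) checkTplError( ( call ), #call, __FILE__, __LINE__ )
```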
- [ ] 5. All errors from the unit test must be properly interfaced with Python / `pygeos`
- [ ] 6.1. Add a section in the documentation to describe "How to generate an error / an exception". What is acceptable and what is not in the GEOS code.
The following practices are banned :
- Recovering from an exception. Exceptions can only be caught by higher functions in the call-stack to add more information to them (and potentially stack exceptions).
- Throwing any error / exception or writing any log from a `GEOS_HOST_DEVICE` context.
    - If any code can run on GPU, the error / warning state should be reported to the CPU. For instance, if a variable should throw an error if negative, the good practice is to collect its minimal value with `RAJA::ReduceMin` and read it from the host context to write a proper contextualized message.
    - Because of the memory impact, any call to CUDA `printf()` is banned.
- ... (don't hesitate to suggest more)
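The reduce-then-check pattern from the banned-practices list can be sketched in host-only C++ as below. On device the reduction would be a `RAJA::ReduceMin` inside a `RAJA::forall`; this stand-in uses `std::min_element` so it stays self-contained, and the function name and message are illustrative only.

```cpp
#include <algorithm>
#include <stdexcept>
#include <string>
#include <vector>

// Host-only stand-in for the recommended pattern: never throw or print from
// the kernel; instead, reduce the offending quantity (here a minimum) and let
// the host inspect it and emit a contextualized error. On GPU the reduction
// would be a RAJA::ReduceMin inside a RAJA::forall.
void checkPressures( std::vector< double > const & pressure )
{
  double const minValue = *std::min_element( pressure.begin(), pressure.end() );
  if( minValue < 0.0 )
  {
    // Thrown from host code only, with a proper contextualized message.
    throw std::runtime_error( "Negative pressure detected, min = "
                              + std::to_string( minValue ) );
  }
}
```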
- [ ] 6.2. Ensure that the error / exception practices are in place in GEOS.
- Remove any possibility to add an error / a log from the GPU,
    - Search for places in the code where warnings could be used rather than logs.
@rrsettgast