[EPIC] ErrorManager
Main Goal
The goal of this EPIC is to add a component to GEOS that centralizes and manages errors (and exceptions), provides structured error data, produces clear & comprehensive error outputs suitable for everyone (users / devs), and defines a policy regarding errors and exceptions.
Issues in this EPIC
- [ ] 1. Complete the errors unit test (which must test every type of error GEOS can encounter)
- Numeric errors, memory overflow, IO errors,
- Stacked exceptions, exception / error while catching an exception
- MPI errors
- ...
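For the "stacked exceptions" case above, a unit test could assert that context added while rethrowing actually survives the stacking. The sketch below uses standard C++ (`std::throw_with_nested` / `std::rethrow_if_nested`); the function names are hypothetical, not the GEOS API.

```cpp
#include <exception>
#include <stdexcept>
#include <string>

// Hypothetical failing step: a low-level error is caught and rethrown with
// higher-level context stacked on top of it.
void solverStep()
{
  try
  {
    throw std::runtime_error( "matrix is singular" );
  }
  catch( ... )
  {
    // Stacks the in-flight exception under a new, more contextual one.
    std::throw_with_nested( std::runtime_error( "time step 12 failed" ) );
  }
}

// Flatten a (possibly nested) exception chain into one string, outermost first.
std::string flattenMessages( std::exception const & e )
{
  std::string out = e.what();
  try
  {
    std::rethrow_if_nested( e );
  }
  catch( std::exception const & inner )
  {
    out += " | " + flattenMessages( inner );
  }
  return out;
}

std::string runAndCapture()
{
  try
  {
    solverStep();
  }
  catch( std::exception const & e )
  {
    return flattenMessages( e );
  }
  return "";
}
```

A test can then check that both the outer context and the original cause are present in the flattened message.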
- [ ] 2. Create the ErrorManager class, which:
    - Provides a centralized point to throw and manage the GEOS errors / exceptions,
    - Is based on structured error data rather than only texts,
    - Must be reliable,
    - Produces clear console outputs (not necessarily exhaustive; the level of detail depends on the user type),
    - Produces a generated error data file that contains all error data (JSON format? One per rank, grouped in a sub-folder?),
    - Has only `GEOS_HOST` methods, to ensure that only CPUs can throw / manage errors.
The error data structure can contain:
- Error message,
- timestamp,
- Location in the code,
- Group / Wrapper that sent the message, if applicable (name + xml location / path in hierarchy),
- TimeStep, convergence step and converged attribute,
- MPI rank,
- Parent exception data,
- … (don't hesitate to suggest more data)
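As a starting point for discussion, the structured record above could look like the sketch below. All field names are illustrative (not an actual GEOS API), and a real implementation would likely use a proper JSON library rather than hand-rolled serialization.

```cpp
#include <sstream>
#include <string>

// Hypothetical sketch of the structured error record listed above;
// field names are illustrative, not the actual GEOS API.
struct ErrorData
{
  std::string message;
  std::string timestamp;
  std::string codeLocation;  // e.g. "solver.cpp:42"
  std::string wrapperPath;   // XML location / path in the Group hierarchy
  int timeStep = -1;
  int mpiRank = -1;
};

// Minimal hand-rolled JSON output, just to show the shape of the
// per-rank error data file; a JSON library would do this for real.
std::string toJson( ErrorData const & e )
{
  std::ostringstream os;
  os << "{\"message\":\"" << e.message << "\","
     << "\"timestamp\":\"" << e.timestamp << "\","
     << "\"location\":\"" << e.codeLocation << "\","
     << "\"wrapperPath\":\"" << e.wrapperPath << "\","
     << "\"timeStep\":" << e.timeStep << ","
     << "\"mpiRank\":" << e.mpiRank << "}";
  return os.str();
}
```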
- [ ] 3. Factorize errors that come from multiple ranks, either synchronously or by postprocessing the generated error data files.
The goal here is to solve this classic problem: suppose GEOS ran on 2048 ranks, and rank 407 threw an error because of a local issue. Then ranks 203, 358, 1017 and 1502 threw another error because of ghosting cells, and all the other ranks sent MPI_ABORT errors. In this situation, we can only hope that everything is output in that order in the log, but it is not guaranteed.
The solution I would like to propose is to process the error data files either:
a) If possible, when a crash occurs, rank 0 collects & factorizes the error data files from the other ranks and outputs the result to stdout,
b) After the complete GEOS shutdown, by launching geos or a dedicated executable / script on the generated error data files folder.
Because of HPC considerations, method a) could be enabled by a command line parameter.
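The factorization step itself is mostly a grouping problem: collapse identical messages from many ranks into one entry with a rank list, so the 2048-rank scenario above prints a handful of lines instead of interleaved logs. A minimal host-side sketch (hypothetical names, no MPI or file IO shown):

```cpp
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical postprocessing step: group identical error messages coming
// from many ranks, so each distinct message is reported once together with
// the sorted list of ranks that emitted it (std::map keeps insertion sorted
// by message; ranks keep their input order).
std::map< std::string, std::vector< int > >
factorizeErrors( std::vector< std::pair< int, std::string > > const & perRankErrors )
{
  std::map< std::string, std::vector< int > > grouped;
  for( auto const & rankAndMessage : perRankErrors )
  {
    grouped[ rankAndMessage.second ].push_back( rankAndMessage.first );
  }
  return grouped;
}
```

In practice the input would come from parsing the per-rank error data files (method b) or from messages gathered on rank 0 (method a).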
- [ ] 4. Properly manage TPL errors,
    - those managed with the `GEOS_LAI_CHECK_ERROR()` define
    - CUDA errors (`GEOS_HYPRE_CHECK_DEVICE_ERRORS()`, `cudaGetLastError()`)
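One possible shape for these checks is to funnel every TPL return code through a single throwing helper, so the ErrorManager gets one entry point for all TPL failures. The sketch below is a plain-C++ stand-in with hypothetical names; a real version would attach the structured error data (rank, location, etc.) and the CUDA variant would wrap `cudaGetLastError()` the same way.

```cpp
#include <stdexcept>
#include <string>

// Hypothetical single entry point for TPL return-code failures; a real
// version would hand structured error data to the ErrorManager instead of
// throwing a bare runtime_error.
inline void checkTplError( int const errorCode, char const * call,
                           char const * file, int const line )
{
  if( errorCode != 0 )
  {
    throw std::runtime_error( std::string( call ) + " failed with code "
                              + std::to_string( errorCode )
                              + " (" + file + ":" + std::to_string( line ) + ")" );
  }
}

// Macro so the call site, file and line are captured automatically.
#define CHECK_TPL_ERROR( call ) checkTplError( ( call ), #call, __FILE__, __LINE__ )
```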
- [ ] 5. All errors from the unit test must be properly interfaced with Python / `pygeos`
- [ ] 6.1. Add a section in the documentation to describe "How to generate an error / an exception". What is acceptable and what is not in the GEOS code.
The following practices are banned :
- Recovering from an exception. Exceptions can only be caught by higher functions in the call-stack to add more information to them (and potentially stack exceptions).
- Throwing any error / exception or writing any log from a `GEOS_HOST_DEVICE` context.
    - If any code can run on GPU, the error / warning state should be reported to the CPU. For instance, if a variable should throw an error if negative, the good practice is to collect its minimal value with `RAJA::ReduceMin` and read it from the host context to write a proper contextualized message.
    - Because of the memory impact, any call to CUDA `printf()` is banned.
- ... (don't hesitate to suggest more)
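The reduce-then-check pattern from the banned-practices list can be sketched in host-only C++ as below. On device the reduction would be a `RAJA::ReduceMin` inside a `RAJA::forall`; this stand-in uses `std::min_element` so it stays self-contained, and the function name and message are illustrative only.

```cpp
#include <algorithm>
#include <stdexcept>
#include <string>
#include <vector>

// Host-only stand-in for the recommended pattern: never throw or print from
// the kernel; instead, reduce the offending quantity (here a minimum) and let
// the host inspect it and emit a contextualized error. On GPU the reduction
// would be a RAJA::ReduceMin inside a RAJA::forall.
void checkPressures( std::vector< double > const & pressure )
{
  double const minValue = *std::min_element( pressure.begin(), pressure.end() );
  if( minValue < 0.0 )
  {
    // Thrown from host code only, with a proper contextualized message.
    throw std::runtime_error( "Negative pressure detected, min = "
                              + std::to_string( minValue ) );
  }
}
```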
- [ ] 6.2. Ensure that the error / exception practices are in place in GEOS.
- Remove any possibility to add an error / a log from the GPU,
    - Search for places in the code where warnings could be used rather than logs.
@rrsettgast