[EPIC] ErrorManager

Open MelReyCG opened this issue 2 years ago • 1 comment

Main Goal

The goal of this EPIC is to add a component in GEOS that centralizes and manages errors (and exceptions), provides structured error data, produces clear and comprehensive error outputs suitable for everyone (users and developers), and defines a policy regarding errors and exceptions.

Issues in this EPIC

  • [ ] 1. Complete the error unit test (which must test every type of error GEOS can encounter)
    • Numeric errors, memory overflow, IO errors,
    • Stacked exceptions, exception / error while catching an exception
    • MPI errors
    • ...
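As a starting point for the "stacked exceptions" case, here is a minimal sketch using only the C++ standard library (`std::throw_with_nested` / `std::rethrow_if_nested`). The function names `readMesh`, `loadProblem` and `flatten` are hypothetical and are not part of GEOS; the point is only to show a catch block that adds context and rethrows, plus a helper a unit test could use to inspect the whole stack.

```cpp
#include <exception>
#include <stdexcept>
#include <string>

// Hypothetical low-level routine that always fails (illustration only).
void readMesh( std::string const & path )
{
  throw std::runtime_error( "cannot open '" + path + "'" );
}

// Hypothetical higher-level caller: it catches only to add context,
// then rethrows with the original exception nested inside.
void loadProblem( std::string const & path )
{
  try
  {
    readMesh( path );
  }
  catch( ... )
  {
    std::throw_with_nested( std::runtime_error( "while loading problem" ) );
  }
}

// Flattens a stack of nested exceptions into one string, outermost first.
std::string flatten( std::exception const & e )
{
  std::string msg = e.what();
  try
  {
    std::rethrow_if_nested( e ); // no-op if nothing is nested
  }
  catch( std::exception const & inner )
  {
    msg += " <- " + flatten( inner );
  }
  catch( ... )
  {
    msg += " <- (unknown exception)";
  }
  return msg;
}
```

A unit test could call `loadProblem()` inside a `try` block and compare the result of `flatten()` with the expected chain of messages.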

  • [ ] 2. Create the ErrorManager class, which:
    • Provides a centralized point to throw and manage the GEOS errors / exceptions,
    • Is based on structured error data rather than only texts,
    • Must be reliable,
    • Produces clear console outputs (not necessarily comprehensive, depending on the user type),
    • Generates an error data file that contains all error data (JSON format? One per rank, grouped in a subfolder?),
    • Has only GEOS_HOST methods, to ensure that only CPUs can throw / manage errors.

The error data structure can contain:

  • Error message,
  • Timestamp,
  • Location in the code,
  • Group / Wrapper that sent the message, if applicable (name + xml location / path in hierarchy),
  • TimeStep, convergence step and converged attribute,
  • MPI rank,
  • Parent exception data,
  • (don't hesitate to suggest more data)
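To make the discussion concrete, here is a rough sketch of what such a record and its JSON form could look like. Everything in it is an assumption for illustration: the `ErrorRecord` name, its field names, and the hand-written `toJson` helper are hypothetical, cover only part of the list above (no parent exception or convergence data), and the escaping handles only quotes, backslashes and newlines. A real implementation would more likely rely on a JSON library.

```cpp
#include <sstream>
#include <string>

// Hypothetical structured error record (field names are illustrative).
struct ErrorRecord
{
  std::string message;
  std::string timestamp;   // e.g. ISO 8601 text
  std::string file;        // location in the code
  int line;
  std::string dataContext; // Group / Wrapper path, empty if not applicable
  int timeStep;
  int rank;                // MPI rank that raised the error
};

// Minimal escaping: only quotes, backslashes and newlines are handled.
std::string escapeJson( std::string const & s )
{
  std::string out;
  for( char c : s )
  {
    if( c == '"' || c == '\\' ) { out += '\\'; out += c; }
    else if( c == '\n' ) { out += "\\n"; }
    else { out += c; }
  }
  return out;
}

// Serializes one record as a single-line JSON object.
std::string toJson( ErrorRecord const & r )
{
  std::ostringstream os;
  os << "{\"message\":\"" << escapeJson( r.message ) << "\","
     << "\"timestamp\":\"" << escapeJson( r.timestamp ) << "\","
     << "\"file\":\"" << escapeJson( r.file ) << "\","
     << "\"line\":" << r.line << ","
     << "\"dataContext\":\"" << escapeJson( r.dataContext ) << "\","
     << "\"timeStep\":" << r.timeStep << ","
     << "\"rank\":" << r.rank << "}";
  return os.str();
}
```

Writing one such object per line, in one file per rank, would keep the files easy to append to during a crash and easy to merge afterwards.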

  • [ ] 3. Factorize errors that come from multiple ranks, either synchronously or by postprocessing the generated error data file.

The goal here is to solve this classic problem: consider GEOS running on 2048 ranks, where rank 407 throws an error because of a local issue. Then ranks 203, 358, 1017 and 1502 throw another error because of ghosting cells, and all the other ranks send MPI_ABORT errors. In this situation, we can only hope that everything is output in that order in the log, but this is not guaranteed.

The solution I would like to propose is to process the error data files in one of two ways: a) if possible, when a crash occurs, rank 0 collects and factorizes the error data files from the other ranks and writes the result to stdout; b) after the complete GEOS shutdown, by launching geos or a dedicated executable / script on the folder of generated error data files. Because of HPC considerations, method a) could be opt-in, enabled by a command line parameter.
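The factorization step itself could be as simple as the sketch below, which groups records by identical message and lists the ranks that reported each one. The `RankError` type and the `factorize` / `summarize` functions are hypothetical, and reading the per-rank files is left out. Listing the groups with the fewest ranks first is only an assumed heuristic (a local issue on one rank is often the root cause), not something decided in this EPIC.

```cpp
#include <algorithm>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical reduced record: which rank reported which message.
struct RankError
{
  int rank;
  std::string message;
};

using Group = std::pair< std::string, std::vector< int > >;

// Groups errors by identical message; ranks are sorted within each group.
std::map< std::string, std::vector< int > > factorize( std::vector< RankError > const & errors )
{
  std::map< std::string, std::vector< int > > groups;
  for( RankError const & e : errors )
  {
    groups[e.message].push_back( e.rank );
  }
  for( auto & g : groups )
  {
    std::sort( g.second.begin(), g.second.end() );
  }
  return groups;
}

// One line per distinct message, groups with the fewest ranks first.
std::vector< std::string > summarize( std::vector< RankError > const & errors )
{
  std::map< std::string, std::vector< int > > const groups = factorize( errors );
  std::vector< Group > sorted( groups.begin(), groups.end() );
  std::stable_sort( sorted.begin(), sorted.end(),
                    []( Group const & a, Group const & b )
                    { return a.second.size() < b.second.size(); } );

  std::vector< std::string > lines;
  for( Group const & g : sorted )
  {
    std::string line = g.first + " (ranks:";
    for( int r : g.second )
    {
      line += " " + std::to_string( r );
    }
    lines.push_back( line + ")" );
  }
  return lines;
}
```

With the scenario above, this would reduce 2048 interleaved messages to three lines: the local error on rank 407, the ghosting error on four ranks, and a single MPI_ABORT line listing all the remaining ranks.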


  • [ ] 4. Properly manage TPL errors:
    • those that are managed with the GEOS_LAI_CHECK_ERROR() macro
    • CUDA errors (GEOS_HYPRE_CHECK_DEVICE_ERRORS(), cudaGetLastError())

  • [ ] 5. All errors from the unit test must be properly interfaced with python / pygeos

  • [ ] 6.1. Add a section in the documentation to describe "How to generate an error / an exception". What is acceptable and what is not in the GEOS code.

The following practices are banned :

  • Recovering from an exception. Exceptions can only be caught by a higher function in the call stack to add more information to them (and potentially stack exceptions).

  • Throwing any error / exception or writing any log from a GEOS_HOST_DEVICE context.

    • If code can run on a GPU, the error / warning state should be reported back to the CPU. For instance, if a negative value of a variable should raise an error, the good practice is to collect its minimal value with RAJA::ReduceMin and read it from the host context to write a properly contextualized message.
    • Because of the memory impact, any call to CUDA printf() is banned.
  • ... (don't hesitate to suggest more)
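The "reduce on device, report on host" practice can be sketched as follows. To keep the example runnable without RAJA, `std::min_element` is used here as a stand-in for the RAJA::ReduceMin reduction mentioned above; the function names `minValue` and `checkNonNegative` are hypothetical. The kernel-like part only computes a value, and all error reporting happens afterwards in host code.

```cpp
#include <algorithm>
#include <stdexcept>
#include <string>
#include <vector>

// Stand-in for the device-side part: it only reduces, it never throws or logs.
// (Assumption: in GEOS this would be a RAJA::ReduceMin filled inside a kernel.)
double minValue( std::vector< double > const & values )
{
  return *std::min_element( values.begin(), values.end() );
}

// Host-side part: reads the reduced value and builds a contextualized error.
void checkNonNegative( std::string const & name, std::vector< double > const & values )
{
  if( values.empty() )
  {
    return; // nothing to check
  }
  double const minVal = minValue( values );
  if( minVal < 0.0 )
  {
    throw std::domain_error( name + " must be non-negative, minimum found: " +
                             std::to_string( minVal ) );
  }
}
```

The design choice is that the device code stays free of any error or logging call, while the host has enough information (variable name, offending value) to produce a useful message.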

  • [ ] 6.2. Ensure that the errors / exceptions practices are in place in GEOS.

    • Remove any possibility to add an error / a log from the GPU,
    • Search the code for places where warnings could be used rather than logs.

MelReyCG, Jan 17 '24 17:01

@rrsettgast

jeannepellerin, Feb 21 '24 19:02