Generating multiple meshes in parallel
Hi all,
I have recently noticed that the InternalMeshGenerator crashes in parallel runs when there is more than one meshBody.
Could someone with a better understanding of mesh generation check what could be going on? There is a test file in the inputFiles folder (which should probably become an integratedTest) that reproduces the error. Just run:
srun -n 4 geosx -i inputFiles/multipleMeshBodies/testMultipleBodies.xml -x 2 -y 2 -z 1
I am getting the following error:
[macieira@quartz1154:/usr/workspace/macieira/new_geosx_folder/GEOScopy/GEOS]$srun -p pdebug -n 4 build-quartz-clang\@14-debug/bin/geosx -i inputFiles/multipleMeshBodies/testMultipleBodies.xml -x 2 -y 2
srun: job 480016 queued and waiting for resources
srun: job 480016 has been allocated resources
Num ranks: 4
Max threads: 9
MKL max threads: 9
GEOSX version: 0.2.0 (feature/andrembcosta/arbitraryFractureShape, sha1: 7db3aee18)
- c++ compiler: clang 14.0.6
- openmp version: 201811
- MPI version: MVAPICH2 Version : 2.3.7
MVAPICH2 Release date : Wed March 02 22:00:00 EST 2022
MVAPICH2 Device : ch3:psm
MVAPICH2 configure : --prefix=/usr/tce/backend/installations/linux-rhel8-x86_64/clang-14.0.6/mvapich2-2.3.7-x3u23fmm2xvki3kxsugjhuxeblakeame --enable-shared --enable-romio --disable-silent-rules --disable-new-dtags --enable-fortran=all --enable-threads=multiple --with-ch3-rank-bits=32 --enable-wrapper-rpath=yes --disable-alloca --enable-fast=all --disable-cuda --enable-registration-cache --with-pm=hydra --with-device=ch3:psm --with-psm2=/usr --with-file-system=lustre+nfs+ufs --enable-llnl-site-specific-options --enable-debuginfo
MVAPICH2 CC : /usr/tce/spack/lib/spack/env/clang/clang -DNDEBUG -DNVALGRIND -O2
MVAPICH2 CXX : /usr/tce/spack/lib/spack/env/clang/clang++ -DNDEBUG -DNVALGRIND -O2
MVAPICH2 F77 : /usr/tce/spack/lib/spack/env/clang/gfortran -O2
MVAPICH2 FC : /usr/tce/spack/lib/spack/env/clang/gfortran -O2
- HDF5 version: 1.12.1
- Conduit version: 0.8.2
- VTK version: 9.1.0
- RAJA version: 2022.3.0
- umpire version: 2022.3.0
- adiak version: ..
- caliper version: 2.8.0
- METIS version: 5.1.0
- PARAMETIS version: 4.0.3
- scotch version: 6.0.9
- superlu_dist version: 6.3.0
- suitesparse version: 5.7.9
- Python3 version: 3.10.8
- hypre release version: 2.28.0
Started at 2023-06-08 17:40:01
Adding Solver of type SolidMechanicsLagrangianSSLE, named SolidMechSolveBody1
Adding Solver of type SolidMechanicsLagrangianSSLE, named SolidMechSolveBody2
Adding Mesh: InternalMesh, body1
Adding Mesh: InternalMesh, body2
Adding Event: PeriodicEvent, solverApplications1
Adding Event: PeriodicEvent, solverApplications2
Adding Event: PeriodicEvent, outputs
Adding Event: PeriodicEvent, siloOutputs
Adding Event: PeriodicEvent, restarts
TableFunction: f_b
Adding Output: Silo, MultiBodyTest_SiloOutput
Adding Output: VTK, MultiBodyTest_VTKOutput
Adding Output: Restart, restartOutput
Adding Object CellElementRegion named body1_cer from ObjectManager::Catalog.
Adding Object CellElementRegion named body2_cer from ObjectManager::Catalog.
body1: total number of nodes = 72
body1: total number of elems = 25
body2: total number of nodes = 32
body2: total number of elems = 9
***** ERROR
***** LOCATION: /usr/WS2/macieira/new_geosx_folder/GEOScopy/GEOS/src/coreComponents/LvArray/src/ArrayView.hpp:548
***** Controlling expression (should be false): index < 0 || index >= m_dims[ 0 ]
Array Bounds Check Failed: index=2 m_dims[0]=2
** StackTrace of 13 frames **
Frame 0: std::enable_if<(1)==(1), double&>::type LvArray::ArrayView<double, 1, 0, int, LvArray::ChaiBuffer>::operator[]<1>(int) const &
Frame 1: void geos::InternalMeshGenerator::getNodePosition<LvArray::ArraySlice<double, 1, 0, int> >(int const (&) [3], int, LvArray::ArraySlice<double, 1, 0, int>&&)
Frame 2: geos::InternalMeshGenerator::generateMesh(geos::DomainPartition&)
Frame 3: /usr/WS2/macieira/new_geosx_folder/GEOScopy/GEOS/build-quartz-clang@14-debug/lib/libgeosx_core.so
Frame 4: /usr/WS2/macieira/new_geosx_folder/GEOScopy/GEOS/build-quartz-clang@14-debug/lib/libgeosx_core.so
Frame 5: /usr/WS2/macieira/new_geosx_folder/GEOScopy/GEOS/build-quartz-clang@14-debug/lib/libgeosx_core.so
Frame 6: /usr/WS2/macieira/new_geosx_folder/GEOScopy/GEOS/build-quartz-clang@14-debug/lib/libgeosx_core.so
Frame 7: geos::MeshManager::generateMeshes(geos::DomainPartition&)
Frame 8: geos::ProblemManager::generateMesh()
Frame 9: geos::ProblemManager::problemSetup()
Frame 10: geos::GeosxState::initializeDataRepository()
Frame 11: main
Frame 12: __libc_start_main
Frame 13: _start
=====
application called MPI_Abort(MPI_COMM_WORLD, 1) - process 1
[cli_1]: aborting job:
application called MPI_Abort(MPI_COMM_WORLD, 1) - process 1
Apparently, the partitioning of the second meshBody isn't working properly: rank 0 gets all the data while the other ranks get nothing.
EDIT: this only seems to happen when the mesh coordinates extend into the negative range.
See also https://github.com/GEOS-DEV/GEOS/issues/2042