
GPU GEOS : Mismatched template definition usage in coreComponents/physicsSolvers

Open drmichaeltcvx opened this issue 1 year ago • 13 comments

Describe the bug Building GEOS (latest "develop", with TPL "master") with CUDA runs into conflicting definitions for certain templates at the device link step.

To Reproduce Steps to reproduce the behavior:

  1. Build TPL and then GEOS with GCC 13.2.0 and CUDA 11.8, 12.5, or 12.6.


Platform (please complete the following information):

  • Machine: AMD node, CPU: AMD EPYC 7V12, GPU: A100-SXM4-80GB
  • Compiler: Host: GCC 13.2.0, GPU: CUDA 11.8, 12.5, or 12.6
  • GEOS Version : v1.1.0 commit 93f0252a19f240dc3b7375cd93bbe036ac318063

Additional context Output from make all

[ 86%] Building CUDA object coreComponents/physicsSolvers/CMakeFiles/physicsSolvers.dir/wavePropagation/sem/acoustic/secondOrderEqn/anisotropic/AcousticVTIWaveEquationSEM.cpp.o
cd /dev/shm/mtml/src/GEOS/GEOS/build-GPU-Hypre-GCC-CUDA_12.6-ompi_hpcx-OMP-relwithdebinfo/coreComponents/physicsSolvers && /vend/nvidia/cuda/v12.6/bin/nvcc -forward-unknown-to-host-compiler -ccbin=/chv/az_ussc_p/x86_64-rhel3/util/hpcx/hpcx-v2.20-gcc-mlnx_ofed-redhat8-cuda12-x86_64/ompi/bin/mpic++ -DCAMP_HAVE_CUDA -DDIY_NO_THREADS -DFMT_HEADER_ONLY=1 -DH5_BUILT_AS_DYNAMIC_LIB -DMPICH_SKIP_MPICXX -DMPI_NO_CPPBIND -DOMPI_SKIP_MPICXX -D_FILE_OFFSET_BITS=64 -D_GNU_SOURCE -D_LARGEFILE64_SOURCE -D_LARGEFILE_SOURCE -D_MPICC_H -D_POSIX_C_SOURCE=200809L -Dkiss_fft_scalar=double -DphysicsSolvers_EXPORTS --options-file CMakeFiles/physicsSolvers.dir/includes_CUDA.rsp -restrict -arch sm_80 --expt-extended-lambda --expt-relaxed-constexpr -Werror cross-execution-space-call,reorder,deprecated-declarations   -g -lineinfo    -restrict -arch sm_80 --expt-extended-lambda --expt-relaxed-constexpr -Werror cross-execution-space-call,reorder,deprecated-declarations  -ftz=true -O3 -DNDEBUG -Xcompiler -DNDEBUG -Xcompiler -Ofast -ftz=true   -std=c++17 "--generate-code=arch=compute_80,code=[compute_80,sm_80]" -Xcompiler=-fPIC -Xcompiler=-fopenmp -Xcompiler=-pthread -Xcompiler -pthread -MD -MT coreComponents/physicsSolvers/CMakeFiles/physicsSolvers.dir/wavePropagation/sem/acoustic/secondOrderEqn/anisotropic/AcousticVTIWaveEquationSEM.cpp.o -MF CMakeFiles/physicsSolvers.dir/wavePropagation/sem/acoustic/secondOrderEqn/anisotropic/AcousticVTIWaveEquationSEM.cpp.o.d -x cu -rdc=true -c /dev/shm/mtml/src/GEOS/GEOS/src/coreComponents/physicsSolvers/wavePropagation/sem/acoustic/secondOrderEqn/anisotropic/AcousticVTIWaveEquationSEM.cpp -o CMakeFiles/physicsSolvers.dir/wavePropagation/sem/acoustic/secondOrderEqn/anisotropic/AcousticVTIWaveEquationSEM.cpp.o
[ 86%] Building CUDA object coreComponents/physicsSolvers/CMakeFiles/physicsSolvers.dir/wavePropagation/sem/acoustoelastic/secondOrderEqn/isotropic/AcousticElasticWaveEquationSEM.cpp.o
cd /dev/shm/mtml/src/GEOS/GEOS/build-GPU-Hypre-GCC-CUDA_12.6-ompi_hpcx-OMP-relwithdebinfo/coreComponents/physicsSolvers && /vend/nvidia/cuda/v12.6/bin/nvcc -forward-unknown-to-host-compiler -ccbin=/chv/az_ussc_p/x86_64-rhel3/util/hpcx/hpcx-v2.20-gcc-mlnx_ofed-redhat8-cuda12-x86_64/ompi/bin/mpic++ -DCAMP_HAVE_CUDA -DDIY_NO_THREADS -DFMT_HEADER_ONLY=1 -DH5_BUILT_AS_DYNAMIC_LIB -DMPICH_SKIP_MPICXX -DMPI_NO_CPPBIND -DOMPI_SKIP_MPICXX -D_FILE_OFFSET_BITS=64 -D_GNU_SOURCE -D_LARGEFILE64_SOURCE -D_LARGEFILE_SOURCE -D_MPICC_H -D_POSIX_C_SOURCE=200809L -Dkiss_fft_scalar=double -DphysicsSolvers_EXPORTS --options-file CMakeFiles/physicsSolvers.dir/includes_CUDA.rsp -restrict -arch sm_80 --expt-extended-lambda --expt-relaxed-constexpr -Werror cross-execution-space-call,reorder,deprecated-declarations   -g -lineinfo    -restrict -arch sm_80 --expt-extended-lambda --expt-relaxed-constexpr -Werror cross-execution-space-call,reorder,deprecated-declarations  -ftz=true -O3 -DNDEBUG -Xcompiler -DNDEBUG -Xcompiler -Ofast -ftz=true   -std=c++17 "--generate-code=arch=compute_80,code=[compute_80,sm_80]" -Xcompiler=-fPIC -Xcompiler=-fopenmp -Xcompiler=-pthread -Xcompiler -pthread -MD -MT coreComponents/physicsSolvers/CMakeFiles/physicsSolvers.dir/wavePropagation/sem/acoustoelastic/secondOrderEqn/isotropic/AcousticElasticWaveEquationSEM.cpp.o -MF CMakeFiles/physicsSolvers.dir/wavePropagation/sem/acoustoelastic/secondOrderEqn/isotropic/AcousticElasticWaveEquationSEM.cpp.o.d -x cu -rdc=true -c /dev/shm/mtml/src/GEOS/GEOS/src/coreComponents/physicsSolvers/wavePropagation/sem/acoustoelastic/secondOrderEqn/isotropic/AcousticElasticWaveEquationSEM.cpp -o CMakeFiles/physicsSolvers.dir/wavePropagation/sem/acoustoelastic/secondOrderEqn/isotropic/AcousticElasticWaveEquationSEM.cpp.o

[ 86%] Linking CUDA device code CMakeFiles/physicsSolvers.dir/cmake_device_link.o
cd /dev/shm/mtml/src/GEOS/GEOS/build-GPU-Hypre-GCC-CUDA_12.6-ompi_hpcx-OMP-relwithdebinfo/coreComponents/physicsSolvers && /data/saet/mtml/software/x86_64/cmake-3.28.3-linux-x86_64/bin/cmake -E cmake_link_script CMakeFiles/physicsSolvers.dir/dlink.txt --verbose=1
/vend/nvidia/cuda/v12.6/bin/nvcc -forward-unknown-to-host-compiler -ccbin=/chv/az_ussc_p/x86_64-rhel3/util/hpcx/hpcx-v2.20-gcc-mlnx_ofed-redhat8-cuda12-x86_64/ompi/bin/mpic++   -restrict -arch sm_80 --expt-extended-lambda --expt-relaxed-constexpr -Werror cross-execution-space-call,reorder,deprecated-declarations   -g -lineinfo    -restrict -arch sm_80 --expt-extended-lambda --expt-relaxed-constexpr -Werror cross-execution-space-call,reorder,deprecated-declarations  -ftz=true -O3 -DNDEBUG -Xcompiler -DNDEBUG -Xcompiler -Ofast -ftz=true   "--generate-code=arch=compute_80,code=[compute_80,sm_80]" -Xcompiler=-fopenmp -Xcompiler=-L/vend/nvidia/cuda/v12.6/lib64 -Xlinker=-rpath -Xlinker=/chv/az_ussc_p/x86_64-rhel3/util/hpcx/hpcx-v2.20-gcc-mlnx_ofed-redhat8-cuda12-x86_64/ompi/lib -Xlinker=--enable-new-dtags -Xcompiler=-pthread -Xcompiler=-fPIC -Wno-deprecated-gpu-targets -shared -dlink --options-file CMakeFiles/physicsSolvers.dir/deviceObjects1.rsp --options-file CMakeFiles/physicsSolvers.dir/deviceObjects2.rsp -o CMakeFiles/physicsSolvers.dir/cmake_device_link.o --options-file CMakeFiles/physicsSolvers.dir/deviceLinkLibs.rsp
nvlink error   : Size doesn't match for '_ZN4geos13finiteElement17FiniteElementBaseC1ERKS1_$824' in 'CMakeFiles/physicsSolvers.dir/__/__/generatedSrc/coreComponents/physicsSolvers/multiphysics/poromechanicsKernels/PoromechanicsKernels_CellElementSubRegion_PorousSolid-DelftEgg-_H1_Hexahedron_Lagrange1_GaussLegendre2.cpp.o', first specified in 'CMakeFiles/physicsSolvers.dir/__/__/generatedSrc/coreComponents/physicsSolvers/multiphysics/poromechanicsKernels/PoromechanicsKernels_CellElementSubRegion_PorousSolid-DruckerPragerExtended-_H1_Hexahedron_Lagrange1_GaussLegendre2.cpp.o' (target: sm_80)
nvlink fatal   : merge_elf failed (target: sm_80)
make[2]: *** [coreComponents/physicsSolvers/CMakeFiles/physicsSolvers.dir/build.make:13103: coreComponents/physicsSolvers/CMakeFiles/physicsSolvers.dir/cmake_device_link.o] Error 1
make[2]: Target 'coreComponents/physicsSolvers/CMakeFiles/physicsSolvers.dir/build' not remade because of errors.
make[2]: Leaving directory '/dev/shm/mtml/src/GEOS/GEOS/build-GPU-Hypre-GCC-CUDA_12.6-ompi_hpcx-OMP-relwithdebinfo'
make[1]: *** [CMakeFiles/Makefile2:8964: coreComponents/physicsSolvers/CMakeFiles/physicsSolvers.dir/all] Error 2
make[1]: Target 'all' not remade because of errors.
make[1]: Leaving directory '/dev/shm/mtml/src/GEOS/GEOS/build-GPU-Hypre-GCC-CUDA_12.6-ompi_hpcx-OMP-relwithdebinfo'
make: *** [Makefile:146: all] Error 2
[Make-build-2024-12-17-103544.log](https://github.com/user-attachments/files/18171272/Make-build-2024-12-17-103544.log)
[PoromechanicsKernels_CellElementSubRegion_PorousSolid-DruckerPragerExtended-_H1_Hexahedron_Lagrange1_GaussLegendre2.cpp.txt](https://github.com/user-attachments/files/18171302/PoromechanicsKernels_CellElementSubRegion_PorousSolid-DruckerPragerExtended-_H1_Hexahedron_Lagrange1_GaussLegendre2.cpp.txt)
[ThermoPoromechanicsKernels_CellElementSubRegion_PorousSolid-DelftEgg-_H1_Hexahedron_Lagrange1_GaussLegendre2.cpp.txt](https://github.com/user-attachments/files/18171303/ThermoPoromechanicsKernels_CellElementSubRegion_PorousSolid-DelftEgg-_H1_Hexahedron_Lagrange1_GaussLegendre2.cpp.txt)
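
Note: the symbol named in the nvlink message is an Itanium-ABI mangled C++ name, so demangling it shows which function the two object files disagree about. Below is a minimal sketch, assuming a GCC or Clang host toolchain (which ships `abi::__cxa_demangle` in `<cxxabi.h>`). The helper name `demangleNvlinkSymbol` is hypothetical, and treating the trailing `$824` as a toolchain-added suffix that can simply be stripped is an assumption; the log does not say what it means.

```cpp
#include <cassert>
#include <cstdlib>
#include <cxxabi.h>   // abi::__cxa_demangle (GCC / Clang only)
#include <string>

// Hypothetical helper: strips a trailing "$<suffix>" (assumed to be added by
// the CUDA toolchain) and demangles what is left. Returns the stripped input
// unchanged if it cannot be demangled.
std::string demangleNvlinkSymbol( std::string symbol )
{
  assert( !symbol.empty() );

  std::size_t const dollar = symbol.find( '$' );
  if( dollar != std::string::npos )
  {
    symbol.erase( dollar );
  }

  int status = 0;
  char * const demangled = abi::__cxa_demangle( symbol.c_str(), nullptr, nullptr, &status );
  if( status != 0 || demangled == nullptr )
  {
    return symbol;
  }

  std::string const result( demangled );
  std::free( demangled );   // the buffer is malloc-allocated by __cxa_demangle
  return result;
}
```

Applied to `_ZN4geos13finiteElement17FiniteElementBaseC1ERKS1_$824`, this resolves to the copy constructor `geos::finiteElement::FiniteElementBase::FiniteElementBase(geos::finiteElement::FiniteElementBase const&)`.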

drmichaeltcvx avatar Dec 17 '24 20:12 drmichaeltcvx

When we force-define CALC_FEM_SHAPE_IN_KERNEL, we get these messages during compilation:

[ 42%] Building CUDA object coreComponents/finiteElement/unitTests/CMakeFiles/testFiniteElementBase.dir/testFiniteElementBase.cpp.o
cd /data/saet/mtml/src/GEOS_miket/GEOS/GEOS/build-GPU-OPTO3-Hypre-GCC-CUDA-MPI-OMP-relwithdebinfo/coreComponents/finiteElement/unitTests && /vend/nvidia/cuda/v12.6/bin/nvcc -forward-unknown-to-host-compiler -ccbin=/chv/az_ussc_p/x86_64-rhel3/util/hpcx/hpcx-v2.21-gcc-doca_ofed-redhat8-cuda12-x86_64/ompi/bin/mpic++ -DCAMP_HAVE_CUDA -DFMT_HEADER_ONLY=1 -DGTEST_HAS_DEATH_TEST=1 -D_FILE_OFFSET_BITS=64 -D_GNU_SOURCE -D_LARGEFILE64_SOURCE -D_LARGEFILE_SOURCE -D_POSIX_C_SOURCE=200809L -DtestFiniteElementBase_EXPORTS --options-file CMakeFiles/testFiniteElementBase.dir/includes_CUDA.rsp -restrict -arch sm_80 --expt-extended-lambda --expt-relaxed-constexpr -Werror cross-execution-space-call,reorder,deprecated-declarations -DGEOS_USE_DEVICE   -g -lineinfo  -O3 -ftz=true -Xcompiler -O3 -Xcompiler -fno-fast-math -Xcompiler -mdaz-ftz -DNDEBUG   -std=c++17 "--generate-code=arch=compute_80,code=[compute_80,sm_80]" -Xcompiler=-fPIE -Xcompiler=-fopenmp -Xcompiler=-pthread -MD -MT coreComponents/finiteElement/unitTests/CMakeFiles/testFiniteElementBase.dir/testFiniteElementBase.cpp.o -MF CMakeFiles/testFiniteElementBase.dir/testFiniteElementBase.cpp.o.d -x cu -rdc=true -c /data/saet/mtml/src/GEOS_miket/GEOS/GEOS/src/coreComponents/finiteElement/unitTests/testFiniteElementBase.cpp -o CMakeFiles/testFiniteElementBase.dir/testFiniteElementBase.cpp.o
In file included from /data/saet/mtml/src/GEOS_miket/GEOS/GEOS/src/coreComponents/common/DataTypes.hpp:27,
                 from /data/saet/mtml/src/GEOS_miket/GEOS/GEOS/src/coreComponents/finiteElement/elementFormulations/FiniteElementBase.hpp:29,
                 from /data/saet/mtml/src/GEOS_miket/GEOS/GEOS/src/coreComponents/finiteElement/unitTests/testFiniteElementBase.cpp:17:
/data/saet/mtml/src/GEOS_miket/GEOS/GEOS/build-GPU-OPTO3-Hypre-GCC-CUDA-MPI-OMP-relwithdebinfo/include/common/GeosxConfig.hpp:199: warning: "GEOS_USE_DEVICE" redefined
  199 | #define GEOS_USE_DEVICE
      | 
<command-line>: note: this is the location of the previous definition
In file included from /data/saet/mtml/src/GEOS_miket/GEOS/GEOS/src/coreComponents/common/DataTypes.hpp:27,
                 from /data/saet/mtml/src/GEOS_miket/GEOS/GEOS/src/coreComponents/finiteElement/elementFormulations/FiniteElementBase.hpp:29,
                 from /data/saet/mtml/src/GEOS_miket/GEOS/GEOS/src/coreComponents/finiteElement/unitTests/testFiniteElementBase.cpp:17:
/data/saet/mtml/src/GEOS_miket/GEOS/GEOS/build-GPU-OPTO3-Hypre-GCC-CUDA-MPI-OMP-relwithdebinfo/include/common/GeosxConfig.hpp:199: warning: "GEOS_USE_DEVICE" redefined
  199 | #define GEOS_USE_DEVICE
      | 
<command-line>: note: this is the location of the previous definition
/data/saet/mtml/src/GEOS_miket/GEOS/GEOS/src/coreComponents/finiteElement/unitTests/testFiniteElementBase.cpp(116): error: "parallelDevicePolicy" is ambiguous
    forAll< parallelDevicePolicy<> >( 1, [ feBase, gradNDimsView, detJDimsView ]( int const i )
            ^

/data/saet/mtml/src/GEOS_miket/GEOS/GEOS/src/coreComponents/finiteElement/unitTests/testFiniteElementBase.cpp(116): error: expected an expression
    forAll< parallelDevicePolicy<> >( 1, [ feBase, gradNDimsView, detJDimsView ]( int const i )
                                 ^

/data/saet/mtml/src/GEOS_miket/GEOS/GEOS/src/coreComponents/finiteElement/unitTests/testFiniteElementBase.cpp(116): error: expected an identifier
    forAll< parallelDevicePolicy<> >( 1, [ feBase, gradNDimsView, detJDimsView ]( int const i )
                                   ^

/data/saet/mtml/src/GEOS_miket/GEOS/GEOS/src/coreComponents/finiteElement/unitTests/testFiniteElementBase.cpp(116): error: expected a type specifier
    forAll< parallelDevicePolicy<> >( 1, [ feBase, gradNDimsView, detJDimsView ]( int const i )
                                      ^

/data/saet/mtml/src/GEOS_miket/GEOS/GEOS/src/coreComponents/finiteElement/unitTests/testFiniteElementBase.cpp(116): error: expected a type specifier
    forAll< parallelDevicePolicy<> >( 1, [ feBase, gradNDimsView, detJDimsView ]( int const i )
                                         ^

/data/saet/mtml/src/GEOS_miket/GEOS/GEOS/src/coreComponents/finiteElement/unitTests/testFiniteElementBase.cpp(134): warning #12-D: parsing restarts here after previous syntax error
    } );
      ^

Remark: The warnings can be suppressed with "-diag-suppress <warning-number>"

5 errors detected in the compilation of "/data/saet/mtml/src/GEOS_miket/GEOS/GEOS/src/coreComponents/finiteElement/unitTests/testFiniteElementBase.cpp".
make[2]: *** [coreComponents/finiteElement/unitTests/CMakeFiles/testFiniteElementBase.dir/build.make:77: coreComponents/finiteElement/unitTests/CMakeFiles/testFiniteElementBase.dir/testFiniteElementBase.cpp.o] Error 2
make[2]: Target 'coreComponents/finiteElement/unitTests/CMakeFiles/testFiniteElementBase.dir/build' not remade because of errors.
make[2]: Leaving directory '/data/saet/mtml/src/GEOS_miket/GEOS/GEOS/build-GPU-OPTO3-Hypre-GCC-CUDA-MPI-OMP-relwithdebinfo'
make[1]: *** [CMakeFiles/Makefile2:7925: coreComponents/finiteElement/unitTests/CMakeFiles/testFiniteElementBase.dir/all] Error 2
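
Note: the `"GEOS_USE_DEVICE" redefined` warnings in this log are what the preprocessor emits when a macro is defined twice with different replacement lists. `-DGEOS_USE_DEVICE` on the command line defines the macro as `1`, while line 199 of GeosxConfig.hpp defines it with an empty body. The sketch below shows the two forms side by side under hypothetical macro names (`DEMO_FROM_COMMAND_LINE`, `DEMO_FROM_HEADER`). It does not address the `parallelDevicePolicy` ambiguity, which is a separate error.

```cpp
#include <string>

// Two-level stringification so the macro argument is expanded before quoting.
#define DEMO_STRINGIFY_IMPL( x ) #x
#define DEMO_STRINGIFY( x ) DEMO_STRINGIFY_IMPL( x )

// Equivalent of passing -DDEMO_FROM_COMMAND_LINE: the macro expands to 1.
#define DEMO_FROM_COMMAND_LINE 1

// Equivalent of "#define NAME" in a generated config header: expands to nothing.
#define DEMO_FROM_HEADER

// Returns "1".
std::string commandLineExpansion()
{
  return DEMO_STRINGIFY( DEMO_FROM_COMMAND_LINE );
}

// Returns an empty string.
std::string headerExpansion()
{
  return DEMO_STRINGIFY( DEMO_FROM_HEADER );
}

// Both forms satisfy #if defined(...); only their expansions differ, which is
// why giving one name both definitions triggers a "redefined" warning.
bool bothAreDefined()
{
#if defined( DEMO_FROM_COMMAND_LINE ) && defined( DEMO_FROM_HEADER )
  return true;
#else
  return false;
#endif
}
```

If the config header already defines the macro, dropping the extra `-D` flag should silence the warning.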

drmichaeltcvx avatar Jan 17 '25 19:01 drmichaeltcvx

I used the branch 'testing/cusini/deactivate-some-kernels' Matteo prepared (https://github.com/GEOS-DEV/GEOS/pull/3516)

Unfortunately, the FiniteElementBase mismatch error has simply moved to another pair of derived classes:

/vend/nvidia/cuda/v12.6/bin/nvcc -forward-unknown-to-host-compiler -ccbin=/chv/az_ussc_p/x86_64-rhel3/util/hpcx/hpcx-v2.21-gcc-doca_ofed-redhat8-cuda12-x86_64/ompi/bin/mpic++   -restrict -arch sm_80 --expt-extended-lambda --expt-relaxed-constexpr -Werror cross-execution-space-call,reorder,deprecated-declarations  -g -lineinfo  -O3 -ftz=true -Xcompiler -O3 -Xcompiler -fno-fast-math -Xcompiler -mdaz-ftz -DNDEBUG   "--generate-code=arch=compute_80,code=[compute_80,sm_80]" -Xcompiler=-fopenmp -Xcompiler=-L/vend/nvidia/cuda/v12.6/lib64 -Xlinker=-rpath -Xlinker=/chv/az_ussc_p/x86_64-rhel3/util/hpcx/hpcx-v2.21-gcc-doca_ofed-redhat8-cuda12-x86_64/ompi/lib -Xlinker=--enable-new-dtags -Xcompiler=-pthread -Xcompiler=-fPIC -Wno-deprecated-gpu-targets -shared -dlink --options-file CMakeFiles/physicsSolvers.dir/deviceObjects1.rsp --options-file CMakeFiles/physicsSolvers.dir/deviceObjects2.rsp -o CMakeFiles/physicsSolvers.dir/cmake_device_link.o --options-file CMakeFiles/physicsSolvers.dir/deviceLinkLibs.rsp
nvlink error   : Size doesn't match for '_ZN4geos13finiteElement17FiniteElementBaseC1ERKS1_$933' in 
'CMakeFiles/physicsSolvers.dir/__/__/generatedSrc/coreComponents/physicsSolvers/multiphysics/poromechanicsKernels/ThermoPoromechanicsKernels_CellElementSubRegion_PorousSolid-Damage-ElasticIsotropic--_H1_Hexahedron_Lagrange1_GaussLegendre2.cpp.o', first specified in
'CMakeFiles/physicsSolvers.dir/__/__/generatedSrc/coreComponents/physicsSolvers/multiphysics/poromechanicsKernels/PoromechanicsKernels_CellElementSubRegion_PorousSolid-DamageSpectral-ElasticIsotropic--_H1_Hexahedron_Lagrange1_GaussLegendre2.cpp.o' (target: sm_80)
nvlink fatal   : merge_elf failed (target: sm_80)

make[2]: *** [coreComponents/physicsSolvers/CMakeFiles/physicsSolvers.dir/build.make:12168: coreComponents/physicsSolvers/CMakeFiles/physicsSolvers.dir/cmake_device_link.o] Error 1
make[2]: Target 'coreComponents/physicsSolvers/CMakeFiles/physicsSolvers.dir/build' not remade because of errors.
make[2]: Leaving directory '/data/saet/mtml/src/GEOS_miket/GEOS/GEOS/build-GPU-Hypre-GCC-CUDA_12.6-ompi_hpcx-OMP-relwithdebinfo'
make[1]: *** [CMakeFiles/Makefile2:8964: coreComponents/physicsSolvers/CMakeFiles/physicsSolvers.dir/all] Error 2
make[1]: Target 'all' not remade because of errors.
make[1]: Leaving directory '/data/saet/mtml/src/GEOS_miket/GEOS/GEOS/build-GPU-Hypre-GCC-CUDA_12.6-ompi_hpcx-OMP-relwithdebinfo'
make: *** [Makefile:146: all] Error 2

drmichaeltcvx avatar Jan 17 '25 21:01 drmichaeltcvx

Here are excerpts from the failing build with testing/cusini/deactivate-some-kernels:


[ 61%] Building CUDA object coreComponents/physicsSolvers/CMakeFiles/physicsSolvers.dir/__/__/generatedSrc/coreComponents/physicsSolvers/multiphysics/poromechanicsKernels/PoromechanicsKernels_CellElementSubRegion_PorousSolid-DamageSpectral-ElasticIsotropic--_H1_Hexahedron_Lagrange1_GaussLegendre2.cpp.o
cd /data/saet/mtml/src/GEOS_miket/GEOS/GEOS/build-GPU-Hypre-GCC-CUDA_12.6-ompi_hpcx-OMP-relwithdebinfo/coreComponents/physicsSolvers && /vend/nvidia/cuda/v12.6/bin/nvcc -forward-unknown-to-host-compiler -ccbin=/chv/az_ussc_p/x86_64-rhel3/util/hpcx/hpcx-v2.21-gcc-doca_ofed-redhat8-cuda12-x86_64/ompi/bin/mpic++ -DCAMP_HAVE_CUDA -DDIY_NO_THREADS -DFMT_HEADER_ONLY=1 -DH5_BUILT_AS_DYNAMIC_LIB -DMPICH_SKIP_MPICXX -DMPI_NO_CPPBIND -DOMPI_SKIP_MPICXX -D_FILE_OFFSET_BITS=64 -D_GNU_SOURCE -D_LARGEFILE64_SOURCE -D_LARGEFILE_SOURCE -D_MPICC_H -D_POSIX_C_SOURCE=200809L -Dkiss_fft_scalar=double -DphysicsSolvers_EXPORTS --options-file CMakeFiles/physicsSolvers.dir/includes_CUDA.rsp -restrict -arch sm_80 --expt-extended-lambda --expt-relaxed-constexpr -Werror cross-execution-space-call,reorder,deprecated-declarations  -g -lineinfo  -O3 -ftz=true -Xcompiler -O3 -Xcompiler -fno-fast-math -Xcompiler -mdaz-ftz -DNDEBUG   -std=c++17 "--generate-code=arch=compute_80,code=[compute_80,sm_80]" -Xcompiler=-fPIC -Xcompiler=-fopenmp -Xcompiler=-pthread -Xcompiler -pthread -MD -MT coreComponents/physicsSolvers/CMakeFiles/physicsSolvers.dir/__/__/generatedSrc/coreComponents/physicsSolvers/multiphysics/poromechanicsKernels/PoromechanicsKernels_CellElementSubRegion_PorousSolid-DamageSpectral-ElasticIsotropic--_H1_Hexahedron_Lagrange1_GaussLegendre2.cpp.o -MF CMakeFiles/physicsSolvers.dir/__/__/generatedSrc/coreComponents/physicsSolvers/multiphysics/poromechanicsKernels/PoromechanicsKernels_CellElementSubRegion_PorousSolid-DamageSpectral-ElasticIsotropic--_H1_Hexahedron_Lagrange1_GaussLegendre2.cpp.o.d -x cu -rdc=true -c /data/saet/mtml/src/GEOS_miket/GEOS/GEOS/build-GPU-Hypre-GCC-CUDA_12.6-ompi_hpcx-OMP-relwithdebinfo/generatedSrc/coreComponents/physicsSolvers/multiphysics/poromechanicsKernels/PoromechanicsKernels_CellElementSubRegion_PorousSolid-DamageSpectral-ElasticIsotropic--_H1_Hexahedron_Lagrange1_GaussLegendre2.cpp -o 
CMakeFiles/physicsSolvers.dir/__/__/generatedSrc/coreComponents/physicsSolvers/multiphysics/poromechanicsKernels/PoromechanicsKernels_CellElementSubRegion_PorousSolid-DamageSpectral-ElasticIsotropic--_H1_Hexahedron_Lagrange1_GaussLegendre2.cpp.o

[ 63%] Building CUDA object coreComponents/physicsSolvers/CMakeFiles/physicsSolvers.dir/__/__/generatedSrc/coreComponents/physicsSolvers/multiphysics/poromechanicsKernels/PoromechanicsKernels_CellElementSubRegion_PorousSolid-Damage-ElasticIsotropic--_H1_Hexahedron_Lagrange1_GaussLegendre2.cpp.o
cd /data/saet/mtml/src/GEOS_miket/GEOS/GEOS/build-GPU-Hypre-GCC-CUDA_12.6-ompi_hpcx-OMP-relwithdebinfo/coreComponents/physicsSolvers && /vend/nvidia/cuda/v12.6/bin/nvcc -forward-unknown-to-host-compiler -ccbin=/chv/az_ussc_p/x86_64-rhel3/util/hpcx/hpcx-v2.21-gcc-doca_ofed-redhat8-cuda12-x86_64/ompi/bin/mpic++ -DCAMP_HAVE_CUDA -DDIY_NO_THREADS -DFMT_HEADER_ONLY=1 -DH5_BUILT_AS_DYNAMIC_LIB -DMPICH_SKIP_MPICXX -DMPI_NO_CPPBIND -DOMPI_SKIP_MPICXX -D_FILE_OFFSET_BITS=64 -D_GNU_SOURCE -D_LARGEFILE64_SOURCE -D_LARGEFILE_SOURCE -D_MPICC_H -D_POSIX_C_SOURCE=200809L -Dkiss_fft_scalar=double -DphysicsSolvers_EXPORTS --options-file CMakeFiles/physicsSolvers.dir/includes_CUDA.rsp -restrict -arch sm_80 --expt-extended-lambda --expt-relaxed-constexpr -Werror cross-execution-space-call,reorder,deprecated-declarations  -g -lineinfo  -O3 -ftz=true -Xcompiler -O3 -Xcompiler -fno-fast-math -Xcompiler -mdaz-ftz -DNDEBUG   -std=c++17 "--generate-code=arch=compute_80,code=[compute_80,sm_80]" -Xcompiler=-fPIC -Xcompiler=-fopenmp -Xcompiler=-pthread -Xcompiler -pthread -MD -MT coreComponents/physicsSolvers/CMakeFiles/physicsSolvers.dir/__/__/generatedSrc/coreComponents/physicsSolvers/multiphysics/poromechanicsKernels/PoromechanicsKernels_CellElementSubRegion_PorousSolid-Damage-ElasticIsotropic--_H1_Hexahedron_Lagrange1_GaussLegendre2.cpp.o -MF CMakeFiles/physicsSolvers.dir/__/__/generatedSrc/coreComponents/physicsSolvers/multiphysics/poromechanicsKernels/PoromechanicsKernels_CellElementSubRegion_PorousSolid-Damage-ElasticIsotropic--_H1_Hexahedron_Lagrange1_GaussLegendre2.cpp.o.d -x cu -rdc=true -c /data/saet/mtml/src/GEOS_miket/GEOS/GEOS/build-GPU-Hypre-GCC-CUDA_12.6-ompi_hpcx-OMP-relwithdebinfo/generatedSrc/coreComponents/physicsSolvers/multiphysics/poromechanicsKernels/PoromechanicsKernels_CellElementSubRegion_PorousSolid-Damage-ElasticIsotropic--_H1_Hexahedron_Lagrange1_GaussLegendre2.cpp -o 
CMakeFiles/physicsSolvers.dir/__/__/generatedSrc/coreComponents/physicsSolvers/multiphysics/poromechanicsKernels/PoromechanicsKernels_CellElementSubRegion_PorousSolid-Damage-ElasticIsotropic--_H1_Hexahedron_Lagrange1_GaussLegendre2.cpp.o

[ 93%] Linking CUDA device code CMakeFiles/physicsSolvers.dir/cmake_device_link.o
cd /data/saet/mtml/src/GEOS_miket/GEOS/GEOS/build-GPU-Hypre-GCC-CUDA_12.6-ompi_hpcx-OMP-relwithdebinfo/coreComponents/physicsSolvers && /data/saet/mtml/software/x86_64/cmake-3.28.3-linux-x86_64/bin/cmake -E cmake_link_script CMakeFiles/physicsSolvers.dir/dlink.txt --verbose=1

/vend/nvidia/cuda/v12.6/bin/nvcc -forward-unknown-to-host-compiler -ccbin=/chv/az_ussc_p/x86_64-rhel3/util/hpcx/hpcx-v2.21-gcc-doca_ofed-redhat8-cuda12-x86_64/ompi/bin/mpic++   -restrict -arch sm_80 --expt-extended-lambda --expt-relaxed-constexpr -Werror cross-execution-space-call,reorder,deprecated-declarations  -g -lineinfo  -O3 -ftz=true -Xcompiler -O3 -Xcompiler -fno-fast-math -Xcompiler -mdaz-ftz -DNDEBUG   "--generate-code=arch=compute_80,code=[compute_80,sm_80]" -Xcompiler=-fopenmp -Xcompiler=-L/vend/nvidia/cuda/v12.6/lib64 -Xlinker=-rpath -Xlinker=/chv/az_ussc_p/x86_64-rhel3/util/hpcx/hpcx-v2.21-gcc-doca_ofed-redhat8-cuda12-x86_64/ompi/lib -Xlinker=--enable-new-dtags -Xcompiler=-pthread -Xcompiler=-fPIC -Wno-deprecated-gpu-targets -shared -dlink --options-file CMakeFiles/physicsSolvers.dir/deviceObjects1.rsp --options-file CMakeFiles/physicsSolvers.dir/deviceObjects2.rsp -o CMakeFiles/physicsSolvers.dir/cmake_device_link.o --options-file CMakeFiles/physicsSolvers.dir/deviceLinkLibs.rsp

nvlink error   : Size doesn't match for '_ZN4geos13finiteElement17FiniteElementBaseC1ERKS1_$933' in 
'CMakeFiles/physicsSolvers.dir/__/__/generatedSrc/coreComponents/physicsSolvers/multiphysics/poromechanicsKernels/ThermoPoromechanicsKernels_CellElementSubRegion_PorousSolid-Damage-ElasticIsotropic--_H1_Hexahedron_Lagrange1_GaussLegendre2.cpp.o', first specified in 
'CMakeFiles/physicsSolvers.dir/__/__/generatedSrc/coreComponents/physicsSolvers/multiphysics/poromechanicsKernels/PoromechanicsKernels_CellElementSubRegion_PorousSolid-DamageSpectral-ElasticIsotropic--_H1_Hexahedron_Lagrange1_GaussLegendre2.cpp.o' (target: sm_80)
nvlink fatal   : merge_elf failed (target: sm_80)

make[2]: *** [coreComponents/physicsSolvers/CMakeFiles/physicsSolvers.dir/build.make:12168: coreComponents/physicsSolvers/CMakeFiles/physicsSolvers.dir/cmake_device_link.o] Error 1
make[2]: Target 'coreComponents/physicsSolvers/CMakeFiles/physicsSolvers.dir/build' not remade because of errors.
make[2]: Leaving directory '/data/saet/mtml/src/GEOS_miket/GEOS/GEOS/build-GPU-Hypre-GCC-CUDA_12.6-ompi_hpcx-OMP-relwithdebinfo'
make[1]: *** [CMakeFiles/Makefile2:8964: coreComponents/physicsSolvers/CMakeFiles/physicsSolvers.dir/all] Error 2
make[1]: Target 'all' not remade because of errors.
make[1]: Leaving directory '/data/saet/mtml/src/GEOS_miket/GEOS/GEOS/build-GPU-Hypre-GCC-CUDA_12.6-ompi_hpcx-OMP-relwithdebinfo'
make: *** [Makefile:146: all] Error 2

Could the problem be caused by a missing '__device__' qualifier that differs between GEOS\src\coreComponents\physicsSolvers\multiphysics\poromechanicsKernels\MultiphasePoromechanics_impl.hpp and GEOS\src\coreComponents\physicsSolvers\multiphysics\poromechanicsKernels\SinglePhasePoromechanics_impl.hpp?
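
Note: for context on what a "mismatched definition" can look like in general, the linker message says that two object files carry a symbol with the same name but a different size. One generic way this can arise in C++ is a One Definition Rule violation, where the same class or inline function is compiled differently in two translation units, for example because a macro is set in one and not in the other. The sketch below is purely illustrative and is not a diagnosis of the GEOS failure. `ElementBaseDemo` and its members are hypothetical, and a `bool` template parameter stands in for "macro defined" versus "macro undefined" so that both variants can be compared in a single file.

```cpp
#include <cstddef>

// Hypothetical class whose members depend on a build switch. In a real ODR
// violation both variants would have the SAME name and live in different
// translation units; the template parameter here only emulates that.
template< bool PRECOMPUTED >
struct ElementBaseDemo
{
  double const * m_gradN = nullptr;  // present in both variants
  double const * m_detJ = nullptr;   // only present when PRECOMPUTED is true
};

template<>
struct ElementBaseDemo< false >
{
  double const * m_gradN = nullptr;
};

// What "translation unit A" and "translation unit B" would each believe.
constexpr std::size_t sizeInUnitA = sizeof( ElementBaseDemo< true > );
constexpr std::size_t sizeInUnitB = sizeof( ElementBaseDemo< false > );

static_assert( sizeInUnitA != sizeInUnitB,
               "the two variants are expected to have different layouts" );

// Any implicitly generated copy constructor copies a different number of
// bytes in each variant, so its compiled form differs too.
constexpr bool layoutsDiffer()
{
  return sizeInUnitA != sizeInUnitB;
}
```

In a real build both variants would share one name, so the compiler sees nothing wrong within a single file, and a linker may or may not notice the conflict when it merges the object files.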

drmichaeltcvx avatar Jan 21 '25 22:01 drmichaeltcvx

@drmichaeltcvx It is difficult for me to think that this would have any dependence on hardware (Intel vs AMD). I would think this has to be something in the software stack. The dockerfiles that generate our TPL environment for CI are here: https://github.com/GEOS-DEV/thirdPartyLibs/tree/master/docker

You should look on Docker Hub for a base image that most closely matches your Linux distribution and version. I don't know if you have said which Linux distribution you are on. Once you have a suitable base image with a hopefully equivalent software stack, adding the image should involve copying one of the dockerfiles, replacing the base image, and modifying the packages to mimic your software stack.

Here are some examples of base images: https://hub.docker.com/_/ubuntu https://hub.docker.com/r/rockylinux/rockylinux https://hub.docker.com/_/fedora

rrsettgast avatar Jan 22 '25 18:01 rrsettgast

Thanks Randy,

How do you generate the CI image for GEOS? Do you have dockerfiles for this?

Michael


drmichaeltcvx avatar Jan 22 '25 22:01 drmichaeltcvx

I am providing here the configure and build logs for a failed GPU build. Let's go through these first to see if we can identify any useful information that could point to where the problem starts. TPL builds fine for GPUs on our s/w stack.

CMake command

cmake -DCMAKE_BUILD_TYPE=RelWithDebInfo -DCMAKE_INSTALL_PREFIX=/data/saet/mtml/src/GEOS_miket/GEOS/GEOS/build-GPU-Hypre-GCC-CUDA_12.6-ompi_hpcx-OMP-relwithdebinfo -DENABLE_YAPF=OFF -DGEOSX_DIR=/data/saet/mtml/src/GEOS_miket/GEOS/GEOS/build-GPU-Hypre-GCC-CUDA_12.6-ompi_hpcx-OMP-relwithdebinfo -DGEOSX_TPL_DIR=/data/saet/mtml/software/x86_64/RHEL8/GEOSTPL/1.1.0-miket__GPU-build-fix-2025-01-14/install-GPU-OPTO3-Hypre-GCC-CUDA-MPI-OMP-relwithdebinfo -C/data/saet/mtml/src/GEOS_mtml/GEOS/host-configs/CVX/GPU-Hypre-GCC-CUDA_12.6-ompi_hpcx-OMP.cmake /data/saet/mtml/src/GEOS_miket/GEOS/GEOS/src

  • Configure log

Config-2025-01-17-134202.log

  • build log

Make-build-2025-01-17-134202.log

Please comment

drmichaeltcvx avatar Jan 24 '25 20:01 drmichaeltcvx

You should try static linking and see if the error persists.

You can add

set(GEOS_BUILD_SHARED_LIBS OFF CACHE BOOL "" FORCE)

to your host-config.

CusiniM avatar Jan 31 '25 17:01 CusiniM

Here are the configure and build logs for the all static linking approach.

Configure Config-2025-02-01-093152.log

Build Make-build-2025-02-01-093152.log

drmichaeltcvx avatar Feb 03 '25 04:02 drmichaeltcvx

Nothing seems to change with static linking. At this point, two things need to happen:

  1. Try to reproduce it on a CI image because currently none of our images shows this error and we can't reproduce it on Lassen. It must be linked to the software stack used on the CVX systems.
  2. Figure out which merged PR triggered this behavior.

CusiniM avatar Feb 03 '25 05:02 CusiniM

Dockerfiles are defined here and CI jobs are defined here for the TPLs and here for GEOS. If you start by setting up the base image in the TPLs repo, I can help you with adding a CI job in the GEOS repo. The hard part will be finding an image that reproduces the error. It would be great if you had an image of one of the systems you are using, but I suspect that this does not exist.

CusiniM avatar Feb 03 '25 05:02 CusiniM

Thanks, Matteo.

Yep, there are no docker images for the standard HPC image I am using. We do not generate docker images. I will consult with our admins to try to get TPL/GEOS docker images for our s/w stack.

Actually, the only thing that may be worth evaluating is our Linux OS kernel (we are using Alma8, an offshoot of RHEL8, running on Microsoft's virtualization engine on Azure). All other s/w stack components are standard: GCC (10, 11, 13) and CUDA 11.x, 12.x with the latest CUDA drivers.

drmichaeltcvx avatar Feb 03 '25 15:02 drmichaeltcvx

BTW, I also checked the dependency files (coreComponents/physicsSolvers/CMakeFiles/physicsSolvers.dir/__/__/generatedSrc/coreComponents/*/*/*/*.o.d) to ensure that the FiniteElementBase.hpp is included in all of the generated class definitions and apparently none of them is missing it.

I was hoping that some of them would miss including it, but nope...
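
Note: the dependency-file check described above can be automated. Below is a minimal C++17 sketch, assuming the `.o.d` files are Makefile-style dependency lists that mention each included header by path. The function name `depFilesMissingHeader` is hypothetical.

```cpp
#include <cassert>
#include <filesystem>
#include <fstream>
#include <iterator>
#include <string>
#include <vector>

namespace fs = std::filesystem;

// Walks `root` recursively and returns every "*.o.d" dependency file whose
// contents do not mention `headerName` anywhere.
std::vector< fs::path > depFilesMissingHeader( fs::path const & root,
                                               std::string const & headerName )
{
  assert( fs::is_directory( root ) );

  std::string const suffix = ".o.d";
  std::vector< fs::path > missing;

  for( auto const & entry : fs::recursive_directory_iterator( root ) )
  {
    if( !entry.is_regular_file() )
    {
      continue;
    }

    // Keep only files whose name ends in ".o.d".
    std::string const name = entry.path().filename().string();
    if( name.size() < suffix.size() ||
        name.compare( name.size() - suffix.size(), suffix.size(), suffix ) != 0 )
    {
      continue;
    }

    // Read the whole dependency file and search it for the header name.
    std::ifstream in( entry.path() );
    std::string const text( ( std::istreambuf_iterator< char >( in ) ),
                            std::istreambuf_iterator< char >() );
    if( text.find( headerName ) == std::string::npos )
    {
      missing.push_back( entry.path() );
    }
  }
  return missing;
}
```

Calling `depFilesMissingHeader( "coreComponents/physicsSolvers/CMakeFiles/physicsSolvers.dir", "FiniteElementBase.hpp" )` from the build directory would return an empty list if, as reported above, no dependency file is missing the header.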

drmichaeltcvx avatar Feb 03 '25 16:02 drmichaeltcvx

https://github.com/GEOS-DEV/GEOS/pull/3725

paveltomin avatar Oct 01 '25 20:10 paveltomin