build recipe requests for summit and frontier

Open cyrush opened this issue 2 years ago • 11 comments

frontier public install for WarpX

Compatible with their new process.

https://warpx.readthedocs.io/en/latest/install/hpc/frontier.html

Also add info on how to build to WarpX Docs

NekRS requests (2023/08/04)

gnu + cuda builds on Summit

Summit (mpicc/mpic++/mpif77)

module load gcc make cuda

 1) lsf-tools/2.0   3) xalt/1.2.1   5) git-lfs/2.11.0   7) cmake/3.23.2              9) nsight-systems/2021.3.1.54  11) spectrum-mpi/10.4.0.3-20210112
 2) hsi/5.0.2.p5    4) DefApps      6) gcc/9.1.0        8) nsight-compute/2021.2.1  10) cuda/11.0.3

gnu + hip builds on Frontier

Frontier (cc/CC/ftn)

module load PrgEnv-gnu
module load craype-accel-amd-gfx90a
module load cray-mpich
module load rocm
module unload cray-libsci

 1) craype-x86-trento        5) xpmem/2.6.2-2.5_2.22__gd067c3f.shasta   9) craype/2.7.19          13) hsi/default              17) rocm/5.3.0
 2) libfabric/1.15.2.0       6) cray-pmi/6.1.8                         10) cray-dsmml/0.2.2       14) DefApps/default
 3) craype-network-ofi       7) gnuplot/5.4.3                          11) PrgEnv-gnu/8.3.3       15) craype-accel-amd-gfx90a
 4) perftools-base/22.12.0   8) gcc/12.2.0                             12) darshan-runtime/3.4.0  16) cray-mpich/8.1.23

cyrush avatar Aug 22 '23 22:08 cyrush

Updates

  • WarpX on Frontier: Able to update to required modules from here https://warpx.readthedocs.io/en/latest/install/hpc/frontier.html#frontier-olcf but with cce/16.0.1

  • NekRS on Frontier: HIP flags are being added to non-HIP compilations, leading to the following error:

[ 12%] Building C object blt/tests/smoke/CMakeFiles/blt_hip_runtime_c_smoke.dir/blt_hip_runtime_c_smoke.c.o
cd /autofs/nccs-svm1_sw/summit/ums/ums010/2023_01/frontier/ascent_nekrs/build/camp-2022.10.1/blt/tests/smoke && /opt/cray/pe/craype/2.7.19/bin/CC -D__HIP_PLATFORM_AMD__=1 -D__HIP_PLATFORM_HCC__=1 -isystem /opt/rocm-5.3.0/include -isystem /opt/rocm-5.3.0/llvm/lib/clang/15.0.0/include/.. -Wall -Wextra      -O3 -DNDEBUG -fPIE --rocm-path=/opt/rocm-5.3.0 -x hip --offload-arch=gfx90a -std=c++17 -MD -MT blt/tests/smoke/CMakeFiles/blt_hip_smoke.dir/blt_hip_smoke.cpp.o -MF CMakeFiles/blt_hip_smoke.dir/blt_hip_smoke.cpp.o.d -o CMakeFiles/blt_hip_smoke.dir/blt_hip_smoke.cpp.o -c /autofs/nccs-svm1_sw/summit/ums/ums010/2023_01/frontier/ascent_nekrs/camp-2022.10.1/extern/blt/tests/smoke/blt_hip_smoke.cpp
g++: error: unrecognized command-line option '--rocm-path=/opt/rocm-5.3.0'
g++: error: unrecognized command-line option '--offload-arch=gfx90a'

  • NekRS on Summit: Getting the following error, realized we need to use a newer cuda (cuda/11.1.1):

ptxas fatal : Unresolved extern function

from here https://github.com/LLNL/RAJA/blob/e78b1eb03cbcd9f954c9f54ea79b5f6f479bde45/include/RAJA/pattern/params/forall.hpp#L70

nicolemarsaglia avatar Sep 28 '23 18:09 nicolemarsaglia
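
A build script could guard against the older CUDA module before configuring. The sketch below is only an illustration: the helper name version_ge is made up here, the 11.1.1 threshold is taken from the comment above rather than from any documented minimum, and it assumes a sort that supports -V (GNU coreutils).

```shell
#!/bin/sh
# Succeed when version $1 is greater than or equal to version $2.
# Assumes `sort -V` (version sort) is available, as in GNU coreutils.
version_ge() {
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# 11.0.3 is the cuda module from the Summit list earlier in this issue;
# 11.1.1 is the version mentioned in the update above.
if version_ge "11.0.3" "11.1.1"; then
  echo "cuda ok"
else
  echo "cuda too old"
fi
```

In a real script the first argument would come from the loaded module rather than a literal.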

NekRS on Frontier building Camp:

[ 22%] Building CXX object blt/tests/smoke/CMakeFiles/blt_hip_smoke.dir/blt_hip_smoke.cpp.o
cd /autofs/nccs-svm1_sw/summit/ums/ums010/2023_01/frontier/ascent_nekrs/build/camp-2022.10.1/blt/tests/smoke && /opt/cray/pe/craype/2.7.19/bin/CC -D__HIP_PLATFORM_AMD__=1 -D__HIP_PLATFORM_HCC__=1 -isystem /opt/rocm-5.3.0/include -isystem /opt/rocm-5.3.0/llvm/lib/clang/15.0.0/include/.. -Wall -Wextra      -O3 -DNDEBUG -fPIE --rocm-path=/opt/rocm-5.3.0 -x hip --offload-arch=gfx90a -std=c++17 -MD -MT blt/tests/smoke/CMakeFiles/blt_hip_smoke.dir/blt_hip_smoke.cpp.o -MF CMakeFiles/blt_hip_smoke.dir/blt_hip_smoke.cpp.o.d -o CMakeFiles/blt_hip_smoke.dir/blt_hip_smoke.cpp.o -c /autofs/nccs-svm1_sw/summit/ums/ums010/2023_01/frontier/ascent_nekrs/camp-2022.10.1/extern/blt/tests/smoke/blt_hip_smoke.cpp
g++: error: unrecognized command-line option '--rocm-path=/opt/rocm-5.3.0'
g++: error: unrecognized command-line option '--offload-arch=gfx90a'
make[2]: *** [blt/tests/smoke/CMakeFiles/blt_hip_smoke.dir/build.make:79: blt/tests/smoke/CMakeFiles/blt_hip_smoke.dir/blt_hip_smoke.cpp.o] Error 1

nicolemarsaglia avatar Sep 28 '23 21:09 nicolemarsaglia
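
When a log like the one above is long, it can help to pull out just the flags the compiler rejected. This is a minimal sketch, not part of any build system: the function name rejected_flags is hypothetical, and it only recognizes the GCC-style "unrecognized command-line option" wording seen in this log.

```shell
#!/bin/sh
# Print each unique flag that a compiler rejected, reading a build log on stdin.
rejected_flags() {
  grep -o "unrecognized command-line option '[^']*'" \
    | sed "s/^unrecognized command-line option '//; s/'\$//" \
    | sort -u
}

# Two lines copied from the log above serve as sample input.
rejected_flags <<'EOF'
g++: error: unrecognized command-line option '--rocm-path=/opt/rocm-5.3.0'
g++: error: unrecognized command-line option '--offload-arch=gfx90a'
EOF
```

For a full build one could pipe the saved make output into the function instead of the inline sample.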

Bad news for gnu + hip on Frontier. Helpful info from our friend Ryan at OLCF: ..."the CC compiler wrapper for PrgEnv-gnu doesn't support HIP, because gcc (unlike clang) doesn't have support for HIP yet."

nicolemarsaglia avatar Oct 04 '23 20:10 nicolemarsaglia
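
One quick way to see which compiler a wrapper actually drives is to look at its version banner. The sketch below classifies such a banner as gnu or clang. The function name classify_backend is made up for illustration, and it assumes the wrapper forwards --version to the underlying compiler, which is worth verifying on the target system.

```shell
#!/bin/sh
# Classify a compiler from its `--version` text, read from stdin.
classify_backend() {
  case "$(cat)" in
    *clang*) echo "clang" ;;                               # e.g. "clang version 15.0.0"
    *GCC*|*"Free Software Foundation"*) echo "gnu" ;;      # e.g. "g++ (GCC) 12.2.0"
    *) echo "unknown" ;;
  esac
}

# On a Cray system one might run (assumption: CC passes --version through):
#   CC --version | classify_backend
# Stand-in banner so the sketch runs anywhere:
printf 'g++ (GCC) 12.2.0\n' | classify_backend
```

A result of gnu would be consistent with the g++ errors earlier in the thread, where the HIP-specific options were rejected.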

@mvictoras Unfortunately we haven't had the greatest success with these builds. We are blocked on Frontier because the PrgEnv-gnu compiler wrappers do not support HIP. On Summit, I was able to get a build with a newer cuda version, but the majority of my tests are failing with a cuda device error in vtkm.

nicolemarsaglia avatar Oct 05 '23 22:10 nicolemarsaglia

@nicolemarsaglia I am able to run NekRS + Ascent on Frontier with the ascent module.

Here is the module I use

module load PrgEnv-gnu
module load craype-accel-amd-gfx90a
module load cray-mpich
module load rocm
module load ascent/0.8.0
module unload cray-libsci

module list

export MPICH_GPU_SUPPORT_ENABLED=1

Currently Loaded Modules:
  1) craype-x86-trento                      10) PrgEnv-gnu/8.3.3
  2) libfabric/1.15.2.0                     11) darshan-runtime/3.4.0
  3) craype-network-ofi                     12) hsi/default
  4) perftools-base/22.12.0                 13) DefApps/default
  5) xpmem/2.6.2-2.5_2.22__gd067c3f.shasta  14) craype-accel-amd-gfx90a
  6) cray-pmi/6.1.8                         15) cray-mpich/8.1.23
  7) gcc/12.2.0                             16) rocm/5.3.0
  8) craype/2.7.19                          17) ascent/0.8.0
  9) cray-dsmml/0.2.2

I'm also using my own branch of NekRS which is based on our latest release, v23. Let me know if you need any further information.

yslan avatar Oct 23 '23 00:10 yslan
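
To reproduce an environment like the one listed above, a job script could check that the expected modules are present before launching. The following is a rough sketch under stated assumptions: missing_modules is a hypothetical helper, it parses the numbered "N) name/version" layout shown in the listing, and on many systems the module command writes to stderr, so the real input would be something like the output of module list 2>&1.

```shell
#!/bin/sh
# Print each required module name (given as arguments) that does not appear
# in the `module list` text supplied on stdin.
missing_modules() {
  loaded="$(cat)"
  for name in "$@"; do
    # Match ") name" followed by "/version", whitespace, or end of line.
    if ! printf '%s\n' "$loaded" | grep -Eq "\) ${name}(/|[[:space:]]|\$)"; then
      echo "$name"
    fi
  done
}

# Excerpt of the listing above used as stand-in input.
missing_modules rocm ascent craype-accel-amd-gfx90a cray-libsci <<'EOF'
  5) xpmem/2.6.2-2.5_2.22__gd067c3f.shasta  14) craype-accel-amd-gfx90a
  7) gcc/12.2.0                             16) rocm/5.3.0
  8) craype/2.7.19                          17) ascent/0.8.0
EOF
```

With this excerpt only cray-libsci is reported, which matches the module unload cray-libsci step in the comment.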

@yslan thanks for the info! I'm shocked there is an ascent module on Frontier. Unfortunately, ascent/0.8.0 will not have HIP/GPU support, but ascent/0.9.0 does, though that version is missing some key performance fixes.

nicolemarsaglia avatar Oct 24 '23 20:10 nicolemarsaglia

ascent/0.8.0 will not have HIP/GPU support

Hmm.... I have been running NekRS + Ascent on Frontier up to 75 Frontier nodes, and it runs pretty well.

NekRS is running on GPU for sure, and I found from our interface that we pass the GPU pointer to Ascent. I have a hard time believing it can get the data if Ascent is running on the host.

Need @mvictoras to double-check what is actually happening.

On the other hand, do you happen to know which version of Ascent is in that module? I can find the path to the installed location but I can't find the source code.

/sw/frontier/spack-envs/base/opt/cray-sles15-zen3/gcc-12.2.0/ascent-0.8.0-6j27g2kx4a3zpg5ojh27ffhqsuurodzy/

yslan avatar Oct 24 '23 21:10 yslan

@yslan those are facility builds created with spack, so I think the spack source stage is probably gone.

CUDA and HIP runtimes differ with respect to GPU vs host access pitfalls.

You could confirm by running a profiler to look at GPU work.

Note: We have only been using build_ascent for HIP builds. We want to have spack support for HIP, but it was changing so rapidly we had to have a stable way to build for Frontier.

cyrush avatar Oct 24 '23 21:10 cyrush
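
As a sketch of the profiler check suggested above, one option on ROCm systems is rocprof with --stats, which writes a per-kernel statistics CSV. Everything specific below is an assumption to verify locally: the output file name, the CSV layout with one header line, the application name, and the idea that Ascent's GPU work would show up under kernel names containing "vtkm". The helper kernel_rows_matching is hypothetical and the CSV rows are made-up stand-ins, not measurements.

```shell
#!/bin/sh
# Count rows in a per-kernel stats CSV whose text matches a pattern
# (case-insensitive). The first line is treated as a header and skipped.
kernel_rows_matching() {
  # $1: CSV path, $2: pattern to look for in kernel names
  tail -n +2 "$1" | grep -ci -- "$2" || true
}

# On a ROCm system the CSV might come from a run such as (hypothetical app name):
#   srun -n 1 rocprof --stats ./your_app
#   kernel_rows_matching results.stats.csv vtkm
# Stand-in file with made-up kernel names so the sketch runs anywhere:
csv="$(mktemp)"
cat > "$csv" <<'EOF'
"Name","Calls","TotalDurationNs","AverageNs","Percentage"
"example_solver_kernel",10,1000,100,50.0
"vtkm_example_kernel",5,1000,200,50.0
EOF
kernel_rows_matching "$csv" vtkm
rm -f "$csv"
```

A count of zero for the visualization kernels, while the solver kernels are present, would suggest the rendering ran on the host.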

It looks like the one I was using is rendered with OpenMP Offload. (Screenshot: Screenshot-20231026022833-1633x770)

yslan avatar Oct 27 '23 18:10 yslan

I see - I think it is using OpenMP on the CPU not GPU. GPU build should improve performance.

cyrush avatar Oct 30 '23 16:10 cyrush

I think it is using OpenMP on the CPU not GPU.

Is there any way to confirm this? On my end, I will try to set up a timer and build our own benchmark.

GPU build should improve performance

For HIP, Camp's build system seems to only support LLVM right now and we need GNU.

We only sent a GPU pointer to Ascent. Does OpenMP manage to use that to automatically run on CPU?

yslan avatar Oct 30 '23 16:10 yslan