Fixes for building with LLVM / XL OpenMP offload
Various changes (some temporary) to make the XL OpenMP offload build work.
* todo: need to rebase this branch on latest hackathon_master
* todo: temporary changes to OpenAccHelper.cmake, need refinement
* todo: see Caliper linking issue
* todo: _OPENACC needs to be renamed to CORENRN_ENABLE_GPU so that OpenMP-based builds can also use GPU offload (see the sketch after this list)
* todo: hardcoded CXX flags for quick build
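A minimal sketch of the guard rename mentioned in the todo above (an assumption about the intent, not the actual CoreNEURON code): GPU-only code that is currently compiled only when the OpenACC compiler defines `_OPENACC` would instead be guarded by a project-level macro set for any GPU build, so that OpenMP offload builds take the same path.

```cpp
// Hypothetical sketch: CORENRN_ENABLE_GPU is assumed to be passed as a compile
// definition for any GPU build (OpenACC or OpenMP offload).
#if defined(CORENRN_ENABLE_GPU)  // previously: #if defined(_OPENACC)
constexpr bool corenrn_gpu_build = true;
#else
constexpr bool corenrn_gpu_build = false;
#endif
```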
How to test this?
Outstanding issues?
- [ ] Building with XLC gives (57cecb59d051b83adbca41fbe49680521b6d719f):
make[2]: *** [coreneuron/CMakeFiles/coreneuron.dir/build.make:114: coreneuron/CMakeFiles/coreneuron.dir/io/core2nrn_data_return.cpp.o] Error 1
In file included from /m100/home/userexternal/pkumbhar/CoreNeuron/coreneuron/io/mech_report.cpp:12:
In file included from /m100/home/userexternal/pkumbhar/CoreNeuron/coreneuron/coreneuron.hpp:24:
In file included from /m100/home/userexternal/pkumbhar/CoreNeuron/coreneuron/utils/randoms/nrnran123.h:42:
In file included from /m100/home/userexternal/pkumbhar/CoreNeuron/build/include/Random123/philox.h:37:
In file included from /m100/home/userexternal/pkumbhar/CoreNeuron/build/include/Random123/features/compilerfeatures.h:218:
In file included from /m100/home/userexternal/pkumbhar/CoreNeuron/build/include/Random123/features/clangfeatures.h:91:
/m100/home/userexternal/pkumbhar/CoreNeuron/build/include/Random123/features/gccfeatures.h:48:10: fatal error: 'ppu_intrinsics.h' file not found
#include <ppu_intrinsics.h>
^~~~~~~~~~~~~~~~~~
- [ ] Running ringtest with XLC gives (57cecb59d051b83adbca41fbe49680521b6d719f):
(venv) [pkumbhar@login01 build]$ ./bin/ppc64le/special-core -d ../tests/integration/ring --gpu --cell-permute=2
Info : 4 GPUs shared by 1 ranks per node
Duke, Yale, and the BlueBrain Project -- Copyright 1984-2020
Version : 1.0 57cecb59 (2022-01-03 17:00:52 +0100)
Additional mechanisms from files
exp2syn.mod expsyn.mod halfgap.mod hh.mod netstim.mod passive.mod pattern.mod stim.mod svclmp.mod
Memory (MBs) : After mk_mech : Max 390.8125, Min 390.8125, Avg 390.8125
Memory (MBs) : After MPI_Init : Max 390.8125, Min 390.8125, Avg 390.8125
Memory (MBs) : Before nrn_setup : Max 390.8125, Min 390.8125, Avg 390.8125
best_balance=0.848837 ncell=10 ntype=3 nwarp=10
best_balance=0.82093 ncell=10 ntype=3 nwarp=10
Setup Done : 0.13 seconds
Model size : 84.19 kB
Memory (MBs) : After nrn_setup : Max 398.8750, Min 398.8750, Avg 398.8750
GENERAL PARAMETERS
--mpi=false
--mpi-lib=
--gpu=true
--dt=0.025
--tstop=100
GPU
--nwarp=65536
--cell-permute=2
--cuda-interface=false
INPUT PARAMETERS
--voltage=-65
--seed=-1
--datpath=../tests/integration/ring
--filesdat=files.dat
--pattern=
--report-conf=
--restore=
PARALLEL COMPUTATION PARAMETERS
--threading=false
--skip_mpi_finalize=false
SPIKE EXCHANGE
--ms_phases=2
--ms_subintervals=2
--multisend=false
--spk_compress=0
--binqueue=false
CONFIGURATION
--spikebuf=100000
--prcellgid=-1
--forwardskip=0
--celsius=6.3
--mindelay=1
--report-buffer-size=4
OUTPUT PARAMETERS
--dt_io=0.1
--outpath=.
--checkpoint=
Start time (t) = 0
Memory (MBs) : After mk_spikevec_buffer : Max 398.8750, Min 398.8750, Avg 398.8750
Memory (MBs) : After nrn_finitialize : Max 398.6875, Min 398.6875, Avg 398.6875
1587-175 The underlying GPU runtime reported the following error "an illegal memory access was encountered".
1587-175 The underlying GPU runtime reported the following error "an illegal memory access was encountered".
1587-163 Error encountered while attempting to execute on the target device 0. The program will stop.
1587-163 Error encountered while attempting to execute on the target device 0. The program will stop.
free(): corrupted unsorted chunks
The issue seems to be with net_buf_receive_ExpSyn(). If I comment it out, then it runs further.
Running for 0 msec gives:
(venv) [pkumbhar@login01 build]$ gdb --args ./bin/ppc64le/special-core -d ../tests/integration/ring --gpu --cell-permute=2 -e 0
GNU gdb (GDB) Red Hat Enterprise Linux 8.2-6.el8_0
...
(gdb) r
Starting program: /m100/home/userexternal/pkumbhar/CoreNeuron/build/bin/ppc64le/special-core -d ../tests/integration/ring --gpu --cell-permute=2 -e 0
Missing separate debuginfos, use: dnf debuginfo-install glibc-2.28-72.el8_1.1.ppc64le
Missing separate debuginfo for /cineca/prod/opt/compilers/cuda/11.2/none/compat/libcuda.so.1
Try: dnf --enablerepo='*debug*' install /usr/lib/debug/.build-id/d1/e9e189d76f924564adf8c9d73eeb713d5b23d9.debug
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/power9/libthread_db.so.1".
[New Thread 0x7fffefcbd890 (LWP 128546)]
Info : 4 GPUs shared by 1 ranks per node
Duke, Yale, and the BlueBrain Project -- Copyright 1984-2020
Version : 1.0 57cecb59 (2022-01-03 17:00:52 +0100)
[New Thread 0x7fffcfffd890 (LWP 128551)]
Additional mechanisms from files
exp2syn.mod expsyn.mod halfgap.mod hh.mod netstim.mod passive.mod pattern.mod stim.mod svclmp.mod
Memory (MBs) : After mk_mech : Max 390.7500, Min 390.7500, Avg 390.7500
Memory (MBs) : After MPI_Init : Max 390.7500, Min 390.7500, Avg 390.7500
Memory (MBs) : Before nrn_setup : Max 390.7500, Min 390.7500, Avg 390.7500
best_balance=0.848837 ncell=10 ntype=3 nwarp=10
best_balance=0.82093 ncell=10 ntype=3 nwarp=10
Setup Done : 0.00 seconds
Model size : 84.19 kB
Memory (MBs) : After nrn_setup : Max 390.7500, Min 390.7500, Avg 390.7500
GENERAL PARAMETERS
--mpi=false
--mpi-lib=
--gpu=true
--dt=0.025
--tstop=0
GPU
--nwarp=65536
--cell-permute=2
--cuda-interface=false
INPUT PARAMETERS
--voltage=-65
--seed=-1
--datpath=../tests/integration/ring
--filesdat=files.dat
--pattern=
--report-conf=
--restore=
PARALLEL COMPUTATION PARAMETERS
--threading=false
--skip_mpi_finalize=false
SPIKE EXCHANGE
--ms_phases=2
--ms_subintervals=2
--multisend=false
--spk_compress=0
--binqueue=false
CONFIGURATION
--spikebuf=100000
--prcellgid=-1
--forwardskip=0
--celsius=6.3
--mindelay=1
--report-buffer-size=4
OUTPUT PARAMETERS
--dt_io=0.1
--outpath=.
--checkpoint=
Start time (t) = 0
Memory (MBs) : After mk_spikevec_buffer : Max 390.7500, Min 390.7500, Avg 390.7500
Memory (MBs) : After nrn_finitialize : Max 390.5625, Min 390.5625, Avg 390.5625
psolve |=========================================================| t: 0.00 ETA: 0h00m00s
Solver Time : 4.69685e-05
Thread 1 "special-core" received signal SIGSEGV, Segmentation fault.
IPRA.$_ZN10coreneuronL17update_ml_on_hostEPKNS_9Memb_listEi (ml=0x7fff50009f00, type=-16288) at /m100/home/userexternal/pkumbhar/CoreNeuron/coreneuron/gpu/nrn_acc_manager.cpp:180
180 int n = ml->nodecount;
Missing separate debuginfos, use: dnf debuginfo-install libgcc-8.3.1-4.5.el8.ppc64le libstdc++-8.3.1-4.5.el8.ppc64le
(gdb) bt
#0 IPRA.$_ZN10coreneuronL17update_ml_on_hostEPKNS_9Memb_listEi (ml=0x7fff50009f00, type=-16288) at /m100/home/userexternal/pkumbhar/CoreNeuron/coreneuron/gpu/nrn_acc_manager.cpp:180
#1 0x000000001014c28c in coreneuron::update_nrnthreads_on_host () at /m100/home/userexternal/pkumbhar/CoreNeuron/coreneuron/gpu/nrn_acc_manager.cpp:831
#2 0x00000000100df048 in run_solve_core (argc=-14016, argv=0x0) at /m100/home/userexternal/pkumbhar/CoreNeuron/coreneuron/apps/main1.cpp:630
#3 0x000000001003a048 in solve_core (argc=-13968, argv=0x0) at /m100/home/userexternal/pkumbhar/CoreNeuron/build/share/coreneuron/enginemech.cpp:49
#4 0x0000000010039fd8 in main (argc=-13936, argv=0x7fff0000002e) at /m100/home/userexternal/pkumbhar/CoreNeuron/build/share/coreneuron/coreneuron.cpp:14
If I comment out update_nrnthreads_on_host() then it runs further.
CI_BRANCHES:NMODL_BRANCH=hackathon_main,NEURON_BRANCH=master,
On Ascent @ ORNL:
module load nvhpc/21.9 python/3.7.0 cmake flex bison
module swap cuda/10.1.243 cuda/11.0.2
module use /ccsopen/proj/gen170/neuron/spack_modules/linux-rhel7-power9le
module load caliper ninja py-pytest py-sympy py-jinja2 py-pyyaml boost
export NVLOCALRC=/ccsopen/proj/gen170/neuron/nersc-gpu-hackathon-dec-2021/localrc
export PATH=/sw/ascent/gcc/6.4.0/bin:$PATH
module load xl/16.1.1-10
Configure XL for newer GCC:
xlc_configure -gcc /sw/ascent/gcc/10.2.0 -o /ccsopen/proj/gen170/neuron/nersc-gpu-hackaxlc_gcc10.cfg -cuda /sw/ascent/cuda/11.0.2 -cudaVersion 11.0 -cuda_cc_major 7 -cuda_cc_minor 0
And build with:
cmake .. -DCORENRN_ENABLE_CALIPER_PROFILING=ON -DCORENRN_ENABLE_GPU=ON -DCORENRN_ENABLE_NMODL=ON -DCMAKE_INSTALL_PREFIX=../install -DCMAKE_CXX_FLAGS="-DR123_USE_SSE=0" -DCMAKE_CUDA_ARCHITECTURES=70 -DCMAKE_CUDA_COMPILER=nvcc -DCORENRN_EXTERNAL_BENCHMARK_DATA=/ccsopen/proj/gen170/neuron/nersc-gpu-hackathon-dec-2021/ -DCMAKE_CXX_FLAGS="" -DCORENRN_ENABLE_UNIT_TESTS=OFF -DCMAKE_CXX_COMPILER=xlc++_r -DCORENRN_NMODL_DIR=/ccsopen/proj/gen170/neuron/nersc-gpu-hackathon-dec-2021/users/kumbhar/nmodl/build/install -DCORENRN_ENABLE_MPI=ON
I rebased this and tried to resolve the conflicts fairly blindly.
Logfiles from GitLab pipeline #29782 (:no_entry:) have been uploaded here!
Status and direct links:
- :no_entry: build:coreneuron+nmodl:intel
- :no_entry: build:coreneuron:intel
- :no_entry: build:coreneuron+nmodl:gpu
- :no_entry: build:coreneuron+nmodl~openmp:gpu
- :no_entry: build:coreneuron:gpu
- :fast_forward: test:coreneuron+nmodl:intel
- :fast_forward: test:coreneuron:intel
- :fast_forward: test:coreneuron+nmodl:gpu
- :fast_forward: test:coreneuron+nmodl~openmp:gpu
- :fast_forward: test:coreneuron:gpu
- :fast_forward: build:neuron+nmodl:intel
- :fast_forward: build:neuron:intel
- :fast_forward: build:neuron+nmodl:gpu
- :fast_forward: build:neuron+nmodl~openmp:gpu
- :fast_forward: build:neuron:gpu
- :fast_forward: test:neuron+nmodl:intel
- :fast_forward: test:neuron:intel
- :fast_forward: test:neuron+nmodl:gpu
- :fast_forward: test:neuron+nmodl~openmp:gpu
- :fast_forward: test:neuron:gpu
Logfiles from GitLab pipeline #30397 (:no_entry:) have been uploaded here!
Status and direct links:
- :fast_forward: build:coreneuron+nmodl:intel
- :no_entry: build:coreneuron:intel
- :fast_forward: build:coreneuron+nmodl:gpu
- :fast_forward: build:coreneuron+nmodl~openmp:gpu
- :no_entry: build:coreneuron:gpu
- :fast_forward: test:coreneuron+nmodl:intel
- :fast_forward: test:coreneuron:intel
- :fast_forward: test:coreneuron+nmodl:gpu
- :fast_forward: test:coreneuron+nmodl~openmp:gpu
- :fast_forward: test:coreneuron:gpu
- :fast_forward: build:neuron+nmodl:intel
- :fast_forward: build:neuron:intel
- :fast_forward: build:neuron+nmodl:gpu
- :fast_forward: build:neuron+nmodl~openmp:gpu
- :fast_forward: build:neuron:gpu
- :fast_forward: test:neuron+nmodl:intel
- :fast_forward: test:neuron:intel
- :fast_forward: test:neuron+nmodl:gpu
- :fast_forward: test:neuron+nmodl~openmp:gpu
- :fast_forward: test:neuron:gpu
Logfiles from GitLab pipeline #30552 (:no_entry:) have been uploaded here!
Status and direct links:
- :fast_forward: build:coreneuron+nmodl:intel
- :no_entry: build:coreneuron:intel
- :fast_forward: build:coreneuron+nmodl:gpu
- :fast_forward: build:coreneuron+nmodl~openmp:gpu
- :no_entry: build:coreneuron:gpu
- :fast_forward: test:coreneuron+nmodl:intel
- :fast_forward: test:coreneuron:intel
- :fast_forward: test:coreneuron+nmodl:gpu
- :fast_forward: test:coreneuron+nmodl~openmp:gpu
- :fast_forward: test:coreneuron:gpu
- :fast_forward: build:neuron+nmodl:intel
- :fast_forward: build:neuron:intel
- :fast_forward: build:neuron+nmodl:gpu
- :fast_forward: build:neuron+nmodl~openmp:gpu
- :fast_forward: build:neuron:gpu
- :fast_forward: test:neuron+nmodl:intel
- :fast_forward: test:neuron:intel
- :fast_forward: test:neuron+nmodl:gpu
- :fast_forward: test:neuron+nmodl~openmp:gpu
- :fast_forward: test:neuron:gpu
Logfiles from GitLab pipeline #30663 (:no_entry:) have been uploaded here!
Status and direct links:
- :fast_forward: build:coreneuron+nmodl:intel
- :no_entry: build:coreneuron:intel
- :fast_forward: build:coreneuron+nmodl:gpu
- :fast_forward: build:coreneuron+nmodl~openmp:gpu
- :no_entry: build:coreneuron:gpu
- :fast_forward: test:coreneuron+nmodl:intel
- :fast_forward: test:coreneuron:intel
- :fast_forward: test:coreneuron+nmodl:gpu
- :fast_forward: test:coreneuron+nmodl~openmp:gpu
- :fast_forward: test:coreneuron:gpu
- :fast_forward: build:neuron+nmodl:intel
- :fast_forward: build:neuron:intel
- :fast_forward: build:neuron+nmodl:gpu
- :fast_forward: build:neuron+nmodl~openmp:gpu
- :fast_forward: build:neuron:gpu
- :fast_forward: test:neuron+nmodl:intel
- :fast_forward: test:neuron:intel
- :fast_forward: test:neuron+nmodl:gpu
- :fast_forward: test:neuron+nmodl~openmp:gpu
- :fast_forward: test:neuron:gpu
Here are some notes from various experiments / debugging attempts to get OpenMP offload working with LLVM v13.0.
- In a Slack thread, Olli reminded us of the small reproducer from the NERSC Hackathon showing an issue with global variables and a static library:
[olupton@r2i3n6 build_olli_gpu]$ cat test.sh
CXX=clang++
CXXFLAGS="-fopenmp -fopenmp-targets=nvptx64-nvidia-cuda -g"
${CXX} ${CXXFLAGS} -c test.cpp
ar cq libtest.a test.o
${CXX} ${CXXFLAGS} -o test1 main.cpp -L. -ltest
${CXX} ${CXXFLAGS} -o test2 main.cpp test.o
[olupton@r2i3n6 build_olli_gpu]$ cat test.cpp
#pragma omp declare target
int y;
#pragma omp end declare target
int test() {
    y = 24;
#pragma omp target update to(y)
    y = 42;
    int x;
#pragma omp target map(from:x)
    {
        x = y;
    }
    return x;
}
[olupton@r2i3n6 build_olli_gpu]$ cat main.cpp
extern int test();
int main() {
    return test();
}
small reproducer for the problem I am seeing now — going via the static library seems to cause problems :shrug:
[olupton@r2i3n6 build_olli_gpu]$ ./test1
CUDA error: Loading global 'y' Failed
CUDA error: named symbol not found
Libomptarget error: Unable to generate entries table for device id 0.
Libomptarget error: Failed to init globals on device 0
Libomptarget error: Run with LIBOMPTARGET_INFO=4 to dump host-target pointer mappings.
test.cpp:7:3: Libomptarget fatal error 1: failure of target construct while offloading is mandatory
Aborted
[olupton@r2i3n6 build_olli_gpu]$ ./test2; echo $?
24
- So I switched to a shared library in coreneuron + nrnivmodl-core, i.e. built a shared library for libcorenrnmech. This gave undefined-symbol errors for global variables and functions defined in libcoreneuron.a:
nvlink error : Undefined reference to '_ZN10coreneuron7celsiusE' in '/gpfs/bbp.cscs.ch/ssd/slurmTmpFS/kumbhar/149976/_dimplic-ba4873.cubin'
nvlink error : Undefined reference to '_ZN10coreneuron2piE' in '/gpfs/bbp.cscs.ch/ssd/slurmTmpFS/kumbhar/149976/_dimplic-ba4873.cubin'
nvlink error : Undefined reference to '_ZN10coreneuron11secondorderE' in '/gpfs/bbp.cscs.ch/ssd/slurmTmpFS/kumbhar/149976/_dimplic-ba4873.cubin'
nvlink error : Undefined reference to '_ZN10coreneuron7at_timeEPNS_9NrnThreadEd' in
See also Olli's comment in the NERSC GPU Hackathon Slack (a sketch of the workaround he describes follows the backtrace below):
With the Clang/OpenMP build + shared libraries instead of static, I avoided some link errors by removing the annotations in the header
#pragma omp declare target
extern double celsius;
#pragma omp end declare target
and the remaining one (at_time function) by making its body available. But now I have a new segfault at startup:
Program received signal SIGSEGV, Segmentation fault.
0x00007fffed92b801 in RTLsTy::RegisterRequires(long) () from /gpfs/bbp.cscs.ch/ssd/apps/bsd/2021-11/stage_externals/install_gcc-11.2.0-skylake/llvm-13.0.0-rrwbmv/lib/libomptarget.so
(gdb) bt
#0 0x00007fffed92b801 in RTLsTy::RegisterRequires(long) () from /gpfs/bbp.cscs.ch/ssd/apps/bsd/2021-11/stage_externals/install_gcc-11.2.0-skylake/llvm-13.0.0-rrwbmv/lib/libomptarget.so
#1 0x00007fffed92824e in __tgt_register_requires () from /gpfs/bbp.cscs.ch/ssd/apps/bsd/2021-11/stage_externals/install_gcc-11.2.0-skylake/llvm-13.0.0-rrwbmv/lib/libomptarget.so
#2 0x00007fffed8f29c3 in _dl_init_internal () from /lib64/ld-linux-x86-64.so.2
#3 0x00007fffed8e417a in _dl_start_user () from /lib64/ld-linux-x86-64.so.2
still investigating..
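For reference, a minimal sketch (hypothetical names, not the actual celsius/at_time code) of the "making its body available" workaround mentioned in the quoted comment: when a function is only declared in a header and defined in another library, nvlink cannot resolve the device-side reference, whereas an inline definition in the header lets each translation unit generate its own device code.

```cpp
// Hypothetical illustration of the workaround; threshold_check() stands in for at_time().
// Before: only a declaration was visible to the mechanism code,
//   double threshold_check(double t);  // defined in libcoreneuron -> nvlink "Undefined reference"
// After: an inline definition in the header is usable from target regions in any translation unit.
inline double threshold_check(double t) {
    return t > 0.0 ? 1.0 : 0.0;
}

double run_on_device(double t) {
    double r = 0.0;
#pragma omp target map(from: r)
    { r = threshold_check(t); }
    return r;
}
```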
- Then I avoided building the libcorenrnmech library and tried to use the mechanism object files directly to create special-core (by modifying nrnivmodl-core-makefile). This didn't get far: I saw similar linking errors for global symbols in the libcoreneuron library:
=> Binary creating x86_64/special-core
/gpfs/bbp.cscs.ch/ssd/apps/bsd/2021-11/stage_externals/install_gcc-11.2.0-skylake/llvm-13.0.0-lvcrm6/bin/clang++ -fopenmp=libomp -std=c++14 -fopenmp -fopenmp-targets=nvptx64-nvidia-cuda -Wno-unknown-cuda-version -I/gpfs/bbp.cscs.ch/ssd/apps/bsd/2021-11/stage_externals/install_gcc-11.2.0-skylake/cuda-11.4.2-yopegt/include -Wl,--as-needed -fopenmp -fopenmp-targets=nvptx64-nvidia-cuda -Wno-unknown-cuda-version -I/gpfs/bbp.cscs.ch/ssd/apps/bsd/2021-11/stage_externals/install_gcc-11.2.0-skylake/cuda-11.4.2-yopegt/include -o x86_64/special-core /gpfs/bbp.cscs.ch/home/kumbhar/workarena/systems/bbpv/repos/bbp/coreneuron/build/share/coreneuron/coreneuron.cpp \
-I/gpfs/bbp.cscs.ch/home/kumbhar/workarena/systems/bbpv/repos/bbp/coreneuron/build/include \
x86_64/corenrn/build/_mod_func.o x86_64/corenrn/build/_dimplic.o x86_64/corenrn/build/exp2syn.o x86_64/corenrn/build/expsyn.o x86_64/corenrn/build/halfgap.o x86_64/corenrn/build/hh.o x86_64/corenrn/build/netstim.o x86_64/corenrn/build/passive.o x86_64/corenrn/build/pattern.o x86_64/corenrn/build/stim.o x86_64/corenrn/build/svclmp.o x86_64/corenrn/build/enginemech.o
nvlink error : Undefined reference to '_ZN10coreneuron7celsiusE' in '/gpfs/bbp.cscs.ch/ssd/slurmTmpFS/kumbhar/149976/_dimplic-ba4873.cubin'
nvlink error : Undefined reference to '_ZN10coreneuron2piE' in '/gpfs/bbp.cscs.ch/ssd/slurmTmpFS/kumbhar/149976/_dimplic-ba4873.cubin'
nvlink error : Undefined reference to '_ZN10coreneuron11secondorderE' in '/gpfs/bbp.cscs.ch/ssd/slurmTmpFS/kumbhar/149976/_dimplic-ba4873.cubin'
nvlink error : Undefined reference to '_ZN10coreneuron7at_timeEPNS_9NrnThreadEd' in '/gpfs/bbp.cscs.ch/ssd/slurmTmpFS/kumbhar/149976/svclmp-852313.cubin'
- Next, I wanted to avoid building any intermediate libraries, i.e. create the special-core binary from all the object files directly. For this, I took the mechanism cpp files generated by nrnivmodl-core and added them into the coreneuron source tree (temporarily). Then I updated CMake to build special-core directly, i.e. something like (partial diff):
+++ b/coreneuron/CMakeLists.txt
@@ -47,6 +47,13 @@ list(APPEND CORENEURON_CODE_FILES ${PROJECT_BINARY_DIR}/coreneuron/config/config
set(DIMPLIC_CODE_FILE "mechanism/mech/dimplic.cpp")
set(ENGINEMECH_CODE_FILE "mechanism/mech/enginemech.cpp")
+file(GLOB CORENEURON_SPECIAL_CORE_FILES "exe/*.cpp")
+list(APPEND CORENEURON_SPECIAL_CORE_FILES "${CMAKE_CURRENT_SOURCE_DIR}/apps/coreneuron.cpp")
+list(APPEND CORENEURON_SPECIAL_CORE_FILES "${DIMPLIC_CODE_FILE}")
+list(APPEND CORENEURON_SPECIAL_CORE_FILES "${ENGINEMECH_CODE_FILE}")
+
# for external mod files we need to generate modl_ref function in mod_func.c
set(MODFUNC_PERL_SCRIPT "mechanism/mech/mod_func.c.pl")
@@ -184,6 +191,8 @@ if(CORENRN_ENABLE_MPI AND NOT CORENRN_ENABLE_MPI_DYNAMIC)
set(CORENRN_MPI_OBJ $<TARGET_OBJECTS:${CORENRN_MPI_LIB_NAME}>)
endif()
+set(COMPILE_LIBRARY_TYPE OBJECT)
+
# main coreneuron library
add_library(
coreneuron
@@ -319,7 +328,7 @@ add_custom_command(
"${modfile_directory}"
WORKING_DIRECTORY ${CMAKE_BINARY_DIR}/bin
COMMENT "Running nrnivmodl-core with halfgap.mod")
-add_custom_target(nrniv-core ALL DEPENDS ${output_binaries})
+#add_custom_target(nrniv-core ALL DEPENDS ${output_binaries})
include_directories(${CORENEURON_PROJECT_SOURCE_DIR})
@@ -358,6 +367,12 @@ configure_file("utils/profile/profiler_interface.h"
# main program required for building special-core
file(COPY apps/coreneuron.cpp DESTINATION ${CMAKE_BINARY_DIR}/share/coreneuron)
+add_executable(special-core-gpu ${CORENEURON_SPECIAL_CORE_FILES})
+target_compile_options(special-core-gpu BEFORE PRIVATE $<$<COMPILE_LANGUAGE:CXX>:${CORENRN_ACC_FLAGS}>)
+target_compile_definitions(special-core-gpu PUBLIC -DADDITIONAL_MECHS)
+target_link_libraries(special-core-gpu coreneuron ${CMAKE_DL_LIBS})
This created the executable using all object files:
[ 87%] Linking CXX executable ../bin/special-core-gpu
cd /gpfs/bbp.cscs.ch/home/kumbhar/workarena/systems/bbpv/repos/bbp/coreneuron/build/coreneuron && /gpfs/bbp.cscs.ch/ssd/apps/bsd/2021-11/stage_externals/install_gcc-11.2.0-skylake/cmake-3.21.4-cdyb7k/bin/cmake -E cmake_link_script CMakeFiles/special-core-gpu.dir/link.txt --verbose=1
/gpfs/bbp.cscs.ch/ssd/apps/bsd/2021-11/stage_externals/install_gcc-11.2.0-skylake/llvm-13.0.0-klsplt/bin/clang++ -fopenmp=libomp -Wl,--as-needed -fopenmp=libomp -fopenmp-targets=nvptx64-nvidia-cuda -Wno-unknown-cuda-version -I/gpfs/bbp.cscs.ch/ssd/apps/bsd/2021-11/stage_externals/install_gcc-11.2.0-skylake/cuda-11.4.2-yopegt/include CMakeFiles/special-core-gpu.dir/exe/_mod_func.cpp.o CMakeFiles/special-core-gpu.dir/exe/exp2syn.cpp.o CMakeFiles/special-core-gpu.dir/exe/expsyn.cpp.o CMakeFiles/special-core-gpu.dir/exe/halfgap.cpp.o CMakeFiles/special-core-gpu.dir/exe/hh.cpp.o CMakeFiles/special-core-gpu.dir/exe/netstim.cpp.o CMakeFiles/special-core-gpu.dir/exe/passive.cpp.o CMakeFiles/special-core-gpu.dir/exe/pattern.cpp.o CMakeFiles/special-core-gpu.dir/exe/stim.cpp.o CMakeFiles/special-core-gpu.dir/exe/svclmp.cpp.o CMakeFiles/special-core-gpu.dir/apps/coreneuron.cpp.o CMakeFiles/special-core-gpu.dir/mechanism/mech/dimplic.cpp.o CMakeFiles/special-core-gpu.dir/mechanism/mech/enginemech.cpp.o CMakeFiles/coreneuron.dir/apps/corenrn_parameters.cpp.o CMakeFiles/coreneuron.dir/apps/main1.cpp.o CMakeFiles/coreneuron.dir/gpu/nrn_acc_manager.cpp.o CMakeFiles/coreneuron.dir/io/core2nrn_data_return.cpp.o CMakeFiles/coreneuron.dir/io/file_utils.cpp.o CMakeFiles/coreneuron.dir/io/global_vars.cpp.o CMakeFiles/coreneuron.dir/io/lfp.cpp.o CMakeFiles/coreneuron.dir/io/mech_report.cpp.o CMakeFiles/coreneuron.dir/io/mem_layout_util.cpp.o CMakeFiles/coreneuron.dir/io/mk_mech.cpp.o CMakeFiles/coreneuron.dir/io/nrn2core_data_init.cpp.o CMakeFiles/coreneuron.dir/io/nrn_checkpoint.cpp.o CMakeFiles/coreneuron.dir/io/nrn_filehandler.cpp.o CMakeFiles/coreneuron.dir/io/nrn_setup.cpp.o CMakeFiles/coreneuron.dir/io/output_spikes.cpp.o CMakeFiles/coreneuron.dir/io/phase1.cpp.o CMakeFiles/coreneuron.dir/io/phase2.cpp.o CMakeFiles/coreneuron.dir/io/prcellstate.cpp.o CMakeFiles/coreneuron.dir/io/reports/binary_report_handler.cpp.o CMakeFiles/coreneuron.dir/io/reports/nrnreport.cpp.o CMakeFiles/coreneuron.dir/io/reports/report_configuration_parser.cpp.o CMakeFiles/coreneuron.dir/io/reports/report_event.cpp.o CMakeFiles/coreneuron.dir/io/reports/report_handler.cpp.o CMakeFiles/coreneuron.dir/io/reports/sonata_report_handler.cpp.o CMakeFiles/coreneuron.dir/io/setup_fornetcon.cpp.o CMakeFiles/coreneuron.dir/mechanism/capac.cpp.o CMakeFiles/coreneuron.dir/mechanism/eion.cpp.o CMakeFiles/coreneuron.dir/mechanism/mech_mapping.cpp.o CMakeFiles/coreneuron.dir/mechanism/patternstim.cpp.o CMakeFiles/coreneuron.dir/mechanism/register_mech.cpp.o CMakeFiles/coreneuron.dir/network/cvodestb.cpp.o CMakeFiles/coreneuron.dir/network/multisend.cpp.o CMakeFiles/coreneuron.dir/network/multisend_setup.cpp.o CMakeFiles/coreneuron.dir/network/netcvode.cpp.o CMakeFiles/coreneuron.dir/network/netpar.cpp.o CMakeFiles/coreneuron.dir/network/partrans.cpp.o CMakeFiles/coreneuron.dir/network/partrans_setup.cpp.o CMakeFiles/coreneuron.dir/network/tqueue.cpp.o CMakeFiles/coreneuron.dir/permute/balance.cpp.o CMakeFiles/coreneuron.dir/permute/cellorder.cpp.o CMakeFiles/coreneuron.dir/permute/cellorder1.cpp.o CMakeFiles/coreneuron.dir/permute/cellorder2.cpp.o CMakeFiles/coreneuron.dir/permute/data_layout.cpp.o CMakeFiles/coreneuron.dir/permute/node_permute.cpp.o CMakeFiles/coreneuron.dir/sim/fadvance_core.cpp.o CMakeFiles/coreneuron.dir/sim/fast_imem.cpp.o CMakeFiles/coreneuron.dir/sim/finitialize.cpp.o CMakeFiles/coreneuron.dir/sim/multicore.cpp.o CMakeFiles/coreneuron.dir/sim/solve_core.cpp.o CMakeFiles/coreneuron.dir/sim/treeset_core.cpp.o 
CMakeFiles/coreneuron.dir/utils/ispc/globals.cpp.o CMakeFiles/coreneuron.dir/utils/ivocvect.cpp.o CMakeFiles/coreneuron.dir/utils/lpt.cpp.o CMakeFiles/coreneuron.dir/utils/memory.cpp.o CMakeFiles/coreneuron.dir/utils/memory_utils.cpp.o CMakeFiles/coreneuron.dir/utils/nrn_stats.cpp.o CMakeFiles/coreneuron.dir/utils/nrnoc_aux.cpp.o CMakeFiles/coreneuron.dir/utils/nrntimeout.cpp.o CMakeFiles/coreneuron.dir/utils/progressbar/progressbar.cpp.o CMakeFiles/coreneuron.dir/utils/randoms/nrnran123.cpp.o CMakeFiles/coreneuron.dir/utils/string_utils.cpp.o CMakeFiles/coreneuron.dir/utils/utils.cpp.o CMakeFiles/coreneuron.dir/utils/vrecord.cpp.o CMakeFiles/coreneuron.dir/config/config.cpp.o CMakeFiles/coreneuron.dir/mpi/core/nrnmpi_def_cinc.cpp.o CMakeFiles/coreneuron.dir/mpi/core/nrnmpi.cpp.o CMakeFiles/coreneuron.dir/mpi/core/nrnmpidec.cpp.o -o ../bin/special-core-gpu -Wl,-rpath,/gpfs/bbp.cscs.ch/ssd/apps/bsd/2021-11/stage_externals/install_gcc-11.2.0-skylake/cuda-11.4.2-yopegt/lib64 -ldl /gpfs/bbp.cscs.ch/ssd/apps/bsd/2021-11/stage_externals/install_gcc-11.2.0-skylake/cuda-11.4.2-yopegt/lib64/libcudart.so
Unfortunately, this still gives a similar cryptic error at launch for both GPU and CPU execution:
kumbhar@r2i3n6:~/workarena/systems/bbpv/repos/bbp/coreneuron/build$ ./bin/special-core-gpu -e 1 -d ../tests/integration/ring
Duke, Yale, and the BlueBrain Project -- Copyright 1984-2020
Version : 1.0 4de7951f (2022-01-04 16:29:44 +0100)
CUDA error: Error returned from cuModuleLoadDataEx
CUDA error: out of memory
Libomptarget error: Unable to generate entries table for device id 0.
Libomptarget error: Failed to init globals on device 0
Libomptarget error: Run with LIBOMPTARGET_INFO=4 to dump host-target pointer mappings.
Libomptarget error: Source location information not present. Compile with -g or -gline-tables-only.
Libomptarget fatal error 1: failure of target construct while offloading is mandatory
Aborted
kumbhar@r2i3n6:~/workarena/systems/bbpv/repos/bbp/coreneuron/build$ ./bin/special-core-gpu -e 1 -d ../tests/integration/ring --gpu
Info : 4 GPUs shared by 1 ranks per node
Duke, Yale, and the BlueBrain Project -- Copyright 1984-2020
Version : 1.0 4de7951f (2022-01-04 16:29:44 +0100)
CUDA error: Error returned from cuModuleLoadDataEx
CUDA error: out of memory
Libomptarget error: Unable to generate entries table for device id 0.
Libomptarget error: Failed to init globals on device 0
Libomptarget error: Run with LIBOMPTARGET_INFO=4 to dump host-target pointer mappings.
Libomptarget error: Source location information not present. Compile with -g or -gline-tables-only.
Libomptarget fatal error 1: failure of target construct while offloading is mandatory
Aborted
I saw one such error reported here but that doesn't seem relevant.
- Next, I wanted to incrementally enable the OpenMP offload parts and find out what causes the above error. So the first thing I did was to disable the OpenMP offload parts in the mechanism cpp files, i.e. simply something like the following (see the sketch after these commands for what the nrn_pragma_omp wrappers do):
cd coreneuron/exe # this directory contains exp2syn.cpp expsyn.cpp halfgap.cpp hh.cpp _kinderiv.h _mod_func.cpp netstim.cpp passive.cpp pattern.cpp stim.cpp svclmp.cpp
sed -i 's#nrn_pragma_omp#//nrn_pragma_omp#g' *.cpp
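For context, a rough sketch of what the nrn_pragma_omp()/nrn_pragma_acc() wrappers being commented out here are assumed to do (this is the idea, not the exact CoreNEURON definition): they expand to a real pragma only in the corresponding offload build, and to nothing otherwise.

```cpp
// Sketch only: assumed shape of the offload pragma wrappers.
#define NRN_PRAGMA_STRINGIFY(...) _Pragma(#__VA_ARGS__)
#if defined(CORENEURON_ENABLE_GPU) && defined(CORENEURON_PREFER_OPENMP_OFFLOAD) && defined(_OPENMP)
#define nrn_pragma_omp(...) NRN_PRAGMA_STRINGIFY(omp __VA_ARGS__)
#define nrn_pragma_acc(...)
#elif defined(CORENEURON_ENABLE_GPU) && defined(_OPENACC)
#define nrn_pragma_omp(...)
#define nrn_pragma_acc(...) NRN_PRAGMA_STRINGIFY(acc __VA_ARGS__)
#else
#define nrn_pragma_omp(...)
#define nrn_pragma_acc(...)
#endif

// Illustrative use, in the style of the generated mechanism kernels: commenting out the
// nrn_pragma_omp(...) line removes the OpenMP offload of the loop but keeps the loop itself.
void scale(double* v, int n) {
    nrn_pragma_omp(target teams distribute parallel for map(tofrom: v[0:n]))
    for (int i = 0; i < n; ++i) {
        v[i] *= 2.0;
    }
}
```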
And after re-building, I saw that the special-core-gpu binary runs fine on both CPU and GPU! So the issue seems to be somehow related to the mechanism cpp files!
$ nvprof ./bin/special-core-gpu -e 1 -d ../tests/integration/ring --gpu
==112850== NVPROF is profiling process 112850, command: ./bin/special-core-gpu -e 1 -d ../tests/integration/ring --gpu
Info : 4 GPUs shared by 1 ranks per node
Duke, Yale, and the BlueBrain Project -- Copyright 1984-2020
Version : 1.0 4de7951f (2022-01-04 16:29:44 +0100)
Additional mechanisms from files
exp2syn.mod expsyn.mod halfgap.mod hh.mod netstim.mod passive.mod pattern.mod stim.mod svclmp.mod
Exp2Syn Reg
Memory (MBs) : After mk_mech : Max 298.0469, Min 298.0469, Avg 298.0469
Memory (MBs) : After MPI_Init : Max 296.1094, Min 296.1094, Avg 296.1094
Memory (MBs) : Before nrn_setup : Max 296.1406, Min 296.1406, Avg 296.1406
WARNING : GPU execution requires --cell-permute type 1 or 2. Setting it to 1.
Setup Done : 0.00 seconds
Model size : 84.19 kB
Memory (MBs) : After nrn_setup : Max 296.4258, Min 296.4258, Avg 296.4258
GENERAL PARAMETERS
--mpi=false
--mpi-lib=
--gpu=true
--dt=0.025
--tstop=1
GPU
--nwarp=65536
--cell-permute=0
--cuda-interface=false
INPUT PARAMETERS
--voltage=-65
--seed=-1
--datpath=../tests/integration/ring
--filesdat=files.dat
--pattern=
--report-conf=
--restore=
PARALLEL COMPUTATION PARAMETERS
--threading=false
--skip_mpi_finalize=false
SPIKE EXCHANGE
--ms_phases=2
--ms_subintervals=2
--multisend=false
--spk_compress=0
--binqueue=false
CONFIGURATION
--spikebuf=100000
--prcellgid=-1
--forwardskip=0
--celsius=6.3
--mindelay=1
--report-buffer-size=4
OUTPUT PARAMETERS
--dt_io=0.1
--outpath=.
--checkpoint=
Start time (t) = 0
Memory (MBs) : After mk_spikevec_buffer : Max 296.4258, Min 296.4258, Avg 296.4258
....
Memory (MBs) : After nrn_finitialize : Max 301.5273, Min 301.5273, Avg 301.5273
psolve |=========================================================| t: 1.00 ETA: 0h00m00s
Solver Time : 0.24223
Simulation Statistics
Number of cells: 20
Number of compartments: 804
Number of presyns: 21
Number of input presyns: 0
Number of synapses: 21
Number of point processes: 41
Number of transfer sources: 0
Number of transfer targets: 0
Number of spikes: 40
Number of spikes with non negative gid-s: 40
==112850== Profiling application: ./bin/special-core-gpu -e 1 -d ../tests/integration/ring --gpu
==112850== Profiling result:
Type Time(%) Time Calls Avg Min Max Name
GPU activities: 11.68% 20.892ms 160 130.57us 126.98us 134.59us __omp_offloading_2f_188133fe__ZN10coreneuron11nrn_cur_ionEPNS_9NrnThreadEPNS_9Memb_listEi_l273
9.64% 17.250ms 80 215.63us 186.62us 222.72us __omp_offloading_2f_2979dc5__ZN10coreneuron12nrn_state_hhEPNS_9NrnThreadEPNS_9Memb_listEi_l520
8.98% 16.057ms 160 100.35us 99.007us 101.92us __omp_offloading_2f_12055abf__ZN10coreneuron22net_buf_receive_ExpSynEPNS_9NrnThreadE_l290
7.72% 13.816ms 80 172.70us 169.18us 174.59us __omp_offloading_2f_2979dc5__ZN10coreneuron10nrn_cur_hhEPNS_9NrnThreadEPNS_9Memb_listEi_l472
6.87% 12.293ms 80 153.66us 152.13us 154.94us __omp_offloading_2f_12055abf__ZN10coreneuron14nrn_cur_ExpSynEPNS_9NrnThreadEPNS_9Memb_listEi_l383
6.82% 12.200ms 80 152.50us 150.37us 154.30us __omp_offloading_2f_3962f8e__ZN10coreneuron11nrn_cur_pasEPNS_9NrnThreadEPNS_9Memb_listEi_l276
6.07% 10.865ms 80 135.81us 135.23us 136.58us __omp_offloading_2f_12055abf__ZN10coreneuron16nrn_state_ExpSynEPNS_9NrnThreadEPNS_9Memb_listEi_l426
6.05% 10.826ms 80 135.33us 132.64us 136.80us __omp_offloading_2f_191507f3__ZN10coreneuronL7nrn_lhsEPNS_9NrnThreadE_l166
6.04% 10.806ms 80 135.07us 133.41us 137.73us __omp_offloading_2f_191507f3__ZN10coreneuronL7nrn_rhsEPNS_9NrnThreadE_l96
6.01% 10.746ms 80 134.32us 133.73us 135.33us __omp_offloading_2f_198f6a66__ZN10coreneuron21nrn_jacob_capacitanceEPNS_9NrnThreadEPNS_9Memb_listEi_l74
5.94% 10.621ms 80 132.76us 132.35us 136.48us __omp_offloading_2f_191507c2__ZN10coreneuron8NetCvode12check_threshEPNS_9NrnThreadE_l541
5.92% 10.589ms 80 132.36us 131.52us 133.31us __omp_offloading_2f_198f6a66__ZN10coreneuron19nrn_cur_capacitanceEPNS_9NrnThreadEPNS_9Memb_listEi_l120
5.84% 10.446ms 80 130.57us 127.07us 131.42us __omp_offloading_2f_191507f3__ZN10coreneuronL7nrn_rhsEPNS_9NrnThreadE_l37
5.79% 10.358ms 80 129.48us 128.90us 132.19us __omp_offloading_2f_1d8e98f0__ZN10coreneuron6updateEPNS_9NrnThreadE_l217
0.33% 594.30us 6 99.050us 2.4320us 446.46us [CUDA memset]
0.18% 318.05us 246 1.2920us 1.2470us 1.7920us [CUDA memcpy HtoD]
0.10% 185.74us 125 1.4850us 1.3750us 2.9440us [CUDA memcpy DtoH]
API calls: 38.95% 214.73ms 1 214.73ms 214.73ms 214.73ms cuDevicePrimaryCtxRetain
38.43% 211.84ms 1450 146.09us 756ns 275.66us cuStreamSynchronize
9.61% 52.977ms 1 52.977ms 52.977ms 52.977ms cuModuleLoadDataEx
4.59% 25.275ms 1 25.275ms 25.275ms 25.275ms cuModuleUnload
3.69% 20.328ms 6 3.3880ms 8.9020us 20.278ms cudaMallocManaged
2.10% 11.594ms 84 138.03us 11.220us 146.78us cuMemcpyDtoHAsync
1.17% 6.4369ms 1280 5.0280us 4.2990us 17.611us cuLaunchKernel
0.30% 1.6383ms 407 4.0250us 132ns 194.16us cuDeviceGetAttribute
0.28% 1.5530ms 4 388.24us 385.38us 395.38us cuDeviceTotalMem
0.26% 1.4583ms 1619 900ns 161ns 581.21us cuCtxSetCurrent
0.19% 1.0351ms 246 4.2070us 3.6880us 13.495us cuMemcpyHtoDAsync
0.15% 844.22us 6 140.70us 26.971us 516.61us cudaMemset
0.11% 591.25us 41 14.420us 11.523us 44.503us cuMemcpyDtoH
0.07% 407.33us 32 12.729us 1.7660us 190.68us cuStreamCreate
0.03% 172.74us 4 43.183us 36.483us 57.844us cuDeviceGetName
0.02% 125.94us 32 3.9350us 2.2320us 27.967us cuStreamDestroy
0.02% 95.955us 373 257ns 144ns 4.1180us cuGetProcAddress
0.01% 44.318us 54 820ns 292ns 5.4780us cuModuleGetGlobal
0.01% 30.319us 41 739ns 377ns 4.3020us cuModuleGetFunction
0.00% 16.994us 1 16.994us 16.994us 16.994us cuMemAlloc
0.00% 13.340us 1 13.340us 13.340us 13.340us cudaSetDevice
0.00% 12.246us 4 3.0610us 985ns 8.0850us cuDeviceGetPCIBusId
0.00% 5.9690us 9 663ns 141ns 3.1970us cuDeviceGet
0.00% 2.8740us 1 2.8740us 2.8740us 2.8740us cuDevicePrimaryCtxGetState
0.00% 2.7800us 2 1.3900us 201ns 2.5790us cuCtxGetLimit
0.00% 1.7460us 5 349ns 256ns 595ns cuFuncGetAttribute
0.00% 1.1300us 4 282ns 202ns 443ns cuDeviceGetCount
0.00% 946ns 4 236ns 161ns 322ns cuDeviceGetUuid
0.00% 895ns 1 895ns 895ns 895ns cuInit
0.00% 737ns 1 737ns 737ns 737ns cuDevicePrimaryCtxSetFlags
0.00% 591ns 1 591ns 591ns 591ns cuCtxGetDevice
0.00% 371ns 1 371ns 371ns 371ns cuDevicePrimaryCtxRelease
0.00% 217ns 1 217ns 217ns 217ns cuDriverGetVersion
==112850== Unified Memory profiling result:
Device "Tesla V100-SXM2-16GB (0)"
Count Avg Size Min Size Max Size Total Size Total Time Name
4 32.000KB 4.0000KB 60.000KB 128.0000KB 20.64000us Host To Device
6 32.000KB 4.0000KB 60.000KB 192.0000KB 21.34400us Device To Host
3 - - - - 575.5490us Gpu page fault groups
16 4.0000KB 4.0000KB 4.0000KB 64.00000KB - Memory thrashes
Total CPU Page faults: 3
Total CPU thrashes: 16
- Then, I was able to isolate the issue to a single file, exp2syn.cpp: if I comment out all OpenMP offload pragmas in exp2syn.cpp then the special-core-gpu binary works fine!
But the funny part with exp2syn.cpp is that this mechanism is not used in the ringtest model at all:
$ nvprof ./bin/special-core-gpu -e 1 -d ../tests/integration/ring --gpu --model-stats
==112982== NVPROF is profiling process 112982, command: ./bin/special-core-gpu -e 1 -d ../tests/integration/ring --gpu --model-stats
Info : 4 GPUs shared by 1 ranks per node
Duke, Yale, and the BlueBrain Project -- Copyright 1984-2020
Version : 1.0 4de7951f (2022-01-04 16:29:44 +0100)
Additional mechanisms from files
exp2syn.mod expsyn.mod halfgap.mod hh.mod netstim.mod passive.mod pattern.mod stim.mod svclmp.mod
Memory (MBs) : After mk_mech : Max 298.0469, Min 298.0469, Avg 298.0469
Memory (MBs) : After MPI_Init : Max 296.1094, Min 296.1094, Avg 296.1094
Memory (MBs) : Before nrn_setup : Max 296.1406, Min 296.1406, Avg 296.1406
WARNING : GPU execution requires --cell-permute type 1 or 2. Setting it to 1.
================ MECHANISMS COUNT BY TYPE ==================
Id Name Count
0 (null) 0
1 (null) 0
2 morphology 0
3 capacitance 392
4 pas 372
5 extracellular 0
6 fastpas 0
7 IClamp 0
8 AlphaSynapse 0
9 ExpSyn 40
10 Exp2Syn 0
11 SEClamp 0
12 VClamp 0
13 OClamp 0
14 APCount 0
15 na_ion 20
16 k_ion 20
17 hh 20
18 NetStim 1
19 IntFire1 0
20 IntFire2 0
21 IntFire4 0
22 PointProcessMark 0
23 PatternStim 0
24 HalfGap 0
You can see that Exp2Syn count is 0!
- In exp2syn.cpp, if I enable a single OpenMP pragma, e.g. in the function nrn_state_Exp2Syn() (which is not executed!), then I still get the error below (see the sketch after this log for why an unexecuted kernel still matters):
$ nvprof ./bin/special-core-gpu -e 1 -d ../tests/integration/ring --gpu --model-stats
==113465== NVPROF is profiling process 113465, command: ./bin/special-core-gpu -e 1 -d ../tests/integration/ring --gpu --model-stats
Info : 4 GPUs shared by 1 ranks per node
Duke, Yale, and the BlueBrain Project -- Copyright 1984-2020
Version : 1.0 4de7951f (2022-01-04 16:29:44 +0100)
CUDA error: Error returned from cuModuleLoadDataEx
CUDA error: out of memory
Libomptarget error: Unable to generate entries table for device id 0.
Libomptarget error: Failed to init globals on device 0
Libomptarget error: Run with LIBOMPTARGET_INFO=4 to dump host-target pointer mappings.
Libomptarget error: Source location information not present. Compile with -g or -gline-tables-only.
Libomptarget fatal error 1: failure of target construct while offloading is mandatory
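A minimal sketch of why an unexecuted kernel still matters here (my understanding, stated as an assumption): every target region contributes an entry to the offload image, and libomptarget registers that image at program load and loads it when the device is initialised, so a broken kernel can fail at startup even if it is never called.

```cpp
// Sketch: the target region below is never executed, but its device code is still part of
// the offload image that is registered at program load and loaded on device initialisation.
#include <cstdio>

void never_called() {
    int x = 0;
#pragma omp target map(tofrom: x)
    { x = 1; }
}

int main() {
    // No call to never_called(), yet its kernel must still load successfully once the
    // device is initialised by any other target construct in the program.
    std::printf("startup ok\n");
    return 0;
}
```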
- During this experimentation I realised that the CoreNEURON CMake setup doesn't set any optimisation flags when the LLVM compiler is used. See https://github.com/BlueBrain/CoreNeuron/pull/734/files. So I compiled exp2syn.cpp with the optimisation flag -O1 and the error disappeared!
kumbhar@r2i3n6:~/workarena/systems/bbpv/repos/bbp/coreneuron/build/coreneuron$ rm ../bin/special-core-gpu
kumbhar@r2i3n6:~/workarena/systems/bbpv/repos/bbp/coreneuron/build/coreneuron$ /gpfs/bbp.cscs.ch/ssd/apps/bsd/2021-11/stage_externals/install_gcc-11.2.0-skylake/llvm-13.0.0-klsplt/bin/clang++ -DADDITIONAL_MECHS -DCORENEURON_BUILD -DCORENEURON_CUDA_PROFILING -DCORENEURON_ENABLE_GPU -DCORENEURON_PREFER_OPENMP_OFFLOAD -DDISABLE_HOC_EXP -DENABLE_SPLAYTREE_QUEUING -DLAYOUT=0 -DNRNMPI=0 -DNRN_MULTISEND=0 -I/gpfs/bbp.cscs.ch/ssd/apps/bsd/2021-11/stage_externals/install_gcc-11.2.0-skylake/cuda-11.4.2-yopegt/targets/x86_64-linux/include -I/gpfs/bbp.cscs.ch/home/kumbhar/workarena/systems/bbpv/repos/bbp/coreneuron/build/include -I/gpfs/bbp.cscs.ch/home/kumbhar/workarena/systems/bbpv/repos/bbp/coreneuron/coreneuron/utils/randoms -I/gpfs/bbp.cscs.ch/home/kumbhar/workarena/systems/bbpv/repos/bbp/coreneuron -I/gpfs/bbp.cscs.ch/home/kumbhar/workarena/systems/bbpv/repos/bbp/coreneuron/build/coreneuron -isystem /gpfs/bbp.cscs.ch/ssd/apps/bsd/2021-11/stage_externals/install_gcc-11.2.0-skylake/cuda-11.4.2-yopegt/include -fopenmp=libomp -fopenmp=libomp -fopenmp-targets=nvptx64-nvidia-cuda -Wno-unknown-cuda-version -I/gpfs/bbp.cscs.ch/ssd/apps/bsd/2021-11/stage_externals/install_gcc-11.2.0-skylake/cuda-11.4.2-yopegt/include -std=c++14 -MD -MT coreneuron/CMakeFiles/special-core-gpu.dir/exe/exp2syn.cpp.o -MF CMakeFiles/special-core-gpu.dir/exe/exp2syn.cpp.o.d -o CMakeFiles/special-core-gpu.dir/exe/exp2syn.cpp.o -c /gpfs/bbp.cscs.ch/home/kumbhar/workarena/systems/bbpv/repos/bbp/coreneuron/coreneuron/exe/exp2syn.cpp -O1
kumbhar@r2i3n6:~/workarena/systems/bbpv/repos/bbp/coreneuron/build/coreneuron$ make
[ 1%] Built target pyembed
[ 2%] Built target fmt
[ 3%] Built target pywrapper
[ 3%] Built target pyastgen
[ 18%] Built target lexer_obj
[ 18%] Built target lexer
[ 21%] Built target util_obj
[ 22%] Built target util
[ 38%] Built target visitor_obj
[ 39%] Built target visitor
[ 43%] Built target codegen
[ 45%] Built target printer_obj
[ 45%] Built target printer
[ 46%] Built target symtab_obj
[ 47%] Built target symtab
[ 48%] Built target nmodl
[ 48%] Built target nrnivmodl-core
[ 50%] Built target kin_deriv_header
[ 83%] Built target coreneuron
[ 86%] Built target scopmath
Consolidate compiler generated dependencies of target special-core-gpu
[ 87%] Linking CXX executable ../bin/special-core-gpu
[ 93%] Built target special-core-gpu
[100%] Built target coreneuron-copy-nrnivmodl-core-dependencies
kumbhar@r2i3n6:~/workarena/systems/bbpv/repos/bbp/coreneuron/build/coreneuron$ nvprof ../bin/special-core-gpu -e 1 -d ../../tests/integration/ring --gpu --model-stats
==115317== NVPROF is profiling process 115317, command: ../bin/special-core-gpu -e 1 -d ../../tests/integration/ring --gpu --model-stats
Info : 4 GPUs shared by 1 ranks per node
Duke, Yale, and the BlueBrain Project -- Copyright 1984-2020
Version : 1.0 4de7951f (2022-01-04 16:29:44 +0100)
Additional mechanisms from files
exp2syn.mod expsyn.mod halfgap.mod hh.mod netstim.mod passive.mod pattern.mod stim.mod svclmp.mod
Exp2Syn Reg
Memory (MBs) : After mk_mech : Max 297.8047, Min 297.8047, Avg 297.8047
Memory (MBs) : After MPI_Init : Max 296.2422, Min 296.2422, Avg 296.2422
Memory (MBs) : Before nrn_setup : Max 296.2734, Min 296.2734, Avg 296.2734
WARNING : GPU execution requires --cell-permute type 1 or 2. Setting it to 1.
....
Memory (MBs) : After nrn_finitialize : Max 301.6562, Min 301.6562, Avg 301.6562
psolve |=========================================================| t: 1.00 ETA: 0h00m01s
Solver Time : 0.253621
Simulation Statistics
Number of cells: 20
Number of compartments: 804
Number of presyns: 21
Number of input presyns: 0
Number of synapses: 21
Number of point processes: 41
Number of transfer sources: 0
Number of transfer targets: 0
Number of spikes: 40
Number of spikes with non negative gid-s: 40
==115317== Profiling application: ../bin/special-core-gpu -e 1 -d ../../tests/integration/ring --gpu --model-stats
==115317== Profiling result:
Type Time(%) Time Calls Avg Min Max Name
GPU activities: 11.69% 21.776ms 160 136.10us 127.74us 143.14us __omp_offloading_2f_188133fe__ZN10coreneuron11nrn_cur_ionEPNS_9NrnThreadEPNS_9Memb_listEi_l273
9.66% 17.998ms 80 224.98us 198.98us 236.25us __omp_offloading_2f_2979dc5__ZN10coreneuron12nrn_state_hhEPNS_9NrnThreadEPNS_9Memb_listEi_l520
8.98% 16.731ms 160 104.57us 99.968us 108.10us __omp_offloading_2f_12055abf__ZN10coreneuron22net_buf_receive_ExpSynEPNS_9NrnThreadE_l290
7.82% 14.565ms 80 182.06us 176.06us 187.84us __omp_offloading_2f_2979dc5__ZN10coreneuron10nrn_cur_hhEPNS_9NrnThreadEPNS_9Memb_listEi_l472
6.86% 12.780ms 80 159.75us 153.70us 163.58us __omp_offloading_2f_12055abf__ZN10coreneuron14nrn_cur_ExpSynEPNS_9NrnThreadEPNS_9Memb_listEi_l383
6.82% 12.697ms 80 158.71us 152.13us 163.20us __omp_offloading_2f_3962f8e__ZN10coreneuron11nrn_cur_pasEPNS_9NrnThreadEPNS_9Memb_listEi_l276
6.04% 11.251ms 80 140.64us 135.58us 144.13us __omp_offloading_2f_12055abf__ZN10coreneuron16nrn_state_ExpSynEPNS_9NrnThreadEPNS_9Memb_listEi_l426
6.04% 11.250ms 80 140.62us 134.53us 146.59us __omp_offloading_2f_191507f3__ZN10coreneuronL7nrn_rhsEPNS_9NrnThreadE_l96
6.03% 11.240ms 80 140.50us 135.33us 143.87us __omp_offloading_2f_198f6a66__ZN10coreneuron21nrn_jacob_capacitanceEPNS_9NrnThreadEPNS_9Memb_listEi_l74
6.01% 11.203ms 80 140.04us 133.09us 145.22us __omp_offloading_2f_191507f3__ZN10coreneuronL7nrn_lhsEPNS_9NrnThreadE_l166
5.95% 11.076ms 80 138.45us 133.34us 144.96us __omp_offloading_2f_191507c2__ZN10coreneuron8NetCvode12check_threshEPNS_9NrnThreadE_l541
5.92% 11.035ms 80 137.94us 132.42us 141.38us __omp_offloading_2f_198f6a66__ZN10coreneuron19nrn_cur_capacitanceEPNS_9NrnThreadEPNS_9Memb_listEi_l120
5.80% 10.795ms 80 134.94us 130.05us 139.90us __omp_offloading_2f_1d8e98f0__ZN10coreneuron6updateEPNS_9NrnThreadE_l217
5.77% 10.751ms 80 134.39us 129.34us 137.79us __omp_offloading_2f_191507f3__ZN10coreneuronL7nrn_rhsEPNS_9NrnThreadE_l37
0.32% 598.59us 6 99.765us 2.7200us 447.39us [CUDA memset]
0.17% 323.17us 246 1.3130us 1.2470us 1.8240us [CUDA memcpy HtoD]
0.11% 195.62us 126 1.5520us 1.3760us 3.0090us [CUDA memcpy DtoH]
API calls: 39.19% 222.11ms 1450 153.18us 811ns 3.8766ms cuStreamSynchronize
38.77% 219.72ms 1 219.72ms 219.72ms 219.72ms cuDevicePrimaryCtxRetain
9.24% 52.366ms 1 52.366ms 52.366ms 52.366ms cuModuleLoadDataEx
4.44% 25.182ms 1 25.182ms 25.182ms 25.182ms cuModuleUnload
3.59% 20.330ms 6 3.3883ms 9.0800us 20.277ms cudaMallocManaged
2.13% 12.063ms 84 143.60us 11.176us 157.59us cuMemcpyDtoHAsync
1.20% 6.7851ms 1280 5.3000us 4.5340us 19.189us cuLaunchKernel
0.29% 1.6442ms 407 4.0390us 133ns 187.16us cuDeviceGetAttribute
0.28% 1.5656ms 4 391.41us 387.70us 395.00us cuDeviceTotalMem
0.25% 1.4421ms 1619 890ns 159ns 579.67us cuCtxSetCurrent
0.19% 1.0798ms 246 4.3890us 3.7420us 17.665us cuMemcpyHtoDAsync
0.15% 853.90us 6 142.32us 27.310us 519.82us cudaMemset
0.11% 647.30us 42 15.411us 11.264us 49.087us cuMemcpyDtoH
0.07% 412.45us 32 12.889us 1.7620us 191.92us cuStreamCreate
0.03% 168.17us 4 42.041us 36.633us 56.245us cuDeviceGetName
0.02% 112.19us 32 3.5050us 2.1360us 21.674us cuStreamDestroy
0.02% 108.93us 373 292ns 138ns 4.2190us cuGetProcAddress
0.01% 44.237us 55 804ns 401ns 3.3740us cuModuleGetGlobal
0.01% 34.313us 42 816ns 354ns 6.8130us cuModuleGetFunction
0.00% 14.817us 1 14.817us 14.817us 14.817us cudaSetDevice
0.00% 13.742us 1 13.742us 13.742us 13.742us cuMemAlloc
0.00% 11.865us 4 2.9660us 937ns 7.8230us cuDeviceGetPCIBusId
0.00% 7.1150us 9 790ns 142ns 4.0720us cuDeviceGet
0.00% 2.6710us 1 2.6710us 2.6710us 2.6710us cuDevicePrimaryCtxGetState
0.00% 2.5190us 2 1.2590us 224ns 2.2950us cuCtxGetLimit
0.00% 2.2450us 5 449ns 231ns 744ns cuFuncGetAttribute
0.00% 1.2930us 4 323ns 180ns 528ns cuDeviceGetCount
0.00% 1.1640us 1 1.1640us 1.1640us 1.1640us cuInit
0.00% 984ns 4 246ns 162ns 333ns cuDeviceGetUuid
0.00% 674ns 1 674ns 674ns 674ns cuDevicePrimaryCtxSetFlags
0.00% 502ns 1 502ns 502ns 502ns cuCtxGetDevice
0.00% 359ns 1 359ns 359ns 359ns cuDevicePrimaryCtxRelease
0.00% 215ns 1 215ns 215ns 215ns cuDriverGetVersion
..
So, using #734, I am able to avoid the "CUDA error: out of memory" error.
TODO: go back to the nrnivmodl-core based build and see how this could be tested there (tomorrow).
Summary of the IBM XL compiler
Issues reported on the LLVM openmp-dev mailing list: https://lists.llvm.org/pipermail/openmp-dev/2022-January/004276.html
- Issue/Question #1: both the shared library and the static library work
- Issue/Question #2: the static library works but the shared library fails to link:
+ CXX=xlc++_r
+ CXXFLAGS='-fopenmp -fPIC -qsmp=omp -qoffload -g -O2'
+ xlc++_r -fopenmp -fPIC -qsmp=omp -qoffload -g -O2 -c test.cpp
+ ar cq libtest.a test.o
+ xlc++_r -fopenmp -fPIC -qsmp=omp -qoffload -g -O2 -o test1 main.cpp -L. -ltest
+ xlc++_r -fopenmp -fPIC -qsmp=omp -qoffload -g -O2 -o test2 main.cpp test.o
+ rm test.o
+ xlc++_r -fopenmp -fPIC -qsmp=omp -qoffload -g -O2 -fpic -shared test.cpp -o libtest.so
/cineca/prod/opt/compilers/xl/16.1.1_sp4.1/binary/xlC/16.1.1/bin/.orig/xlc++_r: warning: 1501-269 fpic is not supported on this Operating System platform. Option fpic will be ignored.
/cineca/prod/opt/compilers/xl/16.1.1_sp4.1/binary/xlC/16.1.1/bin/.orig/xlc++_r: warning: 1501-308 The device linker only supports static linking. Any device code placed into a shared library by the qmkshrobj option will be inaccessible.
/cineca/prod/opt/compilers/xl/16.1.1_sp4.1/binary/xlC/16.1.1/bin/.orig/minilink: warning: 1501-308 The device linker only supports static linking. Any device code placed into a shared library by the qmkshrobj option will be inaccessible.
+ xlc++_r -fopenmp -fPIC -qsmp=omp -qoffload -g -O2 -o test3 main.cpp -L. -ltest -Wl,-rpath .
nvlink error : Undefined reference to 'y' in '/tmp/24507_0.o'
...
...
$ nvprof ./test1
==29304== NVPROF is profiling process 29304, command: ./test1
--> 0
--> 1
--> 4
--> 2
--> 3
==29304== Profiling application: ./test1
==29304== Profiling result:
Type Time(%) Time Calls Avg Min Max Name
GPU activities: 79.20% 41.659us 1 41.659us 41.659us 41.659us __xl_main_l11_OL_1
11.99% 6.3040us 3 2.1010us 1.8560us 2.5280us [CUDA memcpy DtoH]
4.98% 2.6200us 1 2.6200us 2.6200us 2.6200us __xl__Z4testv_l8_OL_1
- Another issue that took quite some time to debug is the following (a historically famous one):
int *_displ = nrb->_displ;
int _displ_cnt = nrb->_displ_cnt;
#pragma omp target update to(_displ[0:_displ_cnt])
vs.
#pragma omp target update to(nrb->_displ[0:nrb->_displ_cnt])
- The first form updates the contents of the array.
- The second form updates the _displ pointer itself from host to device (and hence results in a wrong pointer on the device side); a standalone sketch of the difference follows.
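A standalone sketch of the difference (hypothetical struct/field names; it assumes the struct and its array member are mapped up front, as CoreNEURON does for its net-receive buffers):

```cpp
#include <cstdio>

// Hypothetical stand-in for the net-receive buffer struct discussed above.
struct NetReceiveBufferLike {
    int* _displ;
    int _displ_cnt;
};

int main() {
    NetReceiveBufferLike host_buf{new int[4]{0, 1, 2, 3}, 4};
    NetReceiveBufferLike* nrb = &host_buf;

    // Map the struct, then the array section; the runtime attaches the device copy of
    // nrb->_displ to the device allocation of the array.
#pragma omp target enter data map(to: nrb[0:1])
#pragma omp target enter data map(to: nrb->_displ[0:nrb->_displ_cnt])

    host_buf._displ[0] = 42;  // change host data after the initial copy

    // First form: refresh the array contents through plain host-side locals; the
    // device-side copy of the _displ pointer is left alone.
    int* _displ = nrb->_displ;
    int _displ_cnt = nrb->_displ_cnt;
#pragma omp target update to(_displ[0:_displ_cnt])

    // Second form (the problematic one described above): on the toolchain discussed here
    // it was observed to end up putting the host pointer value into the device-side struct.
    // #pragma omp target update to(nrb->_displ[0:nrb->_displ_cnt])

    int seen = -1;
#pragma omp target map(from: seen)
    { seen = nrb->_displ[0]; }
    std::printf("device sees nrb->_displ[0] = %d (expect 42)\n", seen);

#pragma omp target exit data map(release: nrb->_displ[0:nrb->_displ_cnt])
#pragma omp target exit data map(release: nrb[0:1])
    delete[] host_buf._displ;
    return 0;
}
```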
In coreneuron/gpu/nrn_acc_manager.cpp under update_net_receive_buffer() I did:
#if 0
#pragma omp target update to(_displ[0:nrb->_displ_cnt])
#else
#pragma omp target update to(nrb->_displ[0:nrb->_displ_cnt])
#endif
#pragma omp target
{
printf("nrb->_displ :%p \n", nrb->_displ);
}
abort();
And this produces the following in the first and second case, respectively:
Updating nrb now
nrb->_displ :0x7ffef000b100
Aborted (core dumped)
....
nrb->_displ :0x26a6b9c0
Aborted (core dumped)