
Fixes for building with LLVM / XL OpenMP offload

Open pramodk opened this issue 4 years ago • 11 comments

Various changes (some temporary) to get the XL OpenMP offload build working:
 * todo: need to rebase this branch on latest hackathon_master
 * todo: temporary changes to OpenAccHelper.cmake, needs refinement
 * todo: see Caliper linking issue
 * todo: _OPENACC needs to be renamed to CORENRN_ENABLE_GPU so that OpenMP-based
         builds can also use GPU offload.
 * todo: hardcoded CXX flags for a quick build

How to test this?

Outstanding issues?

  • [ ] Building with XLC gives (57cecb59d051b83adbca41fbe49680521b6d719f):
make[2]: *** [coreneuron/CMakeFiles/coreneuron.dir/build.make:114: coreneuron/CMakeFiles/coreneuron.dir/io/core2nrn_data_return.cpp.o] Error 1
In file included from /m100/home/userexternal/pkumbhar/CoreNeuron/coreneuron/io/mech_report.cpp:12:
In file included from /m100/home/userexternal/pkumbhar/CoreNeuron/coreneuron/coreneuron.hpp:24:
In file included from /m100/home/userexternal/pkumbhar/CoreNeuron/coreneuron/utils/randoms/nrnran123.h:42:
In file included from /m100/home/userexternal/pkumbhar/CoreNeuron/build/include/Random123/philox.h:37:
In file included from /m100/home/userexternal/pkumbhar/CoreNeuron/build/include/Random123/features/compilerfeatures.h:218:
In file included from /m100/home/userexternal/pkumbhar/CoreNeuron/build/include/Random123/features/clangfeatures.h:91:
/m100/home/userexternal/pkumbhar/CoreNeuron/build/include/Random123/features/gccfeatures.h:48:10: fatal error: 'ppu_intrinsics.h' file not found
#include <ppu_intrinsics.h>
         ^~~~~~~~~~~~~~~~~~
  • [ ] Running ringtest with XLC gives (57cecb59d051b83adbca41fbe49680521b6d719f):
(venv) [pkumbhar@login01 build]$ ./bin/ppc64le/special-core  -d ../tests/integration/ring --gpu --cell-permute=2
 Info : 4 GPUs shared by 1 ranks per node

 Duke, Yale, and the BlueBrain Project -- Copyright 1984-2020
 Version : 1.0 57cecb59 (2022-01-03 17:00:52 +0100)

 Additional mechanisms from files
 exp2syn.mod expsyn.mod halfgap.mod hh.mod netstim.mod passive.mod pattern.mod stim.mod svclmp.mod

 Memory (MBs) :             After mk_mech : Max 390.8125, Min 390.8125, Avg 390.8125
 Memory (MBs) :            After MPI_Init : Max 390.8125, Min 390.8125, Avg 390.8125
 Memory (MBs) :          Before nrn_setup : Max 390.8125, Min 390.8125, Avg 390.8125
best_balance=0.848837 ncell=10 ntype=3 nwarp=10
best_balance=0.82093 ncell=10 ntype=3 nwarp=10
 Setup Done   : 0.13 seconds
 Model size   : 84.19 kB
 Memory (MBs) :          After nrn_setup  : Max 398.8750, Min 398.8750, Avg 398.8750
GENERAL PARAMETERS
--mpi=false
--mpi-lib=
--gpu=true
--dt=0.025
--tstop=100

GPU
--nwarp=65536
--cell-permute=2
--cuda-interface=false

INPUT PARAMETERS
--voltage=-65
--seed=-1
--datpath=../tests/integration/ring
--filesdat=files.dat
--pattern=
--report-conf=
--restore=

PARALLEL COMPUTATION PARAMETERS
--threading=false
--skip_mpi_finalize=false

SPIKE EXCHANGE
--ms_phases=2
--ms_subintervals=2
--multisend=false
--spk_compress=0
--binqueue=false

CONFIGURATION
--spikebuf=100000
--prcellgid=-1
--forwardskip=0
--celsius=6.3
--mindelay=1
--report-buffer-size=4

OUTPUT PARAMETERS
--dt_io=0.1
--outpath=.
--checkpoint=

 Start time (t) = 0

 Memory (MBs) :  After mk_spikevec_buffer : Max 398.8750, Min 398.8750, Avg 398.8750
 Memory (MBs) :     After nrn_finitialize : Max 398.6875, Min 398.6875, Avg 398.6875

1587-175 The underlying GPU runtime reported the following error "an illegal memory access was encountered".
1587-175 The underlying GPU runtime reported the following error "an illegal memory access was encountered".
1587-163 Error encountered while attempting to execute on the target device 0.  The program will stop.
1587-163 Error encountered while attempting to execute on the target device 0.  The program will stop.
free(): corrupted unsorted chunks

The issue seems to be with net_buf_receive_ExpSyn(): if I comment it out, execution proceeds further.

Running for 0 msec gives:

(venv) [pkumbhar@login01 build]$ gdb --args ./bin/ppc64le/special-core  -d ../tests/integration/ring --gpu --cell-permute=2 -e 0
GNU gdb (GDB) Red Hat Enterprise Linux 8.2-6.el8_0
...
(gdb) r
Starting program: /m100/home/userexternal/pkumbhar/CoreNeuron/build/bin/ppc64le/special-core -d ../tests/integration/ring --gpu --cell-permute=2 -e 0
Missing separate debuginfos, use: dnf debuginfo-install glibc-2.28-72.el8_1.1.ppc64le
Missing separate debuginfo for /cineca/prod/opt/compilers/cuda/11.2/none/compat/libcuda.so.1
Try: dnf --enablerepo='*debug*' install /usr/lib/debug/.build-id/d1/e9e189d76f924564adf8c9d73eeb713d5b23d9.debug
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/power9/libthread_db.so.1".
[New Thread 0x7fffefcbd890 (LWP 128546)]
 Info : 4 GPUs shared by 1 ranks per node

 Duke, Yale, and the BlueBrain Project -- Copyright 1984-2020
 Version : 1.0 57cecb59 (2022-01-03 17:00:52 +0100)

[New Thread 0x7fffcfffd890 (LWP 128551)]
 Additional mechanisms from files
 exp2syn.mod expsyn.mod halfgap.mod hh.mod netstim.mod passive.mod pattern.mod stim.mod svclmp.mod

 Memory (MBs) :             After mk_mech : Max 390.7500, Min 390.7500, Avg 390.7500
 Memory (MBs) :            After MPI_Init : Max 390.7500, Min 390.7500, Avg 390.7500
 Memory (MBs) :          Before nrn_setup : Max 390.7500, Min 390.7500, Avg 390.7500
best_balance=0.848837 ncell=10 ntype=3 nwarp=10
best_balance=0.82093 ncell=10 ntype=3 nwarp=10
 Setup Done   : 0.00 seconds
 Model size   : 84.19 kB
 Memory (MBs) :          After nrn_setup  : Max 390.7500, Min 390.7500, Avg 390.7500
GENERAL PARAMETERS
--mpi=false
--mpi-lib=
--gpu=true
--dt=0.025
--tstop=0

GPU
--nwarp=65536
--cell-permute=2
--cuda-interface=false

INPUT PARAMETERS
--voltage=-65
--seed=-1
--datpath=../tests/integration/ring
--filesdat=files.dat
--pattern=
--report-conf=
--restore=

PARALLEL COMPUTATION PARAMETERS
--threading=false
--skip_mpi_finalize=false

SPIKE EXCHANGE
--ms_phases=2
--ms_subintervals=2
--multisend=false
--spk_compress=0
--binqueue=false

CONFIGURATION
--spikebuf=100000
--prcellgid=-1
--forwardskip=0
--celsius=6.3
--mindelay=1
--report-buffer-size=4

OUTPUT PARAMETERS
--dt_io=0.1
--outpath=.
--checkpoint=

 Start time (t) = 0

 Memory (MBs) :  After mk_spikevec_buffer : Max 390.7500, Min 390.7500, Avg 390.7500
 Memory (MBs) :     After nrn_finitialize : Max 390.5625, Min 390.5625, Avg 390.5625

psolve |=========================================================| t: 0.00   ETA: 0h00m00s

Solver Time : 4.69685e-05

Thread 1 "special-core" received signal SIGSEGV, Segmentation fault.
IPRA.$_ZN10coreneuronL17update_ml_on_hostEPKNS_9Memb_listEi (ml=0x7fff50009f00, type=-16288) at /m100/home/userexternal/pkumbhar/CoreNeuron/coreneuron/gpu/nrn_acc_manager.cpp:180
180	    int n = ml->nodecount;
Missing separate debuginfos, use: dnf debuginfo-install libgcc-8.3.1-4.5.el8.ppc64le libstdc++-8.3.1-4.5.el8.ppc64le
(gdb) bt
#0  IPRA.$_ZN10coreneuronL17update_ml_on_hostEPKNS_9Memb_listEi (ml=0x7fff50009f00, type=-16288) at /m100/home/userexternal/pkumbhar/CoreNeuron/coreneuron/gpu/nrn_acc_manager.cpp:180
#1  0x000000001014c28c in coreneuron::update_nrnthreads_on_host () at /m100/home/userexternal/pkumbhar/CoreNeuron/coreneuron/gpu/nrn_acc_manager.cpp:831
#2  0x00000000100df048 in run_solve_core (argc=-14016, argv=0x0) at /m100/home/userexternal/pkumbhar/CoreNeuron/coreneuron/apps/main1.cpp:630
#3  0x000000001003a048 in solve_core (argc=-13968, argv=0x0) at /m100/home/userexternal/pkumbhar/CoreNeuron/build/share/coreneuron/enginemech.cpp:49
#4  0x0000000010039fd8 in main (argc=-13936, argv=0x7fff0000002e) at /m100/home/userexternal/pkumbhar/CoreNeuron/build/share/coreneuron/coreneuron.cpp:14

If I comment out update_nrnthreads_on_host(), it runs further.

CI_BRANCHES:NMODL_BRANCH=hackathon_main,NEURON_BRANCH=master,

pramodk avatar Dec 10 '21 00:12 pramodk

On Ascent @ ORNL:

module load nvhpc/21.9 python/3.7.0 cmake flex bison
module swap cuda/10.1.243 cuda/11.0.2
module use /ccsopen/proj/gen170/neuron/spack_modules/linux-rhel7-power9le
module load caliper ninja py-pytest py-sympy py-jinja2 py-pyyaml boost
export NVLOCALRC=/ccsopen/proj/gen170/neuron/nersc-gpu-hackathon-dec-2021/localrc
export PATH=/sw/ascent/gcc/6.4.0/bin:$PATH
module load xl/16.1.1-10

Configure XL for newer GCC:

xlc_configure -gcc /sw/ascent/gcc/10.2.0 -o /ccsopen/proj/gen170/neuron/nersc-gpu-hackaxlc_gcc10.cfg -cuda /sw/ascent/cuda/11.0.2 -cudaVersion 11.0 -cuda_cc_major 7 -cuda_cc_minor 0

And build with:

cmake .. -DCORENRN_ENABLE_CALIPER_PROFILING=ON -DCORENRN_ENABLE_GPU=ON -DCORENRN_ENABLE_NMODL=ON -DCMAKE_INSTALL_PREFIX=../install  -DCMAKE_CXX_FLAGS="-DR123_USE_SSE=0"  -DCMAKE_CUDA_ARCHITECTURES=70 -DCMAKE_CUDA_COMPILER=nvcc -DCORENRN_EXTERNAL_BENCHMARK_DATA=/ccsopen/proj/gen170/neuron/nersc-gpu-hackathon-dec-2021/ -DCMAKE_CXX_FLAGS="" -DCORENRN_ENABLE_UNIT_TESTS=OFF -DCMAKE_CXX_COMPILER=xlc++_r -DCORENRN_NMODL_DIR=/ccsopen/proj/gen170/neuron/nersc-gpu-hackathon-dec-2021/users/kumbhar/nmodl/build/install -DCORENRN_ENABLE_MPI=ON

pramodk avatar Dec 10 '21 00:12 pramodk

I rebased this and tried to resolve the conflicts fairly blindly.

olupton avatar Dec 17 '21 15:12 olupton

Here are some notes from various experiments and debugging attempts to get OpenMP offload working with LLVM 13.0.

  • On a Slack thread, Olli reminded me of the small reproducer from the NERSC Hackathon showing an issue with global variables in a static library:
[olupton@r2i3n6 build_olli_gpu]$ cat test.sh
CXX=clang++
CXXFLAGS="-fopenmp -fopenmp-targets=nvptx64-nvidia-cuda -g"
${CXX} ${CXXFLAGS} -c test.cpp
ar cq libtest.a test.o
${CXX} ${CXXFLAGS} -o test1 main.cpp -L. -ltest
${CXX} ${CXXFLAGS} -o test2 main.cpp test.o
[olupton@r2i3n6 build_olli_gpu]$ cat test.cpp
#pragma omp declare target
int y;
#pragma omp end declare target

int test() {
  y = 24;
  #pragma omp target update to(y)
  y = 42;

  int x;
  #pragma omp target map(from:x)
  {
    x = y;
  }
  return x;
}
[olupton@r2i3n6 build_olli_gpu]$ cat main.cpp
extern int test();

int main() {
  return test();
}

A small reproducer for the problem I am seeing now: going via the static library seems to cause problems :shrug:

[olupton@r2i3n6 build_olli_gpu]$ ./test1
CUDA error: Loading global 'y' Failed
CUDA error: named symbol not found
Libomptarget error: Unable to generate entries table for device id 0.
Libomptarget error: Failed to init globals on device 0
Libomptarget error: Run with LIBOMPTARGET_INFO=4 to dump host-target pointer mappings.
test.cpp:7:3: Libomptarget fatal error 1: failure of target construct while offloading is mandatory
Aborted
[olupton@r2i3n6 build_olli_gpu]$ ./test2; echo $?
24
  • So I switched to a shared library in coreneuron + nrnivmodl-core, i.e. building libcorenrnmech as a shared library. This gave undefined symbol errors for global variables and functions defined in libcoreneuron.a:
nvlink error   : Undefined reference to '_ZN10coreneuron7celsiusE' in '/gpfs/bbp.cscs.ch/ssd/slurmTmpFS/kumbhar/149976/_dimplic-ba4873.cubin'
nvlink error   : Undefined reference to '_ZN10coreneuron2piE' in '/gpfs/bbp.cscs.ch/ssd/slurmTmpFS/kumbhar/149976/_dimplic-ba4873.cubin'
nvlink error   : Undefined reference to '_ZN10coreneuron11secondorderE' in '/gpfs/bbp.cscs.ch/ssd/slurmTmpFS/kumbhar/149976/_dimplic-ba4873.cubin'
nvlink error   : Undefined reference to '_ZN10coreneuron7at_timeEPNS_9NrnThreadEd' in

See also Olli's comment in NERSC GPU Hackathon Slack:

With the Clang/OpenMP build + shared libraries instead of static, I avoided some link errors by removing the annotations in the header

#pragma omp declare target 
extern double celsius;
#pragma omp end declare target 

and the remaining one (at_time function) by making its body available. But now I have a new segfault at startup:

Program received signal SIGSEGV, Segmentation fault.
0x00007fffed92b801 in RTLsTy::RegisterRequires(long) () from /gpfs/bbp.cscs.ch/ssd/apps/bsd/2021-11/stage_externals/install_gcc-11.2.0-skylake/llvm-13.0.0-rrwbmv/lib/libomptarget.so
(gdb) bt
#0  0x00007fffed92b801 in RTLsTy::RegisterRequires(long) () from /gpfs/bbp.cscs.ch/ssd/apps/bsd/2021-11/stage_externals/install_gcc-11.2.0-skylake/llvm-13.0.0-rrwbmv/lib/libomptarget.so
#1  0x00007fffed92824e in __tgt_register_requires () from /gpfs/bbp.cscs.ch/ssd/apps/bsd/2021-11/stage_externals/install_gcc-11.2.0-skylake/llvm-13.0.0-rrwbmv/lib/libomptarget.so
#2  0x00007fffed8f29c3 in _dl_init_internal () from /lib64/ld-linux-x86-64.so.2
#3  0x00007fffed8e417a in _dl_start_user () from /lib64/ld-linux-x86-64.so.2
  Still investigating...
  • Then I avoided building the libcorenrnmech library and tried to use the mechanism object files directly to create special-core (by modifying nrnivmodl-core-makefile). This didn't get far: I saw similar linking errors for global symbols in the libcoreneuron library:
 => Binary creating x86_64/special-core
/gpfs/bbp.cscs.ch/ssd/apps/bsd/2021-11/stage_externals/install_gcc-11.2.0-skylake/llvm-13.0.0-lvcrm6/bin/clang++ -fopenmp=libomp      -std=c++14 -fopenmp -fopenmp-targets=nvptx64-nvidia-cuda -Wno-unknown-cuda-version -I/gpfs/bbp.cscs.ch/ssd/apps/bsd/2021-11/stage_externals/install_gcc-11.2.0-skylake/cuda-11.4.2-yopegt/include  -Wl,--as-needed -fopenmp -fopenmp-targets=nvptx64-nvidia-cuda -Wno-unknown-cuda-version -I/gpfs/bbp.cscs.ch/ssd/apps/bsd/2021-11/stage_externals/install_gcc-11.2.0-skylake/cuda-11.4.2-yopegt/include -o x86_64/special-core /gpfs/bbp.cscs.ch/home/kumbhar/workarena/systems/bbpv/repos/bbp/coreneuron/build/share/coreneuron/coreneuron.cpp \
  -I/gpfs/bbp.cscs.ch/home/kumbhar/workarena/systems/bbpv/repos/bbp/coreneuron/build/include  \
  x86_64/corenrn/build/_mod_func.o x86_64/corenrn/build/_dimplic.o x86_64/corenrn/build/exp2syn.o x86_64/corenrn/build/expsyn.o x86_64/corenrn/build/halfgap.o x86_64/corenrn/build/hh.o x86_64/corenrn/build/netstim.o x86_64/corenrn/build/passive.o x86_64/corenrn/build/pattern.o x86_64/corenrn/build/stim.o x86_64/corenrn/build/svclmp.o  x86_64/corenrn/build/enginemech.o
nvlink error   : Undefined reference to '_ZN10coreneuron7celsiusE' in '/gpfs/bbp.cscs.ch/ssd/slurmTmpFS/kumbhar/149976/_dimplic-ba4873.cubin'
nvlink error   : Undefined reference to '_ZN10coreneuron2piE' in '/gpfs/bbp.cscs.ch/ssd/slurmTmpFS/kumbhar/149976/_dimplic-ba4873.cubin'
nvlink error   : Undefined reference to '_ZN10coreneuron11secondorderE' in '/gpfs/bbp.cscs.ch/ssd/slurmTmpFS/kumbhar/149976/_dimplic-ba4873.cubin'
nvlink error   : Undefined reference to '_ZN10coreneuron7at_timeEPNS_9NrnThreadEd' in '/gpfs/bbp.cscs.ch/ssd/slurmTmpFS/kumbhar/149976/svclmp-852313.cubin'
  • Next, I wanted to avoid building any intermediate libraries, i.e. create the special-core binary from all object files directly. For this, I took the mechanism cpp files generated by nrnivmodl-core and added them into the coreneuron source tree (temporarily), then updated CMake to build special-core directly, i.e. something like (partial diff):
+++ b/coreneuron/CMakeLists.txt
@@ -47,6 +47,13 @@ list(APPEND CORENEURON_CODE_FILES ${PROJECT_BINARY_DIR}/coreneuron/config/config
 set(DIMPLIC_CODE_FILE "mechanism/mech/dimplic.cpp")
 set(ENGINEMECH_CODE_FILE "mechanism/mech/enginemech.cpp")

+file(GLOB CORENEURON_SPECIAL_CORE_FILES "exe/*.cpp")
+list(APPEND CORENEURON_SPECIAL_CORE_FILES "${CMAKE_CURRENT_SOURCE_DIR}/apps/coreneuron.cpp")
+list(APPEND CORENEURON_SPECIAL_CORE_FILES "${DIMPLIC_CODE_FILE}")
+list(APPEND CORENEURON_SPECIAL_CORE_FILES "${ENGINEMECH_CODE_FILE}")
+
 # for external mod files we need to generate modl_ref function in mod_func.c
 set(MODFUNC_PERL_SCRIPT "mechanism/mech/mod_func.c.pl")

@@ -184,6 +191,8 @@ if(CORENRN_ENABLE_MPI AND NOT CORENRN_ENABLE_MPI_DYNAMIC)
   set(CORENRN_MPI_OBJ $<TARGET_OBJECTS:${CORENRN_MPI_LIB_NAME}>)
 endif()

+set(COMPILE_LIBRARY_TYPE OBJECT)
+
 # main coreneuron library
 add_library(
   coreneuron
@@ -319,7 +328,7 @@ add_custom_command(
           "${modfile_directory}"
   WORKING_DIRECTORY ${CMAKE_BINARY_DIR}/bin
   COMMENT "Running nrnivmodl-core with halfgap.mod")
-add_custom_target(nrniv-core ALL DEPENDS ${output_binaries})
+#add_custom_target(nrniv-core ALL DEPENDS ${output_binaries})

 include_directories(${CORENEURON_PROJECT_SOURCE_DIR})

@@ -358,6 +367,12 @@ configure_file("utils/profile/profiler_interface.h"
 # main program required for building special-core
 file(COPY apps/coreneuron.cpp DESTINATION ${CMAKE_BINARY_DIR}/share/coreneuron)

+add_executable(special-core-gpu ${CORENEURON_SPECIAL_CORE_FILES})
+target_compile_options(special-core-gpu BEFORE PRIVATE $<$<COMPILE_LANGUAGE:CXX>:${CORENRN_ACC_FLAGS}>)
+target_compile_definitions(special-core-gpu PUBLIC -DADDITIONAL_MECHS)
+target_link_libraries(special-core-gpu coreneuron ${CMAKE_DL_LIBS})

This created the executable from all object files:

[ 87%] Linking CXX executable ../bin/special-core-gpu
cd /gpfs/bbp.cscs.ch/home/kumbhar/workarena/systems/bbpv/repos/bbp/coreneuron/build/coreneuron && /gpfs/bbp.cscs.ch/ssd/apps/bsd/2021-11/stage_externals/install_gcc-11.2.0-skylake/cmake-3.21.4-cdyb7k/bin/cmake -E cmake_link_script CMakeFiles/special-core-gpu.dir/link.txt --verbose=1
/gpfs/bbp.cscs.ch/ssd/apps/bsd/2021-11/stage_externals/install_gcc-11.2.0-skylake/llvm-13.0.0-klsplt/bin/clang++  -fopenmp=libomp   -Wl,--as-needed -fopenmp=libomp -fopenmp-targets=nvptx64-nvidia-cuda -Wno-unknown-cuda-version -I/gpfs/bbp.cscs.ch/ssd/apps/bsd/2021-11/stage_externals/install_gcc-11.2.0-skylake/cuda-11.4.2-yopegt/include CMakeFiles/special-core-gpu.dir/exe/_mod_func.cpp.o CMakeFiles/special-core-gpu.dir/exe/exp2syn.cpp.o CMakeFiles/special-core-gpu.dir/exe/expsyn.cpp.o CMakeFiles/special-core-gpu.dir/exe/halfgap.cpp.o CMakeFiles/special-core-gpu.dir/exe/hh.cpp.o CMakeFiles/special-core-gpu.dir/exe/netstim.cpp.o CMakeFiles/special-core-gpu.dir/exe/passive.cpp.o CMakeFiles/special-core-gpu.dir/exe/pattern.cpp.o CMakeFiles/special-core-gpu.dir/exe/stim.cpp.o CMakeFiles/special-core-gpu.dir/exe/svclmp.cpp.o CMakeFiles/special-core-gpu.dir/apps/coreneuron.cpp.o CMakeFiles/special-core-gpu.dir/mechanism/mech/dimplic.cpp.o CMakeFiles/special-core-gpu.dir/mechanism/mech/enginemech.cpp.o CMakeFiles/coreneuron.dir/apps/corenrn_parameters.cpp.o CMakeFiles/coreneuron.dir/apps/main1.cpp.o CMakeFiles/coreneuron.dir/gpu/nrn_acc_manager.cpp.o CMakeFiles/coreneuron.dir/io/core2nrn_data_return.cpp.o CMakeFiles/coreneuron.dir/io/file_utils.cpp.o CMakeFiles/coreneuron.dir/io/global_vars.cpp.o CMakeFiles/coreneuron.dir/io/lfp.cpp.o CMakeFiles/coreneuron.dir/io/mech_report.cpp.o CMakeFiles/coreneuron.dir/io/mem_layout_util.cpp.o CMakeFiles/coreneuron.dir/io/mk_mech.cpp.o CMakeFiles/coreneuron.dir/io/nrn2core_data_init.cpp.o CMakeFiles/coreneuron.dir/io/nrn_checkpoint.cpp.o CMakeFiles/coreneuron.dir/io/nrn_filehandler.cpp.o CMakeFiles/coreneuron.dir/io/nrn_setup.cpp.o CMakeFiles/coreneuron.dir/io/output_spikes.cpp.o CMakeFiles/coreneuron.dir/io/phase1.cpp.o CMakeFiles/coreneuron.dir/io/phase2.cpp.o CMakeFiles/coreneuron.dir/io/prcellstate.cpp.o CMakeFiles/coreneuron.dir/io/reports/binary_report_handler.cpp.o CMakeFiles/coreneuron.dir/io/reports/nrnreport.cpp.o 
CMakeFiles/coreneuron.dir/io/reports/report_configuration_parser.cpp.o CMakeFiles/coreneuron.dir/io/reports/report_event.cpp.o CMakeFiles/coreneuron.dir/io/reports/report_handler.cpp.o CMakeFiles/coreneuron.dir/io/reports/sonata_report_handler.cpp.o CMakeFiles/coreneuron.dir/io/setup_fornetcon.cpp.o CMakeFiles/coreneuron.dir/mechanism/capac.cpp.o CMakeFiles/coreneuron.dir/mechanism/eion.cpp.o CMakeFiles/coreneuron.dir/mechanism/mech_mapping.cpp.o CMakeFiles/coreneuron.dir/mechanism/patternstim.cpp.o CMakeFiles/coreneuron.dir/mechanism/register_mech.cpp.o CMakeFiles/coreneuron.dir/network/cvodestb.cpp.o CMakeFiles/coreneuron.dir/network/multisend.cpp.o CMakeFiles/coreneuron.dir/network/multisend_setup.cpp.o CMakeFiles/coreneuron.dir/network/netcvode.cpp.o CMakeFiles/coreneuron.dir/network/netpar.cpp.o CMakeFiles/coreneuron.dir/network/partrans.cpp.o CMakeFiles/coreneuron.dir/network/partrans_setup.cpp.o CMakeFiles/coreneuron.dir/network/tqueue.cpp.o CMakeFiles/coreneuron.dir/permute/balance.cpp.o CMakeFiles/coreneuron.dir/permute/cellorder.cpp.o CMakeFiles/coreneuron.dir/permute/cellorder1.cpp.o CMakeFiles/coreneuron.dir/permute/cellorder2.cpp.o CMakeFiles/coreneuron.dir/permute/data_layout.cpp.o CMakeFiles/coreneuron.dir/permute/node_permute.cpp.o CMakeFiles/coreneuron.dir/sim/fadvance_core.cpp.o CMakeFiles/coreneuron.dir/sim/fast_imem.cpp.o CMakeFiles/coreneuron.dir/sim/finitialize.cpp.o CMakeFiles/coreneuron.dir/sim/multicore.cpp.o CMakeFiles/coreneuron.dir/sim/solve_core.cpp.o CMakeFiles/coreneuron.dir/sim/treeset_core.cpp.o CMakeFiles/coreneuron.dir/utils/ispc/globals.cpp.o CMakeFiles/coreneuron.dir/utils/ivocvect.cpp.o CMakeFiles/coreneuron.dir/utils/lpt.cpp.o CMakeFiles/coreneuron.dir/utils/memory.cpp.o CMakeFiles/coreneuron.dir/utils/memory_utils.cpp.o CMakeFiles/coreneuron.dir/utils/nrn_stats.cpp.o CMakeFiles/coreneuron.dir/utils/nrnoc_aux.cpp.o CMakeFiles/coreneuron.dir/utils/nrntimeout.cpp.o CMakeFiles/coreneuron.dir/utils/progressbar/progressbar.cpp.o 
CMakeFiles/coreneuron.dir/utils/randoms/nrnran123.cpp.o CMakeFiles/coreneuron.dir/utils/string_utils.cpp.o CMakeFiles/coreneuron.dir/utils/utils.cpp.o CMakeFiles/coreneuron.dir/utils/vrecord.cpp.o CMakeFiles/coreneuron.dir/config/config.cpp.o CMakeFiles/coreneuron.dir/mpi/core/nrnmpi_def_cinc.cpp.o CMakeFiles/coreneuron.dir/mpi/core/nrnmpi.cpp.o CMakeFiles/coreneuron.dir/mpi/core/nrnmpidec.cpp.o -o ../bin/special-core-gpu  -Wl,-rpath,/gpfs/bbp.cscs.ch/ssd/apps/bsd/2021-11/stage_externals/install_gcc-11.2.0-skylake/cuda-11.4.2-yopegt/lib64 -ldl /gpfs/bbp.cscs.ch/ssd/apps/bsd/2021-11/stage_externals/install_gcc-11.2.0-skylake/cuda-11.4.2-yopegt/lib64/libcudart.so

Unfortunately, this still gives a similar cryptic error at launch, for both CPU and GPU execution:

kumbhar@r2i3n6:~/workarena/systems/bbpv/repos/bbp/coreneuron/build$ ./bin/special-core-gpu -e 1 -d ../tests/integration/ring

 Duke, Yale, and the BlueBrain Project -- Copyright 1984-2020
 Version : 1.0 4de7951f (2022-01-04 16:29:44 +0100)

CUDA error: Error returned from cuModuleLoadDataEx
CUDA error: out of memory
Libomptarget error: Unable to generate entries table for device id 0.
Libomptarget error: Failed to init globals on device 0
Libomptarget error: Run with LIBOMPTARGET_INFO=4 to dump host-target pointer mappings.
Libomptarget error: Source location information not present. Compile with -g or -gline-tables-only.
Libomptarget fatal error 1: failure of target construct while offloading is mandatory
Aborted
kumbhar@r2i3n6:~/workarena/systems/bbpv/repos/bbp/coreneuron/build$ ./bin/special-core-gpu -e 1 -d ../tests/integration/ring  --gpu
 Info : 4 GPUs shared by 1 ranks per node

 Duke, Yale, and the BlueBrain Project -- Copyright 1984-2020
 Version : 1.0 4de7951f (2022-01-04 16:29:44 +0100)

CUDA error: Error returned from cuModuleLoadDataEx
CUDA error: out of memory
Libomptarget error: Unable to generate entries table for device id 0.
Libomptarget error: Failed to init globals on device 0
Libomptarget error: Run with LIBOMPTARGET_INFO=4 to dump host-target pointer mappings.
Libomptarget error: Source location information not present. Compile with -g or -gline-tables-only.
Libomptarget fatal error 1: failure of target construct while offloading is mandatory
Aborted

I saw one such error reported here but that doesn't seem relevant.

  • Next, I wanted to incrementally enable the OpenMP offload parts and find out what causes the above error. The first thing I did was disable the OpenMP offload pragmas in the mechanism cpp files, i.e. simply something like:
cd coreneuron/exe         # this directory contains exp2syn.cpp  expsyn.cpp  halfgap.cpp  hh.cpp  _kinderiv.h  _mod_func.cpp  netstim.cpp  passive.cpp  pattern.cpp  stim.cpp  svclmp.cpp
sed -i 's#nrn_pragma_omp#//nrn_pragma_omp#g' *.cpp

After rebuilding, the special-core-gpu binary ran fine on both CPU and GPU! So the issue seems to be related to the mechanism cpp files.

$ nvprof ./bin/special-core-gpu -e 1 -d ../tests/integration/ring  --gpu
==112850== NVPROF is profiling process 112850, command: ./bin/special-core-gpu -e 1 -d ../tests/integration/ring --gpu
 Info : 4 GPUs shared by 1 ranks per node

 Duke, Yale, and the BlueBrain Project -- Copyright 1984-2020
 Version : 1.0 4de7951f (2022-01-04 16:29:44 +0100)

 Additional mechanisms from files
 exp2syn.mod expsyn.mod halfgap.mod hh.mod netstim.mod passive.mod pattern.mod stim.mod svclmp.mod

Exp2Syn Reg
 Memory (MBs) :             After mk_mech : Max 298.0469, Min 298.0469, Avg 298.0469
 Memory (MBs) :            After MPI_Init : Max 296.1094, Min 296.1094, Avg 296.1094
 Memory (MBs) :          Before nrn_setup : Max 296.1406, Min 296.1406, Avg 296.1406
 WARNING : GPU execution requires --cell-permute type 1 or 2. Setting it to 1.
 Setup Done   : 0.00 seconds
 Model size   : 84.19 kB
 Memory (MBs) :          After nrn_setup  : Max 296.4258, Min 296.4258, Avg 296.4258
GENERAL PARAMETERS
--mpi=false
--mpi-lib=
--gpu=true
--dt=0.025
--tstop=1

GPU
--nwarp=65536
--cell-permute=0
--cuda-interface=false

INPUT PARAMETERS
--voltage=-65
--seed=-1
--datpath=../tests/integration/ring
--filesdat=files.dat
--pattern=
--report-conf=
--restore=

PARALLEL COMPUTATION PARAMETERS
--threading=false
--skip_mpi_finalize=false

SPIKE EXCHANGE
--ms_phases=2
--ms_subintervals=2
--multisend=false
--spk_compress=0
--binqueue=false

CONFIGURATION
--spikebuf=100000
--prcellgid=-1
--forwardskip=0
--celsius=6.3
--mindelay=1
--report-buffer-size=4

OUTPUT PARAMETERS
--dt_io=0.1
--outpath=.
--checkpoint=

 Start time (t) = 0

 Memory (MBs) :  After mk_spikevec_buffer : Max 296.4258, Min 296.4258, Avg 296.4258
....
  Memory (MBs) :     After nrn_finitialize : Max 301.5273, Min 301.5273, Avg 301.5273

psolve |=========================================================| t: 1.00   ETA: 0h00m00s

Solver Time : 0.24223


 Simulation Statistics
 Number of cells: 20
 Number of compartments: 804
 Number of presyns: 21
 Number of input presyns: 0
 Number of synapses: 21
 Number of point processes: 41
 Number of transfer sources: 0
 Number of transfer targets: 0
 Number of spikes: 40
 Number of spikes with non negative gid-s: 40
==112850== Profiling application: ./bin/special-core-gpu -e 1 -d ../tests/integration/ring --gpu
==112850== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:   11.68%  20.892ms       160  130.57us  126.98us  134.59us  __omp_offloading_2f_188133fe__ZN10coreneuron11nrn_cur_ionEPNS_9NrnThreadEPNS_9Memb_listEi_l273
                    9.64%  17.250ms        80  215.63us  186.62us  222.72us  __omp_offloading_2f_2979dc5__ZN10coreneuron12nrn_state_hhEPNS_9NrnThreadEPNS_9Memb_listEi_l520
                    8.98%  16.057ms       160  100.35us  99.007us  101.92us  __omp_offloading_2f_12055abf__ZN10coreneuron22net_buf_receive_ExpSynEPNS_9NrnThreadE_l290
                    7.72%  13.816ms        80  172.70us  169.18us  174.59us  __omp_offloading_2f_2979dc5__ZN10coreneuron10nrn_cur_hhEPNS_9NrnThreadEPNS_9Memb_listEi_l472
                    6.87%  12.293ms        80  153.66us  152.13us  154.94us  __omp_offloading_2f_12055abf__ZN10coreneuron14nrn_cur_ExpSynEPNS_9NrnThreadEPNS_9Memb_listEi_l383
                    6.82%  12.200ms        80  152.50us  150.37us  154.30us  __omp_offloading_2f_3962f8e__ZN10coreneuron11nrn_cur_pasEPNS_9NrnThreadEPNS_9Memb_listEi_l276
                    6.07%  10.865ms        80  135.81us  135.23us  136.58us  __omp_offloading_2f_12055abf__ZN10coreneuron16nrn_state_ExpSynEPNS_9NrnThreadEPNS_9Memb_listEi_l426
                    6.05%  10.826ms        80  135.33us  132.64us  136.80us  __omp_offloading_2f_191507f3__ZN10coreneuronL7nrn_lhsEPNS_9NrnThreadE_l166
                    6.04%  10.806ms        80  135.07us  133.41us  137.73us  __omp_offloading_2f_191507f3__ZN10coreneuronL7nrn_rhsEPNS_9NrnThreadE_l96
                    6.01%  10.746ms        80  134.32us  133.73us  135.33us  __omp_offloading_2f_198f6a66__ZN10coreneuron21nrn_jacob_capacitanceEPNS_9NrnThreadEPNS_9Memb_listEi_l74
                    5.94%  10.621ms        80  132.76us  132.35us  136.48us  __omp_offloading_2f_191507c2__ZN10coreneuron8NetCvode12check_threshEPNS_9NrnThreadE_l541
                    5.92%  10.589ms        80  132.36us  131.52us  133.31us  __omp_offloading_2f_198f6a66__ZN10coreneuron19nrn_cur_capacitanceEPNS_9NrnThreadEPNS_9Memb_listEi_l120
                    5.84%  10.446ms        80  130.57us  127.07us  131.42us  __omp_offloading_2f_191507f3__ZN10coreneuronL7nrn_rhsEPNS_9NrnThreadE_l37
                    5.79%  10.358ms        80  129.48us  128.90us  132.19us  __omp_offloading_2f_1d8e98f0__ZN10coreneuron6updateEPNS_9NrnThreadE_l217
                    0.33%  594.30us         6  99.050us  2.4320us  446.46us  [CUDA memset]
                    0.18%  318.05us       246  1.2920us  1.2470us  1.7920us  [CUDA memcpy HtoD]
                    0.10%  185.74us       125  1.4850us  1.3750us  2.9440us  [CUDA memcpy DtoH]
      API calls:   38.95%  214.73ms         1  214.73ms  214.73ms  214.73ms  cuDevicePrimaryCtxRetain
                   38.43%  211.84ms      1450  146.09us     756ns  275.66us  cuStreamSynchronize
                    9.61%  52.977ms         1  52.977ms  52.977ms  52.977ms  cuModuleLoadDataEx
                    4.59%  25.275ms         1  25.275ms  25.275ms  25.275ms  cuModuleUnload
                    3.69%  20.328ms         6  3.3880ms  8.9020us  20.278ms  cudaMallocManaged
                    2.10%  11.594ms        84  138.03us  11.220us  146.78us  cuMemcpyDtoHAsync
                    1.17%  6.4369ms      1280  5.0280us  4.2990us  17.611us  cuLaunchKernel
                    0.30%  1.6383ms       407  4.0250us     132ns  194.16us  cuDeviceGetAttribute
                    0.28%  1.5530ms         4  388.24us  385.38us  395.38us  cuDeviceTotalMem
                    0.26%  1.4583ms      1619     900ns     161ns  581.21us  cuCtxSetCurrent
                    0.19%  1.0351ms       246  4.2070us  3.6880us  13.495us  cuMemcpyHtoDAsync
                    0.15%  844.22us         6  140.70us  26.971us  516.61us  cudaMemset
                    0.11%  591.25us        41  14.420us  11.523us  44.503us  cuMemcpyDtoH
                    0.07%  407.33us        32  12.729us  1.7660us  190.68us  cuStreamCreate
                    0.03%  172.74us         4  43.183us  36.483us  57.844us  cuDeviceGetName
                    0.02%  125.94us        32  3.9350us  2.2320us  27.967us  cuStreamDestroy
                    0.02%  95.955us       373     257ns     144ns  4.1180us  cuGetProcAddress
                    0.01%  44.318us        54     820ns     292ns  5.4780us  cuModuleGetGlobal
                    0.01%  30.319us        41     739ns     377ns  4.3020us  cuModuleGetFunction
                    0.00%  16.994us         1  16.994us  16.994us  16.994us  cuMemAlloc
                    0.00%  13.340us         1  13.340us  13.340us  13.340us  cudaSetDevice
                    0.00%  12.246us         4  3.0610us     985ns  8.0850us  cuDeviceGetPCIBusId
                    0.00%  5.9690us         9     663ns     141ns  3.1970us  cuDeviceGet
                    0.00%  2.8740us         1  2.8740us  2.8740us  2.8740us  cuDevicePrimaryCtxGetState
                    0.00%  2.7800us         2  1.3900us     201ns  2.5790us  cuCtxGetLimit
                    0.00%  1.7460us         5     349ns     256ns     595ns  cuFuncGetAttribute
                    0.00%  1.1300us         4     282ns     202ns     443ns  cuDeviceGetCount
                    0.00%     946ns         4     236ns     161ns     322ns  cuDeviceGetUuid
                    0.00%     895ns         1     895ns     895ns     895ns  cuInit
                    0.00%     737ns         1     737ns     737ns     737ns  cuDevicePrimaryCtxSetFlags
                    0.00%     591ns         1     591ns     591ns     591ns  cuCtxGetDevice
                    0.00%     371ns         1     371ns     371ns     371ns  cuDevicePrimaryCtxRelease
                    0.00%     217ns         1     217ns     217ns     217ns  cuDriverGetVersion

==112850== Unified Memory profiling result:
Device "Tesla V100-SXM2-16GB (0)"
   Count  Avg Size  Min Size  Max Size  Total Size  Total Time  Name
       4  32.000KB  4.0000KB  60.000KB  128.0000KB  20.64000us  Host To Device
       6  32.000KB  4.0000KB  60.000KB  192.0000KB  21.34400us  Device To Host
       3         -         -         -           -  575.5490us  Gpu page fault groups
      16  4.0000KB  4.0000KB  4.0000KB  64.00000KB           -  Memory thrashes
Total CPU Page faults: 3
Total CPU thrashes: 16
  • Then, I was able to isolate the issue to a single file, exp2syn.cpp: if I comment out all OpenMP offload pragmas in exp2syn.cpp, then the special-core-gpu binary works fine!

The curious part about exp2syn.cpp is that this mechanism is not used in the ringtest model at all:

$ nvprof ./bin/special-core-gpu -e 1 -d ../tests/integration/ring  --gpu --model-stats
==112982== NVPROF is profiling process 112982, command: ./bin/special-core-gpu -e 1 -d ../tests/integration/ring --gpu --model-stats
 Info : 4 GPUs shared by 1 ranks per node

 Duke, Yale, and the BlueBrain Project -- Copyright 1984-2020
 Version : 1.0 4de7951f (2022-01-04 16:29:44 +0100)

 Additional mechanisms from files
 exp2syn.mod expsyn.mod halfgap.mod hh.mod netstim.mod passive.mod pattern.mod stim.mod svclmp.mod

 Memory (MBs) :             After mk_mech : Max 298.0469, Min 298.0469, Avg 298.0469
 Memory (MBs) :            After MPI_Init : Max 296.1094, Min 296.1094, Avg 296.1094
 Memory (MBs) :          Before nrn_setup : Max 296.1406, Min 296.1406, Avg 296.1406
 WARNING : GPU execution requires --cell-permute type 1 or 2. Setting it to 1.

================ MECHANISMS COUNT BY TYPE ==================
  Id                 Name      Count
   0               (null)          0
   1               (null)          0
   2           morphology          0
   3          capacitance        392
   4                  pas        372
   5        extracellular          0
   6              fastpas          0
   7               IClamp          0
   8         AlphaSynapse          0
   9               ExpSyn         40
  10              Exp2Syn          0
  11              SEClamp          0
  12               VClamp          0
  13               OClamp          0
  14              APCount          0
  15               na_ion         20
  16                k_ion         20
  17                   hh         20
  18              NetStim          1
  19             IntFire1          0
  20             IntFire2          0
  21             IntFire4          0
  22     PointProcessMark          0
  23          PatternStim          0
  24              HalfGap          0

You can see that the Exp2Syn count is 0!

  • If I enable a single OpenMP pragma in exp2syn.cpp, e.g. in the function nrn_state_Exp2Syn() (which is never executed!), then I still get the error below:
$ nvprof ./bin/special-core-gpu -e 1 -d ../tests/integration/ring  --gpu --model-stats
==113465== NVPROF is profiling process 113465, command: ./bin/special-core-gpu -e 1 -d ../tests/integration/ring --gpu --model-stats
 Info : 4 GPUs shared by 1 ranks per node

 Duke, Yale, and the BlueBrain Project -- Copyright 1984-2020
 Version : 1.0 4de7951f (2022-01-04 16:29:44 +0100)

CUDA error: Error returned from cuModuleLoadDataEx
CUDA error: out of memory
Libomptarget error: Unable to generate entries table for device id 0.
Libomptarget error: Failed to init globals on device 0
Libomptarget error: Run with LIBOMPTARGET_INFO=4 to dump host-target pointer mappings.
Libomptarget error: Source location information not present. Compile with -g or -gline-tables-only.
Libomptarget fatal error 1: failure of target construct while offloading is mandatory
  • During this experimentation I realised that the CoreNEURON CMake setup doesn't set any optimisation flags when the LLVM compiler is used (see https://github.com/BlueBrain/CoreNeuron/pull/734/files). So I compiled exp2syn.cpp with the optimisation flag -O1, and the error disappeared!
kumbhar@r2i3n6:~/workarena/systems/bbpv/repos/bbp/coreneuron/build/coreneuron$ rm ../bin/special-core-gpu
kumbhar@r2i3n6:~/workarena/systems/bbpv/repos/bbp/coreneuron/build/coreneuron$ /gpfs/bbp.cscs.ch/ssd/apps/bsd/2021-11/stage_externals/install_gcc-11.2.0-skylake/llvm-13.0.0-klsplt/bin/clang++ -DADDITIONAL_MECHS -DCORENEURON_BUILD -DCORENEURON_CUDA_PROFILING -DCORENEURON_ENABLE_GPU -DCORENEURON_PREFER_OPENMP_OFFLOAD -DDISABLE_HOC_EXP -DENABLE_SPLAYTREE_QUEUING -DLAYOUT=0 -DNRNMPI=0 -DNRN_MULTISEND=0 -I/gpfs/bbp.cscs.ch/ssd/apps/bsd/2021-11/stage_externals/install_gcc-11.2.0-skylake/cuda-11.4.2-yopegt/targets/x86_64-linux/include -I/gpfs/bbp.cscs.ch/home/kumbhar/workarena/systems/bbpv/repos/bbp/coreneuron/build/include -I/gpfs/bbp.cscs.ch/home/kumbhar/workarena/systems/bbpv/repos/bbp/coreneuron/coreneuron/utils/randoms -I/gpfs/bbp.cscs.ch/home/kumbhar/workarena/systems/bbpv/repos/bbp/coreneuron -I/gpfs/bbp.cscs.ch/home/kumbhar/workarena/systems/bbpv/repos/bbp/coreneuron/build/coreneuron -isystem /gpfs/bbp.cscs.ch/ssd/apps/bsd/2021-11/stage_externals/install_gcc-11.2.0-skylake/cuda-11.4.2-yopegt/include -fopenmp=libomp   -fopenmp=libomp -fopenmp-targets=nvptx64-nvidia-cuda -Wno-unknown-cuda-version -I/gpfs/bbp.cscs.ch/ssd/apps/bsd/2021-11/stage_externals/install_gcc-11.2.0-skylake/cuda-11.4.2-yopegt/include -std=c++14 -MD -MT coreneuron/CMakeFiles/special-core-gpu.dir/exe/exp2syn.cpp.o -MF CMakeFiles/special-core-gpu.dir/exe/exp2syn.cpp.o.d -o CMakeFiles/special-core-gpu.dir/exe/exp2syn.cpp.o -c /gpfs/bbp.cscs.ch/home/kumbhar/workarena/systems/bbpv/repos/bbp/coreneuron/coreneuron/exe/exp2syn.cpp -O1
kumbhar@r2i3n6:~/workarena/systems/bbpv/repos/bbp/coreneuron/build/coreneuron$ make
[  1%] Built target pyembed
[  2%] Built target fmt
[  3%] Built target pywrapper
[  3%] Built target pyastgen
[ 18%] Built target lexer_obj
[ 18%] Built target lexer
[ 21%] Built target util_obj
[ 22%] Built target util
[ 38%] Built target visitor_obj
[ 39%] Built target visitor
[ 43%] Built target codegen
[ 45%] Built target printer_obj
[ 45%] Built target printer
[ 46%] Built target symtab_obj
[ 47%] Built target symtab
[ 48%] Built target nmodl
[ 48%] Built target nrnivmodl-core
[ 50%] Built target kin_deriv_header
[ 83%] Built target coreneuron
[ 86%] Built target scopmath
Consolidate compiler generated dependencies of target special-core-gpu
[ 87%] Linking CXX executable ../bin/special-core-gpu
[ 93%] Built target special-core-gpu
[100%] Built target coreneuron-copy-nrnivmodl-core-dependencies
kumbhar@r2i3n6:~/workarena/systems/bbpv/repos/bbp/coreneuron/build/coreneuron$ nvprof ../bin/special-core-gpu -e 1 -d ../../tests/integration/ring  --gpu --model-stats
==115317== NVPROF is profiling process 115317, command: ../bin/special-core-gpu -e 1 -d ../../tests/integration/ring --gpu --model-stats
 Info : 4 GPUs shared by 1 ranks per node

 Duke, Yale, and the BlueBrain Project -- Copyright 1984-2020
 Version : 1.0 4de7951f (2022-01-04 16:29:44 +0100)

 Additional mechanisms from files
 exp2syn.mod expsyn.mod halfgap.mod hh.mod netstim.mod passive.mod pattern.mod stim.mod svclmp.mod

Exp2Syn Reg
 Memory (MBs) :             After mk_mech : Max 297.8047, Min 297.8047, Avg 297.8047
 Memory (MBs) :            After MPI_Init : Max 296.2422, Min 296.2422, Avg 296.2422
 Memory (MBs) :          Before nrn_setup : Max 296.2734, Min 296.2734, Avg 296.2734
 WARNING : GPU execution requires --cell-permute type 1 or 2. Setting it to 1.

....
  Memory (MBs) :     After nrn_finitialize : Max 301.6562, Min 301.6562, Avg 301.6562

psolve |=========================================================| t: 1.00   ETA: 0h00m01s

Solver Time : 0.253621


 Simulation Statistics
 Number of cells: 20
 Number of compartments: 804
 Number of presyns: 21
 Number of input presyns: 0
 Number of synapses: 21
 Number of point processes: 41
 Number of transfer sources: 0
 Number of transfer targets: 0
 Number of spikes: 40
 Number of spikes with non negative gid-s: 40
==115317== Profiling application: ../bin/special-core-gpu -e 1 -d ../../tests/integration/ring --gpu --model-stats
==115317== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:   11.69%  21.776ms       160  136.10us  127.74us  143.14us  __omp_offloading_2f_188133fe__ZN10coreneuron11nrn_cur_ionEPNS_9NrnThreadEPNS_9Memb_listEi_l273
                    9.66%  17.998ms        80  224.98us  198.98us  236.25us  __omp_offloading_2f_2979dc5__ZN10coreneuron12nrn_state_hhEPNS_9NrnThreadEPNS_9Memb_listEi_l520
                    8.98%  16.731ms       160  104.57us  99.968us  108.10us  __omp_offloading_2f_12055abf__ZN10coreneuron22net_buf_receive_ExpSynEPNS_9NrnThreadE_l290
                    7.82%  14.565ms        80  182.06us  176.06us  187.84us  __omp_offloading_2f_2979dc5__ZN10coreneuron10nrn_cur_hhEPNS_9NrnThreadEPNS_9Memb_listEi_l472
                    6.86%  12.780ms        80  159.75us  153.70us  163.58us  __omp_offloading_2f_12055abf__ZN10coreneuron14nrn_cur_ExpSynEPNS_9NrnThreadEPNS_9Memb_listEi_l383
                    6.82%  12.697ms        80  158.71us  152.13us  163.20us  __omp_offloading_2f_3962f8e__ZN10coreneuron11nrn_cur_pasEPNS_9NrnThreadEPNS_9Memb_listEi_l276
                    6.04%  11.251ms        80  140.64us  135.58us  144.13us  __omp_offloading_2f_12055abf__ZN10coreneuron16nrn_state_ExpSynEPNS_9NrnThreadEPNS_9Memb_listEi_l426
                    6.04%  11.250ms        80  140.62us  134.53us  146.59us  __omp_offloading_2f_191507f3__ZN10coreneuronL7nrn_rhsEPNS_9NrnThreadE_l96
                    6.03%  11.240ms        80  140.50us  135.33us  143.87us  __omp_offloading_2f_198f6a66__ZN10coreneuron21nrn_jacob_capacitanceEPNS_9NrnThreadEPNS_9Memb_listEi_l74
                    6.01%  11.203ms        80  140.04us  133.09us  145.22us  __omp_offloading_2f_191507f3__ZN10coreneuronL7nrn_lhsEPNS_9NrnThreadE_l166
                    5.95%  11.076ms        80  138.45us  133.34us  144.96us  __omp_offloading_2f_191507c2__ZN10coreneuron8NetCvode12check_threshEPNS_9NrnThreadE_l541
                    5.92%  11.035ms        80  137.94us  132.42us  141.38us  __omp_offloading_2f_198f6a66__ZN10coreneuron19nrn_cur_capacitanceEPNS_9NrnThreadEPNS_9Memb_listEi_l120
                    5.80%  10.795ms        80  134.94us  130.05us  139.90us  __omp_offloading_2f_1d8e98f0__ZN10coreneuron6updateEPNS_9NrnThreadE_l217
                    5.77%  10.751ms        80  134.39us  129.34us  137.79us  __omp_offloading_2f_191507f3__ZN10coreneuronL7nrn_rhsEPNS_9NrnThreadE_l37
                    0.32%  598.59us         6  99.765us  2.7200us  447.39us  [CUDA memset]
                    0.17%  323.17us       246  1.3130us  1.2470us  1.8240us  [CUDA memcpy HtoD]
                    0.11%  195.62us       126  1.5520us  1.3760us  3.0090us  [CUDA memcpy DtoH]
      API calls:   39.19%  222.11ms      1450  153.18us     811ns  3.8766ms  cuStreamSynchronize
                   38.77%  219.72ms         1  219.72ms  219.72ms  219.72ms  cuDevicePrimaryCtxRetain
                    9.24%  52.366ms         1  52.366ms  52.366ms  52.366ms  cuModuleLoadDataEx
                    4.44%  25.182ms         1  25.182ms  25.182ms  25.182ms  cuModuleUnload
                    3.59%  20.330ms         6  3.3883ms  9.0800us  20.277ms  cudaMallocManaged
                    2.13%  12.063ms        84  143.60us  11.176us  157.59us  cuMemcpyDtoHAsync
                    1.20%  6.7851ms      1280  5.3000us  4.5340us  19.189us  cuLaunchKernel
                    0.29%  1.6442ms       407  4.0390us     133ns  187.16us  cuDeviceGetAttribute
                    0.28%  1.5656ms         4  391.41us  387.70us  395.00us  cuDeviceTotalMem
                    0.25%  1.4421ms      1619     890ns     159ns  579.67us  cuCtxSetCurrent
                    0.19%  1.0798ms       246  4.3890us  3.7420us  17.665us  cuMemcpyHtoDAsync
                    0.15%  853.90us         6  142.32us  27.310us  519.82us  cudaMemset
                    0.11%  647.30us        42  15.411us  11.264us  49.087us  cuMemcpyDtoH
                    0.07%  412.45us        32  12.889us  1.7620us  191.92us  cuStreamCreate
                    0.03%  168.17us         4  42.041us  36.633us  56.245us  cuDeviceGetName
                    0.02%  112.19us        32  3.5050us  2.1360us  21.674us  cuStreamDestroy
                    0.02%  108.93us       373     292ns     138ns  4.2190us  cuGetProcAddress
                    0.01%  44.237us        55     804ns     401ns  3.3740us  cuModuleGetGlobal
                    0.01%  34.313us        42     816ns     354ns  6.8130us  cuModuleGetFunction
                    0.00%  14.817us         1  14.817us  14.817us  14.817us  cudaSetDevice
                    0.00%  13.742us         1  13.742us  13.742us  13.742us  cuMemAlloc
                    0.00%  11.865us         4  2.9660us     937ns  7.8230us  cuDeviceGetPCIBusId
                    0.00%  7.1150us         9     790ns     142ns  4.0720us  cuDeviceGet
                    0.00%  2.6710us         1  2.6710us  2.6710us  2.6710us  cuDevicePrimaryCtxGetState
                    0.00%  2.5190us         2  1.2590us     224ns  2.2950us  cuCtxGetLimit
                    0.00%  2.2450us         5     449ns     231ns     744ns  cuFuncGetAttribute
                    0.00%  1.2930us         4     323ns     180ns     528ns  cuDeviceGetCount
                    0.00%  1.1640us         1  1.1640us  1.1640us  1.1640us  cuInit
                    0.00%     984ns         4     246ns     162ns     333ns  cuDeviceGetUuid
                    0.00%     674ns         1     674ns     674ns     674ns  cuDevicePrimaryCtxSetFlags
                    0.00%     502ns         1     502ns     502ns     502ns  cuCtxGetDevice
                    0.00%     359ns         1     359ns     359ns     359ns  cuDevicePrimaryCtxRelease
                    0.00%     215ns         1     215ns     215ns     215ns  cuDriverGetVersion
..

So, using #734, I am able to avoid the CUDA error: out of memory failure.
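Until #734 lands, a configure-time workaround is to pass an optimisation level explicitly so that clang++ does not compile the generated mechanism files at -O0. This is a sketch: the exact set of other configure options depends on your build, and -O1 is simply the level that worked in the experiment above, not necessarily the optimal one.

```shell
# Sketch: force -O1 on all C++ sources until CMake sets optimisation
# flags for LLVM-based builds automatically (addressed in #734).
cmake -DCMAKE_CXX_FLAGS=-O1 <other-configure-options> <coreneuron-source-dir>
```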

TODO: Go back to the nrnivmodl-core based build and see how this could be tested there. (tomorrow)

pramodk avatar Jan 04 '22 22:01 pramodk

Summary of the IBM XL compiler issues

Issues reported on llvm openmp-dev mailing list: https://lists.llvm.org/pipermail/openmp-dev/2022-January/004276.html

  • Issue/Question #1: both shared and static libraries work
  • Issue/Question #2: the static library works, but the shared library fails to link:
+ CXX=xlc++_r
+ CXXFLAGS='-fopenmp -fPIC -qsmp=omp -qoffload -g -O2'
+ xlc++_r -fopenmp -fPIC -qsmp=omp -qoffload -g -O2 -c test.cpp
+ ar cq libtest.a test.o
+ xlc++_r -fopenmp -fPIC -qsmp=omp -qoffload -g -O2 -o test1 main.cpp -L. -ltest
+ xlc++_r -fopenmp -fPIC -qsmp=omp -qoffload -g -O2 -o test2 main.cpp test.o
+ rm test.o
+ xlc++_r -fopenmp -fPIC -qsmp=omp -qoffload -g -O2 -fpic -shared test.cpp -o libtest.so
/cineca/prod/opt/compilers/xl/16.1.1_sp4.1/binary/xlC/16.1.1/bin/.orig/xlc++_r: warning: 1501-269 fpic is not supported on this Operating System platform.  Option fpic will be ignored.
/cineca/prod/opt/compilers/xl/16.1.1_sp4.1/binary/xlC/16.1.1/bin/.orig/xlc++_r: warning: 1501-308 The device linker only supports static linking.  Any device code placed into a shared library by the qmkshrobj option will be inaccessible.
/cineca/prod/opt/compilers/xl/16.1.1_sp4.1/binary/xlC/16.1.1/bin/.orig/minilink: warning: 1501-308 The device linker only supports static linking.  Any device code placed into a shared library by the qmkshrobj option will be inaccessible.
+ xlc++_r -fopenmp -fPIC -qsmp=omp -qoffload -g -O2 -o test3 main.cpp -L. -ltest -Wl,-rpath .
nvlink error   : Undefined reference to 'y' in '/tmp/24507_0.o'
...
...
$ nvprof ./test1
==29304== NVPROF is profiling process 29304, command: ./test1
--> 0
--> 1
--> 4
--> 2
--> 3
==29304== Profiling application: ./test1
==29304== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:   79.20%  41.659us         1  41.659us  41.659us  41.659us  __xl_main_l11_OL_1
                   11.99%  6.3040us         3  2.1010us  1.8560us  2.5280us  [CUDA memcpy DtoH]
                    4.98%  2.6200us         1  2.6200us  2.6200us  2.6200us  __xl__Z4testv_l8_OL_1
  • Another issue that took quite some time to debug is the following:

The historically famous one:

                int *_displ = nrb->_displ;
                int _displ_cnt = nrb->_displ_cnt;
                #pragma omp target update to(_displ[0:_displ_cnt])

vs.

                #pragma omp target update to(nrb->_displ[0:nrb->_displ_cnt])
  • The first form updates the contents of the array.
  • The second form also updates the _displ pointer itself from host to device (and hence results in a wrong pointer on the device side).

In coreneuron/gpu/nrn_acc_manager.cpp under update_net_receive_buffer() I did:

#if 0
                #pragma omp target update to(_displ[0:nrb->_displ_cnt])
#else
                #pragma omp target update to(nrb->_displ[0:nrb->_displ_cnt])
#endif
                #pragma omp target
                {
                        printf("nrb->_displ :%p \n", nrb->_displ);
                }
                abort();

And this produces the following in the first and second case, respectively:

Updating nrb now
nrb->_displ :0x7ffef000b100
Aborted (core dumped)
....
nrb->_displ :0x26a6b9c0
Aborted (core dumped)

pramodk avatar Jan 05 '22 13:01 pramodk