CoreNeuron Support async execution in OpenMP wherever it's supported

Description

Added asynchronous execution of kernels in multiple GPU streams. There are still some constructs that the compiler (NVHPC 21.11) doesn't support:

#pragma omp target update to(<variable>) depend(inout: stream) nowait is not working, even though the corresponding update from form works. There is an internal compiler error whenever depend(..) nowait is added to a target update to directive:

coreneuron::nrn_fixed_step_lastpart(coreneuron::NrnThread *):
    386, Taskwait
         Generating update to(nth->_t)
/gpfs/bbp.cscs.ch/ssd/apps/hpc/jenkins/pulls/1392/deploy/externals/2021-12-10/linux-rhel7-x86_64/gcc-9.3.0/nvhpc-21.11-qhk3q2/Linux_x86_64/21.11/compilers/share/llvm/bin/opt: /gpfs/bbp.cscs.ch/ssd/slurmTmpFS/magkanar/140832/nvc++xfYwftRovDgD.ll:144924:43: error: use of undefined value '%.d0009.addr'
        %41 = bitcast [1 x %struct.struct_deps]* %.d0009.addr to i8*, !dbg !120921
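
For reference, the pattern can be reduced to something like the sketch below. This is not CoreNEURON code: scale_on_device, data, t and the stream token are hypothetical names chosen for illustration. The failing directive is left as a comment and replaced by a synchronous target update to, which is one possible fallback and not necessarily what this PR does. The kernel and the update from keep depend(inout: stream) nowait, since those forms are reported to work.

```cpp
#include <vector>

// Hypothetical helper for illustration only; not part of CoreNEURON.
// Doubles t on the host, pushes it to the device, scales v by it on the
// device and returns the sum of the scaled values.
double scale_on_device(std::vector<double>& v, double t) {
    double* data = v.data();
    const int n = static_cast<int>(v.size());
    char stream = 0;  // dependency token standing in for one GPU stream
    (void) stream;    // avoids an "unused variable" warning when OpenMP is off

    #pragma omp target enter data map(to: data[0:n], t)

    t *= 2.0;  // host-side change that the device copy has to pick up

    // Form reported above to hit the internal compiler error with NVHPC 21.11:
    //   #pragma omp target update to(t) depend(inout: stream) nowait
    // Synchronous fallback (assumption: nothing is queued on the stream yet,
    // so no ordering with earlier asynchronous work is lost):
    #pragma omp target update to(t)

    // Asynchronous kernel ordered through the stream token.
    #pragma omp target teams distribute parallel for \
        map(to: t) map(tofrom: data[0:n]) depend(inout: stream) nowait
    for (int i = 0; i < n; ++i) {
        data[i] *= t;
    }

    // Asynchronous copy back, ordered after the kernel by the same token.
    #pragma omp target update from(data[0:n]) depend(inout: stream) nowait

    // Wait for everything queued above before touching the data on the host.
    #pragma omp taskwait

    #pragma omp target exit data map(delete: data[0:n], t)

    double sum = 0.0;
    for (int i = 0; i < n; ++i) {
        sum += data[i];
    }
    return sum;
}
```

The sketch gives the same result when the pragmas are ignored or when the target regions fall back to the host, so it can be used to compare compilers.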

#pragma omp taskwait depend(inout: stream) is not working, even though it is referenced in an NVIDIA presentation.
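
A minimal sketch of what waiting on a single stream could look like without taskwait depend is shown below. It is an assumption, not something verified with NVHPC 21.11: an empty undeferred task (if(0)) carrying the same dependence is used in place of the OpenMP 5.0 taskwait depend form, and a plain taskwait then waits for all remaining work. run_two_streams, stream_a and stream_b are hypothetical names.

```cpp
#include <cassert>

// Hypothetical illustration; the names are not taken from CoreNEURON.
// Launches one asynchronous target region per "stream", waits for the first
// stream only, then waits for everything that is still outstanding.
int run_two_streams() {
    int a = 0;
    int b = 0;
    char stream_a = 0;  // dependency token for the first stream
    char stream_b = 0;  // dependency token for the second stream
    (void) stream_a;    // avoid "unused variable" warnings when OpenMP is off
    (void) stream_b;

    #pragma omp target map(tofrom: a) depend(inout: stream_a) nowait
    { a = 1; }

    #pragma omp target map(tofrom: b) depend(inout: stream_b) nowait
    { b = 2; }

    // OpenMP 5.0 form reported above as not working with NVHPC 21.11:
    //   #pragma omp taskwait depend(inout: stream_a)
    // Possible substitute (assumption, untested with NVHPC): an empty
    // undeferred task with the same dependence. The encountering thread
    // cannot continue until the work queued on stream_a has completed.
    #pragma omp task if(0) depend(inout: stream_a)
    { }

    assert(a == 1);  // work on stream_a is done; stream_b may still be running

    // Plain taskwait: waits for all outstanding tasks, including stream_b.
    #pragma omp taskwait

    return a + b;
}
```

The plain taskwait is the coarser option: it is simpler, but it serializes all streams at that point instead of only the one named in the dependence.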

How to test this?

module load unstable
module load cmake git flex bison python-dev hpe-mpi/2.25.hmpt
module unload hpe-mpi/2.22.hmpt py-mpi4py
module load caliper
module unload cuda/11.0.2
module load gcc
module load boost
module use /gpfs/bbp.cscs.ch/ssd/apps/hpc/jenkins/pulls/1392/deploy/compilers/2021-12-10/modules/tcl/linux-rhel7-x86_64
module use /gpfs/bbp.cscs.ch/ssd/apps/hpc/jenkins/pulls/1392/deploy/externals/2021-12-10/modules/tcl/linux-rhel7-x86_64
module load nvhpc/21.11 cuda/11.5.1
cmake .. \
          -DCMAKE_INSTALL_PREFIX=./install \
          -DCORENRN_ENABLE_TIMEOUT=OFF \
          -DNRN_ENABLE_INTERVIEWS=OFF \
          -DNRN_ENABLE_RX3D=OFF \
          -DNRN_ENABLE_MPI=ON \
          -DCORENRN_ENABLE_OPENMP=ON \
          -DNRN_ENABLE_CORENEURON=ON \
          -DCORENRN_ENABLE_GPU=ON \
          -DCORENRN_ENABLE_NMODL=ON \
          -DCORENRN_NMODL_DIR=<nmodl_dir> \
          -DNRN_ENABLE_PYTHON=ON \
          -DPYTHON_EXECUTABLE=$(which python3) \
          -DNRN_ENABLE_TESTS=OFF \
          -DCORENRN_ENABLE_UNIT_TESTS=OFF \
          -DCMAKE_C_COMPILER=$CC \
          -DCMAKE_CXX_COMPILER=$CXX \
          -DCMAKE_CUDA_COMPILER=nvcc \
          -DCMAKE_BUILD_TYPE=RelWithDebInfo \
          -DCORENRN_ENABLE_CALIPER_PROFILING=ON \
          -DCORENRN_ENABLE_OPENMP_OFFLOAD=ON \
          -DCMAKE_CXX_FLAGS="-Minfo=accel -gopt -tp=skylake-avx512"
cmake --build . --parallel 40 --target install

Test System

OS: RedHat
Compiler: NVHPC 21.11
Version: hackathon_main
Backend: GPU

Dec 21 '21 14:12 iomaganaris

Logfiles from GitLab pipeline #30139 (:no_entry:) have been uploaded here!

Dec 21 '21 16:12 bbpbuildbot

Logfiles from GitLab pipeline #30403 (:no_entry:) have been uploaded here!

Dec 23 '21 13:12 bbpbuildbot