
Support for SHARED build with PGI OpenACC

Open pramodk opened this issue 6 years ago • 7 comments

If we build a shared library with OpenACC, we get:

cuda_init_device(threadid=1, device 0) dindex=1, api_context=0x95d090
pgi_uacc_cuda_static hostptr=0x2aaaabedc990-0x2aaaabedc998 size=8 name=_ZN10coreneuron14_slist1_ExpSynE flags=0x100
Failing in Thread:1
call to cuModuleGetGlobal returned error 500: Not found
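
The failing symbol demangles to coreneuron::_slist1_ExpSyn, i.e. a file-scope global that the generated code marks with acc declare create. A minimal sketch of that pattern (names and sizes are illustrative, not the actual mod2c output):

// mech.cpp -- compiled into a shared library, e.g. pgc++ -acc -shared -fPIC
namespace coreneuron {
// file-scope globals emitted for the ExpSyn mechanism
int _slist1_ExpSyn[1];
int _dlist1_ExpSyn[1];
#pragma acc declare create(_slist1_ExpSyn)
#pragma acc declare create(_dlist1_ExpSyn)
}  // namespace coreneuron

// When this .so is loaded, the OpenACC runtime registers the global and tries
// to look up its device-side copy via cuModuleGetGlobal(), which is what fails
// above with error 500 (Not found).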

pramodk avatar Feb 26 '19 23:02 pramodk

This is quite painful! All the details about the issue are in: https://forums.developer.nvidia.com/t/clarification-on-using-openacc-in-a-shared-library/136279

pramodk avatar Jul 12 '20 00:07 pramodk

And a second example/question: https://forums.developer.nvidia.com/t/problem-with-openacc-with-variable-initialization/134612/2

pramodk avatar Jul 12 '20 08:07 pramodk

Due to the lack of a fix in the PGI compiler, I am closing this issue. For now, we create a static library with nrnivmodl-core and that gets linked to NEURON. NEURON users are now supposed to use nrnivmodl -coreneuron mod, which internally calls nrnivmodl-core and links the CoreNEURON mechanism library. This way we avoid the need for shared library support.

pramodk avatar Nov 30 '20 09:11 pramodk

As this topic reappears often in practical usage of NEURON GPU support (e.g. see the internal NEURON Slack discussion), I am reopening this ticket to discuss & prototype strategies going forward. The overall goal is to enable shared library support by avoiding the current limitation coming from the use of global variables + shared libraries in OpenACC.

  • See this comment on the NVIDIA forum. The linked example summarises the key issue, i.e. a shared library can't be loaded/linked if we use the acc declare clause.

  • Even if we separate CPU code from GPU code completely and build all GPU/OpenACC code into a shared library, it won't help.

  • I am wondering if we could get rid of acc declare ~and entirely rely on enter data clauses~ somehow.

    • Current usage of global variables from coreneuron side is minimal:
    $ grep -r "acc(declare" coreneuron/coreneuron/*
    coreneuron/nrnconf.h:nrn_pragma_acc(declare create(celsius))
    coreneuron/nrnconf.h:nrn_pragma_acc(declare create(pi))
    coreneuron/nrnconf.h:nrn_pragma_acc(declare create(secondorder))
    

    and mod2c usage is:

     ✗ grep -r "acc declare" external/mod2c/src/*
     external/mod2c/src/mod2c_core/parse1.y:			"#pragma acc declare copyin(%s)\n"
     external/mod2c/src/mod2c_core/parse1.y:		    Sprintf(buf, "#pragma acc declare copyin(%s)\n",
     external/mod2c/src/mod2c_core/parse1.y:		    Sprintf(buf, "#pragma acc declare copyin(%s)\n",
     external/mod2c/src/mod2c_core/nocpout.c:                "#pragma acc declare copyin(_celsius_)\n", suffix);
     external/mod2c/src/mod2c_core/nocpout.c:		  "#pragma acc declare copyin (_mechtype)\n"
     external/mod2c/src/mod2c_core/nocpout.c:				Sprintf(buf, "#pragma acc declare copyin (%s,%d)\n", s->name, s->araydim);
     external/mod2c/src/mod2c_core/nocpout.c:				Sprintf(buf, "#pragma acc declare copyin (%s)\n", s->name);
     external/mod2c/src/mod2c_core/nocpout.c:                             "#pragma acc declare create(%s)\n",
     external/mod2c/src/mod2c_core/nocpout.c:                             "#pragma acc declare copyin(%s)\n",
     external/mod2c/src/mod2c_core/kinetic.c:	  "#pragma acc declare create(_slist%d)\n"
     external/mod2c/src/mod2c_core/kinetic.c:	  "#pragma acc declare create(_dlist%d)\n"
     external/mod2c/src/mod2c_core/deriv.c:	  "#pragma acc declare create(_slist%d)\n"
     external/mod2c/src/mod2c_core/deriv.c:	  "#pragma acc declare create(_dlist%d)\n"
     external/mod2c/src/mod2c_core/simultan.c:	  "#pragma acc declare create(_slist%d)\n"
    

    Note that mod2c's global variable usage in the generated .cpp files is at file scope, so we might be able to get away with just enter data copying? (See the sketch after this list.)
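
To make the idea concrete, here is a rough sketch (not tested, names are illustrative) of what "just enter data copying" could look like for the CoreNEURON-side globals listed above:

// nrnconf.h today attaches nrn_pragma_acc(declare create(...)) to these globals.
// A sketched alternative: plain globals plus an explicit, one-time mapping.
namespace coreneuron {
double celsius = 6.3;
double pi = 3.141592653589793;
int secondorder = 0;

// hypothetical helper, called once during initialization
inline void copy_globals_to_device() {
    #pragma acc enter data copyin(celsius, pi, secondorder)
}

// and whenever a value changes on the host side afterwards:
inline void update_celsius_on_device() {
    #pragma acc update device(celsius)
}
}  // namespace coreneuron

Device code would then have to access these through explicit data clauses (e.g. present(...)) rather than relying on declare; the struct-plus-pointer experiment in the next comment goes in that direction.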

pramodk avatar Apr 02 '22 05:04 pramodk

One of the examples I am fiddling with in order to find out how far we have to go:

$ cat main.cpp

#include <dlfcn.h>
#include <iostream>

int main() {

  // vanilla thing without any error checking

  void *handle = dlopen("./test.so", RTLD_LAZY);

  typedef void (*init_t)(double);
  typedef void (*sample_t)();

  init_t init = (init_t)dlsym(handle, "init");
  sample_t sample = (sample_t)dlsym(handle, "sample");

  init(5.5);
  sample();

  dlclose(handle);
  return 0;
}

$ cat test.cpp

#include <stdio.h>
#include <stdlib.h>

// these two global variables are now wrapped into a struct
struct Foo {
  float my_data[3];
  double my_celsius = 1;
};

// we have an additional pointer X because we can't say `acc ... present(global_data)`
Foo global_data;
Foo *X;

// these macros keep code generation easy by redirecting to the pointer variable
#define my_data X->my_data
#define my_celsius X->my_celsius


// setup function we need to have
extern "C" void init(double val) {
  X = &global_data;
  my_data[0] = 1.1;
  my_data[1] = 2.1;
  my_data[2] = 3.1;
  my_celsius = val;

#pragma acc enter data copyin(X [0:1])

  my_celsius = 2;
}

// functions now explicitly need to receive X; this will be a change in the prototype declaration macro
#pragma acc routine seq
inline double foo(Foo *X) {
  return my_celsius + my_data[0] + my_data[1] + my_data[2];
}

// just sample kernel
extern "C" void sample() {
  printf("CPU: my_celsius = %lf \n", my_celsius);

  #pragma acc kernels present(X)
  {
    double val = foo(X);
    printf("GPU: val = %lf , my_celsius = %lf \n", val, my_celsius);
  }
}


With or without nordc, it doesn't matter now:
$ cat build.sh

rm -rf a.out test.so *.o

# Build 1
pgc++ -acc -ta=tesla:nordc -Minfo test.cpp -c -fPIC
pgc++ -acc -ta=tesla:nordc -shared -o test.so test.o
pgc++ -acc -ta=tesla:nordc main.cpp
./a.out
rm -f a.out test.so *.o

# Build 2
pgc++ -acc -Minfo test.cpp -c -fPIC
pgc++ -acc -shared -o test.so test.o
pgc++ -acc main.cpp
./a.out
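
As an aside, the loader in main.cpp above deliberately skips error handling; a slightly more defensive variant of the same thing (standard dlfcn API only) would be:

#include <dlfcn.h>
#include <cstdio>
#include <cstdlib>

int main() {
  void* handle = dlopen("./test.so", RTLD_LAZY);
  if (!handle) {
    std::fprintf(stderr, "dlopen failed: %s\n", dlerror());
    return EXIT_FAILURE;
  }

  dlerror();  // clear any stale error state before dlsym
  auto init = reinterpret_cast<void (*)(double)>(dlsym(handle, "init"));
  auto sample = reinterpret_cast<void (*)()>(dlsym(handle, "sample"));
  if (const char* err = dlerror()) {
    std::fprintf(stderr, "dlsym failed: %s\n", err);
    dlclose(handle);
    return EXIT_FAILURE;
  }

  init(5.5);
  sample();
  dlclose(handle);
  return EXIT_SUCCESS;
}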

pramodk avatar Apr 02 '22 10:04 pramodk

For future reference: with some ongoing patches I am making to enable the shared library, I ended up with a wall of errors like the following:

__fatbinwrap_98_tmpxft_00028c08_00000000_8_...
...

After spending significant time, this NVIDIA forum thread came to the rescue: https://forums.developer.nvidia.com/t/separate-compilation-of-mixed-cuda-openacc-code/192701 !! 🙄

The prototype (very much WIP) looks promising and is able to run ringtest via Python with shared libraries:


$ ls -lrt x86_64/
total 1162
-rw-r--r-- 1 kumbhar bbp     357 Apr  3 00:52 makemod2c_inc
-rw-r--r-- 1 kumbhar bbp     421 Apr  3 00:52 mod_func.cpp
drwxr-xr-x 4 kumbhar bbp    4096 Apr  3 00:52 corenrn
-rwxr-xr-x 1 kumbhar bbp 1005256 Apr  3 00:53 libcorenrnmech.so
-rwxr-xr-x 1 kumbhar bbp   24088 Apr  3 00:53 special-core
-rw-r--r-- 1 kumbhar bbp    9638 Apr  3 00:53 halfgap.c
-rw-r--r-- 1 kumbhar bbp    5512 Apr  3 00:53 mod_func.o
-rw-r--r-- 1 kumbhar bbp   32184 Apr  3 00:53 halfgap.o
-rwxr-xr-x 1 kumbhar bbp   36576 Apr  3 00:53 libnrnmech.so
-rw-r--r-- 1 kumbhar bbp   16272 Apr  3 00:53 nrnmain.o
-rwxr-xr-x 1 kumbhar bbp   34048 Apr  3 00:53 special

kumbhar@ldir01u09:~/workarena/systems/bbpv/repos/bbp/ringtest$ python ringtest.py -coreneuron -gpu -permute 1 -tstop 1
numprocs=1
[10, 20] [1, 1]
nring=16
cell per ring=8
ncell_per_type=8
ntype=16
0.02s created rings
0.00999999s initialized
 num_mpi=1
 num_omp_thread=1

 Info : 4 GPUs shared by 1 ranks per node

 Duke, Yale, and the BlueBrain Project -- Copyright 1984-2020
 Version : 1.0 b938894 (2022-04-03 00:44:34 +0200)

 Additional mechanisms from files
 exp2syn.mod expsyn.mod halfgap.mod hh.mod netstim.mod passive.mod pattern.mod stim.mod svclmp.mod

 Memory (MBs) :             After mk_mech : Max 202.6094, Min 202.6094, Avg 202.6094
 GPU Memory (MBs) : Used = 309.000000, Free = 32201.500000, Total = 32510.500000
 Memory (MBs) :            After MPI_Init : Max 202.6094, Min 202.6094, Avg 202.6094
 GPU Memory (MBs) : Used = 309.000000, Free = 32201.500000, Total = 32510.500000
 Memory (MBs) :          Before nrn_setup : Max 202.6094, Min 202.6094, Avg 202.6094
 GPU Memory (MBs) : Used = 309.000000, Free = 32201.500000, Total = 32510.500000
 Setup Done   : 0.00 seconds
 Model size   : 1.09 MB
 Memory (MBs) :          After nrn_setup  : Max 202.9258, Min 202.9258, Avg 202.9258
 GPU Memory (MBs) : Used = 309.000000, Free = 32201.500000, Total = 32510.500000
GENERAL PARAMETERS
--mpi=true
--mpi-lib=
--gpu=true
--dt=0.025
--tstop=1

GPU
--nwarp=65536
--cell-permute=1
--cuda-interface=false

INPUT PARAMETERS
--voltage=1000
--seed=-1
--datpath=.
--filesdat=files.dat
--pattern=
--report-conf=
--restore=

PARALLEL COMPUTATION PARAMETERS
--threading=false
--skip_mpi_finalize=true

SPIKE EXCHANGE
--ms_phases=2
--ms_subintervals=2
--multisend=false
--spk_compress=0
--binqueue=false

CONFIGURATION
--spikebuf=100000
--prcellgid=-1
--forwardskip=0
--celsius=6.3
--mindelay=10
--report-buffer-size=4

OUTPUT PARAMETERS
--dt_io=0.1
--outpath=.
--checkpoint=

 Start time (t) = 0

 Memory (MBs) :  After mk_spikevec_buffer : Max 202.9258, Min 202.9258, Avg 202.9258
 GPU Memory (MBs) : Used = 309.000000, Free = 32201.500000, Total = 32510.500000
 Memory (MBs) :     After nrn_finitialize : Max 267.3008, Min 267.3008, Avg 267.3008
 GPU Memory (MBs) : Used = 311.000000, Free = 32199.500000, Total = 32510.500000

psolve |=========================================================| t: 1.00   ETA: 0h00m00s

Solver Time : 0.0157009


 Simulation Statistics
 Number of cells: 128
 Number of compartments: 4576
 Number of presyns: 144
 Number of input presyns: 0
 Number of synapses: 144
 Number of point processes: 272
 Number of transfer sources: 0
 Number of transfer targets: 0
 Number of spikes: 0
 Number of spikes with non negative gid-s: 0
3.07s run
0.00999999s wrote 0 spikes
runtime=3.07  load_balance=0.0%  avg_comp_time=0
spk_time max=0 min=0
gap_time max=0 min=0

Note that nvprof alone doesn't work:

$ nvprof python ringtest.py -coreneuron -gpu -permute 1 -tstop 1
numprocs=1
[10, 20] [1, 1]
nring=16
cell per ring=8
ncell_per_type=8
ntype=16
0.02s created rings
0.00999999s initialized
==33325== NVPROF is profiling process 33325, command: python ringtest.py -coreneuron -gpu -permute 1 -tstop 1
 num_mpi=1
 num_omp_thread=1

Missing definition of the OpenACC API routine/s in the OpenACC library linked to the application. To work around this issue either force the inclusion of all the OpenACC symbols in the binary or link the OpenACC library dynamically.

But this seems to be an upstream issue with nvprof, as reported in https://github.com/illuhad/hipSYCL/issues/598 and https://github.com/JuliaGPU/CUDA.jl/issues/1283.

Using nvprof --openacc-profiling off works fine:

$ nvprof --openacc-profiling off python ringtest.py -coreneuron -gpu -permute 1 -tstop 1
numprocs=1
[10, 20] [1, 1]
nring=16
cell per ring=8
ncell_per_type=8
ntype=16
0.02s created rings
0.00999999s initialized
==34153== NVPROF is profiling process 34153, command: python ringtest.py -coreneuron -gpu -permute 1 -tstop 1
 num_mpi=1
 num_omp_thread=1

 Info : 4 GPUs shared by 1 ranks per node

 Duke, Yale, and the BlueBrain Project -- Copyright 1984-2020
 Version : 1.0 b938894 (2022-04-03 00:44:34 +0200)

 Additional mechanisms from files
 exp2syn.mod expsyn.mod halfgap.mod hh.mod netstim.mod passive.mod pattern.mod stim.mod svclmp.mod

 Memory (MBs) :             After mk_mech : Max 234.3711, Min 234.3711, Avg 234.3711
 GPU Memory (MBs) : Used = 309.000000, Free = 32201.500000, Total = 32510.500000
 Memory (MBs) :            After MPI_Init : Max 234.4766, Min 234.4766, Avg 234.4766
 GPU Memory (MBs) : Used = 309.000000, Free = 32201.500000, Total = 32510.500000
 Memory (MBs) :          Before nrn_setup : Max 234.5117, Min 234.5117, Avg 234.5117
 GPU Memory (MBs) : Used = 309.000000, Free = 32201.500000, Total = 32510.500000
 Setup Done   : 0.00 seconds
 Model size   : 1.09 MB
 Memory (MBs) :          After nrn_setup  : Max 234.5117, Min 234.5117, Avg 234.5117
 GPU Memory (MBs) : Used = 309.000000, Free = 32201.500000, Total = 32510.500000
GENERAL PARAMETERS
--mpi=true
--mpi-lib=
--gpu=true
--dt=0.025
--tstop=1

GPU
--nwarp=65536
--cell-permute=1
--cuda-interface=false

INPUT PARAMETERS
--voltage=1000
--seed=-1
--datpath=.
--filesdat=files.dat
--pattern=
--report-conf=
--restore=

PARALLEL COMPUTATION PARAMETERS
--threading=false
--skip_mpi_finalize=true

SPIKE EXCHANGE
--ms_phases=2
--ms_subintervals=2
--multisend=false
--spk_compress=0
--binqueue=false

CONFIGURATION
--spikebuf=100000
--prcellgid=-1
--forwardskip=0
--celsius=6.3
--mindelay=10
--report-buffer-size=4

OUTPUT PARAMETERS
--dt_io=0.1
--outpath=.
--checkpoint=

 Start time (t) = 0

 Memory (MBs) :  After mk_spikevec_buffer : Max 234.5117, Min 234.5117, Avg 234.5117
 GPU Memory (MBs) : Used = 309.000000, Free = 32201.500000, Total = 32510.500000
 Memory (MBs) :     After nrn_finitialize : Max 299.1445, Min 299.1445, Avg 299.1445
 GPU Memory (MBs) : Used = 311.000000, Free = 32199.500000, Total = 32510.500000

psolve |=========================================================| t: 1.00   ETA: 0h00m00s

Solver Time : 0.0206989


 Simulation Statistics
 Number of cells: 128
 Number of compartments: 4576
 Number of presyns: 144
 Number of input presyns: 0
 Number of synapses: 144
 Number of point processes: 272
 Number of transfer sources: 0
 Number of transfer targets: 0
 Number of spikes: 0
 Number of spikes with non negative gid-s: 0
2.53s run
0s wrote 0 spikes
runtime=2.53  load_balance=0.0%  avg_comp_time=0
spk_time max=0 min=0
gap_time max=0 min=0
==34153== Profiling application: python ringtest.py -coreneuron -gpu -permute 1 -tstop 1
==34153== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:   46.19%  2.2915ms        40  57.288us  56.383us  57.983us  coreneuron::solve_interleaved1_653_gpu(int)
                    6.64%  329.47us        40  8.2360us  8.0950us  10.976us  coreneuron::_nrn_state__hh_593_gpu(coreneuron::NrnThread*, coreneuron::Memb_list*, int)
                    5.56%  275.77us        40  6.8940us  6.7830us  7.3600us  coreneuron::_nrn_cur__hh_537_gpu(coreneuron::NrnThread*, coreneuron::Memb_list*, int)
                    5.20%  258.01us        80  3.2250us  3.1350us  3.7750us  coreneuron::nrn_cur_ion_265_gpu(coreneuron::NrnThread*, coreneuron::Memb_list*, int)
                    3.93%  194.91us        40  4.8720us  4.7990us  6.1120us  coreneuron::NetCvode::check_thresh_536_gpu(coreneuron::NrnThread*)
                    3.63%  180.19us        40  4.5040us  4.4470us  4.9920us  coreneuron::_nrn_cur__ExpSyn_507_gpu(coreneuron::NrnThread*, coreneuron::Memb_list*, int)
                    3.18%  157.63us        40  3.9400us  3.9030us  4.2880us  coreneuron::_nrn_cur__pas_320_gpu(coreneuron::NrnThread*, coreneuron::Memb_list*, int)
                    3.15%  156.25us        40  3.9060us  3.8390us  4.3840us  coreneuron::_nrn_state__ExpSyn_584_gpu(coreneuron::NrnThread*, coreneuron::Memb_list*, int)
                    2.70%  133.82us        40  3.3450us  3.2950us  3.6160us  _INTERNAL_119__gpfs_bbp_cscs_ch_home_kumbhar_workarena_systems_bbpv_repos_bbp_nrn_external_coreneuron_coreneuron_sim_treeset_core_cpp_45635c46::coreneuron::nrn_rhs_83_gpu(_INTERNAL_119__gpfs_bbp_cscs_ch_home_kumbhar_workarena_systems_bbpv_repos_bbp_nrn_external_coreneuron_coreneuron_sim_treeset_core_cpp_45635c46::coreneuron::NrnThread*)
                    2.68%  133.02us        40  3.3250us  3.2310us  3.7760us  coreneuron::nrn_jacob_capacitance_75_gpu(coreneuron::NrnThread*, coreneuron::Memb_list*, int)
                    2.68%  132.93us        40  3.3230us  3.2640us  3.7120us  coreneuron::nrn_cur_capacitance_115_gpu(coreneuron::NrnThread*, coreneuron::Memb_list*, int)
                    2.57%  127.32us        40  3.1830us  3.1350us  3.4880us  _INTERNAL_119__gpfs_bbp_cscs_ch_home_kumbhar_workarena_systems_bbpv_repos_bbp_nrn_external_coreneuron_coreneuron_sim_treeset_core_cpp_45635c46::coreneuron::nrn_lhs_160_gpu(_INTERNAL_119__gpfs_bbp_cscs_ch_home_kumbhar_workarena_systems_bbpv_repos_bbp_nrn_external_coreneuron_coreneuron_sim_treeset_core_cpp_45635c46::coreneuron::NrnThread*)
                    2.49%  123.36us        40  3.0830us  3.0390us  3.5520us  coreneuron::update_214_gpu(coreneuron::NrnThread*)
                    2.33%  115.52us        40  2.8870us  2.8470us  3.1360us  coreneuron::_net_buf_receive_ExpSyn_343_gpu(coreneuron::NrnThread*)
                    2.32%  115.17us        40  2.8790us  2.8470us  3.2960us  _INTERNAL_119__gpfs_bbp_cscs_ch_home_kumbhar_workarena_systems_bbpv_repos_bbp_nrn_external_coreneuron_coreneuron_sim_treeset_core_cpp_45635c46::coreneuron::nrn_rhs_33_gpu(_INTERNAL_119__gpfs_bbp_cscs_ch_home_kumbhar_workarena_systems_bbpv_repos_bbp_nrn_external_coreneuron_coreneuron_sim_treeset_core_cpp_45635c46::coreneuron::NrnThread*)
                    2.31%  114.72us        80  1.4330us  1.4070us  1.9840us  [CUDA memcpy HtoD]
                    1.27%  63.199us        40  1.5790us  1.5350us  2.3680us  [CUDA memcpy DtoH]
                    1.17%  57.948us        40  1.4480us  1.4070us  1.9840us  [CUDA memset]
      API calls:   68.05%  139.46ms         1  139.46ms  139.46ms  139.46ms  cuDevicePrimaryCtxRetain
                   16.15%  33.095ms         1  33.095ms  33.095ms  33.095ms  cuDevicePrimaryCtxRelease
                    6.00%  12.291ms         1  12.291ms  12.291ms  12.291ms  cuMemFreeHost
                    4.84%  9.9172ms      2120  4.6770us     615ns  64.159us  cuStreamSynchronize
                    1.42%  2.9034ms         4  725.86us  723.46us  730.89us  cuDeviceTotalMem
                    1.29%  2.6414ms       640  4.1270us  3.3240us  39.978us  cuLaunchKernel
                    0.72%  1.4740ms         9  163.78us  63.881us  550.87us  cuModuleLoadDataEx
                    0.59%  1.2178ms       412  2.9550us     121ns  137.04us  cuDeviceGetAttribute
                    0.23%  469.74us         1  469.74us  469.74us  469.74us  cuPointerGetAttributes
                    0.18%  365.32us       280  1.3040us     687ns  5.1330us  cuCtxSynchronize
                    0.16%  327.53us         1  327.53us  327.53us  327.53us  cudaFree
                    0.14%  290.16us        80  3.6270us  2.7900us  31.853us  cuMemcpyHtoDAsync
                    0.08%  162.03us        40  4.0500us  3.5820us  7.9390us  cuMemcpyDtoHAsync
                    0.06%  119.75us        40  2.9930us  2.4350us  14.796us  cuMemsetD32Async
                    0.06%  117.54us         4  29.385us  26.976us  35.111us  cuDeviceGetName
                    0.02%  40.819us         1  40.819us  40.819us  40.819us  cudaMemGetInfo
                    0.01%  13.925us        29     480ns     272ns     890ns  cuModuleGetFunction
                    0.01%  11.820us         1  11.820us  11.820us  11.820us  cuMemAlloc
                    0.01%  10.284us         4  2.5710us     925ns  5.9480us  cuDeviceGetPCIBusId
                    0.00%  4.2480us         6     708ns     149ns  2.1110us  cuCtxSetCurrent
                    0.00%  4.0800us        12     340ns     118ns  1.9800us  cuDeviceGet
                    0.00%  2.9660us         3     988ns     578ns  1.6810us  cuCtxGetCurrent
                    0.00%  1.9650us         1  1.9650us  1.9650us  1.9650us  cuInit
                    0.00%  1.4000us         4     350ns     145ns     559ns  cuDeviceGetCount
                    0.00%  1.1430us         4     285ns     184ns     581ns  cuDeviceComputeCapability
                    0.00%     665ns         4     166ns     136ns     226ns  cuDeviceGetUuid
                    0.00%     644ns         2     322ns     281ns     363ns  cuCtxGetDevice
                    0.00%     208ns         1     208ns     208ns     208ns  cuDriverGetVersion

pramodk avatar Apr 02 '22 23:04 pramodk

One funny behaviour: when spike exchange is happening, we see a segfault if launched with python, but the same library/executable works if launched with special:

$ cuda-memcheck x86_64/special -python ringtest.py -coreneuron -gpu -permute 1 -tstop 1.03
....
Solver Time : 1.91957


 Simulation Statistics
 Number of cells: 128
 Number of compartments: 4576
 Number of presyns: 144
 Number of input presyns: 0
 Number of synapses: 144
 Number of point processes: 272
 Number of transfer sources: 0
 Number of transfer targets: 0
 Number of spikes: 0
 Number of spikes with non negative gid-s: 0
1.97s run
0s wrote 0 spikes
runtime=1.97  load_balance=0.0%  avg_comp_time=0
spk_time max=0 min=0
gap_time max=0 min=0
========= ERROR SUMMARY: 0 errors

# BUT

$ cuda-memcheck python ringtest.py -coreneuron -gpu -permute 1 -tstop 1.03
...
=========     Host Frame:python [0x108e]
=========
========= Invalid __global__ write of size 8
=========     at 0x00000420 in /gpfs/bbp.cscs.ch/home/kumbhar/workarena/systems/bbpv/repos/bbp/ringtest/x86_64/corenrn/mod2c/expsyn.cpp:355:coreneuron::_net_buf_receive_ExpSyn_345_gpu(coreneuron::NrnThread*)
=========     by thread (14,0,0) in block (0,0,0)
=========     Address 0x05e94f38 is out of bounds
=========     Device Frame:/gpfs/bbp.cscs.ch/home/kumbhar/workarena/systems/bbpv/repos/bbp/ringtest/x86_64/corenrn/mod2c/expsyn.cpp:355:coreneuron::_net_buf_receive_ExpSyn_345_gpu(coreneuron::NrnThread*) (coreneuron::_net_buf_receive_ExpSyn_345_gpu(coreneuron::NrnThread*) : 0x420)
=========     Saved host backtrace up to driver entry point at kernel launch time
=========     Host Frame:/lib64/libcuda.so.1 [0x20d6ea]
=========     Host Frame:/gpfs/bbp.cscs.ch/ssd/apps/bsd/2022-01-10/stage_externals/install_gcc-11.2.0-skylake/nvhpc-22.2-gou6qx/Linux_x86_64/22.2/compilers/lib/libacccuda.so (__pgi_uacc_cuda_launch3 + 0x1d96) [0x1e418]
=========     Host Frame:/gpfs/bbp.cscs.ch/ssd/apps/bsd/2022-01-10/stage_externals/install_gcc-11.2.0-skylake/nvhpc-22.2-gou6qx/Linux_x86_64/22.2/compilers/lib/libacccuda.so [0x224dc]
=========     Host Frame:/gpfs/bbp.cscs.ch/ssd/apps/bsd/2022-01-10/stage_externals/install_gcc-11.2.0-skylake/nvhpc-22.2-gou6qx/Linux_x86_64/22.2/compilers/lib/libacccuda.so (__pgi_uacc_cuda_launch + 0x13d) [0x2261b]
=========     Host Frame:/gpfs/bbp.cscs.ch/ssd/apps/bsd/2022-01-10/stage_externals/install_gcc-11.2.0-skylake/nvhpc-22.2-gou6qx/Linux_x86_64/22.2/compilers/lib/libacchost.so (__pgi_uacc_launch + 0x1f8) [0x48418]
=========     Host Frame:/gpfs/bbp.cscs.ch/home/kumbhar/workarena/systems/bbpv/repos/bbp/ringtest/x86_64/libcorenrnmech.so (_ZN10coreneuron23_net_buf_receive_ExpSynEPNS_9NrnThreadE + 0x55d) [0xc11d]
=========     Host Frame:/gpfs/bbp.cscs.ch/home/kumbhar/workarena/systems/bbpv/repos/bbp/nrn/build_gpu/install/lib/libcoreneuron.so (_ZN10coreneuron18nrn_deliver_eventsEPNS_9NrnThreadE + 0x95) [0x1ee855]
=========     Host Frame:/gpfs/bbp.cscs.ch/home/kumbhar/workarena/systems/bbpv/repos/bbp/nrn/build_gpu/install/lib/libcoreneuron.so (_ZN10coreneuron23nrn_fixed_step_lastpartEPNS_9NrnThreadE + 0x188) [0x2b17c8]
=========     Host Frame:/gpfs/bbp.cscs.ch/home/kumbhar/workarena/systems/bbpv/repos/bbp/nrn/build_gpu/install/lib/libcoreneuron.so [0x2b1618]
=========     Host Frame:/gpfs/bbp.cscs.ch/home/kumbhar/workarena/systems/bbpv/repos/bbp/nrn/build_gpu/install/lib/libcoreneuron.so [0x2af878]
=========     Host Frame:/gpfs/bbp.cscs.ch/home/kumbhar/workarena/systems/bbpv/repos/bbp/nrn/build_gpu/install/lib/libcoreneuron.so [0x2af826]
=========     Host Frame:/gpfs/bbp.cscs.ch/ssd/apps/bsd/2022-01-10/stage_externals/install_gcc-11.2.0-skylake/nvhpc-22.2-gou6qx/Linux_x86_64/22.2/compilers/lib/libnvomp.so [0x1b8f5]
=========     Host Frame:/gpfs/bbp.cscs.ch/ssd/apps/bsd/2022-01-10/stage_externals/install_gcc-11.2.0-skylake/nvhpc-22.2-gou6qx/Linux_x86_64/22.2/compilers/lib/libnvomp.so [0x4bf49]
=========     Host Frame:/gpfs/bbp.cscs.ch/ssd/apps/bsd/2022-01-10/stage_externals/install_gcc-11.2.0-skylake/nvhpc-22.2-gou6qx/Linux_x86_64/22.2/compilers/lib/libnvomp.so [0x3e542]
=========     Host Frame:/gpfs/bbp.cscs.ch/ssd/apps/bsd/2022-01-10/stage_externals/install_gcc-11.2.0-skylake/nvhpc-22.2-gou6qx/Linux_x86_64/22.2/compilers/lib/libnvomp.so [0x3e453]
=========     Host Frame:/gpfs/bbp.cscs.ch/ssd/apps/bsd/2022-01-10/stage_externals/install_gcc-11.2.0-skylake/nvhpc-22.2-gou6qx/Linux_x86_64/22.2/compilers/lib/libnvomp.so [0x1b1b3]
=========     Host Frame:/gpfs/bbp.cscs.ch/ssd/apps/bsd/2022-01-10/stage_externals/install_gcc-11.2.0-skylake/nvhpc-22.2-gou6qx/Linux_x86_64/22.2/compilers/lib/libnvomp.so [0x18456]
=========     Host Frame:/gpfs/bbp.cscs.ch/ssd/apps/bsd/2022-01-10/stage_externals/install_gcc-11.2.0-skylake/nvhpc-22.2-gou6qx/Linux_x86_64/22.2/compilers/lib/libnvomp.so (__kmpc_fork_call + 0x9e) [0x3291e]
=========     Host Frame:/gpfs/bbp.cscs.ch/home/kumbhar/workarena/systems/bbpv/repos/bbp/nrn/build_gpu/install/lib/libcoreneuron.so (_ZN10coreneuron28nrn_fixed_step_group_minimalEi + 0x145) [0x2af685]
=========     Host Frame:/gpfs/bbp.cscs.ch/home/kumbhar/workarena/systems/bbpv/repos/bbp/nrn/build_gpu/install/lib/libcoreneuron.so (_ZN10coreneuron17ncs2nrn_integrateEd + 0x41) [0x1f7f41]
=========     Host Frame:/gpfs/bbp.cscs.ch/home/kumbhar/workarena/systems/bbpv/repos/bbp/nrn/build_gpu/install/lib/libcoreneuron.so (_ZN10coreneuron16BBS_netpar_solveEd + 0xe9) [0x215929]
=========     Host Frame:/gpfs/bbp.cscs.ch/home/kumbhar/workarena/systems/bbpv/repos/bbp/nrn/build_gpu/install/lib/libcoreneuron.so (run_solve_core + 0x671) [0xbb7b1]
=========     Host Frame:/gpfs/bbp.cscs.ch/home/kumbhar/workarena/systems/bbpv/repos/bbp/ringtest/x86_64/libcorenrnmech.so (corenrn_embedded_run + 0x8e) [0x75ce]
=========     Host Frame:/gpfs/bbp.cscs.ch/home/kumbhar/workarena/systems/bbpv/repos/bbp/nrn/build_gpu/install/lib/libnrniv.so (_Z14nrncore_psolvedi + 0x229) [0x1e6469]
=========     Host Frame:/gpfs/bbp.cscs.ch/home/kumbhar/workarena/systems/bbpv/repos/bbp/nrn/build_gpu/install/lib/libnrniv.so [0x20fcd1]
=========     Host Frame:/gpfs/bbp.cscs.ch/home/kumbhar/workarena/systems/bbpv/repos/bbp/nrn/build_gpu/install/lib/libnrniv.so (_Z20hoc_object_componentv + 0xc4e) [0x2aa30e]
=========     Host Frame:/gpfs/bbp.cscs.ch/home/kumbhar/workarena/systems/bbpv/repos/bbp/nrn/build_gpu/install/lib/libnrniv.so [0x4252fb]
=========     Host Frame:/gpfs/bbp.cscs.ch/home/kumbhar/workarena/systems/bbpv/repos/bbp/nrn/build_gpu/install/lib/libnrniv.so (_ZN6OcJump7fpycallEPFPvS0_S0_ES0_S0_ + 0x22f) [0x2140af]
=========     Host Frame:/gpfs/bbp.cscs.ch/home/kumbhar/workarena/systems/bbpv/repos/bbp/nrn/build_gpu/install/lib/libnrniv.so [0x4259b2]
=========     Host Frame:/gpfs/bbp.cscs.ch/ssd/apps/bsd/2022-01-10/stage_externals/install_gcc-11.2.0-skylake/python-3.9.7-yj5alh/lib/libpython3.9.so.1.0 (_PyObject_MakeTpCall + 0x88) [0xb6788]
=========     Host Frame:/gpfs/bbp.cscs.ch/ssd/apps/bsd/2022-01-10/stage_externals/install_gcc-11.2.0-skylake/python-3.9.7-yj5alh/lib/libpython3.9.so.1.0 (_PyEval_EvalFrameDefault + 0x9e38) [0x74958]
=========     Host Frame:/gpfs/bbp.cscs.ch/ssd/apps/bsd/2022-01-10/stage_externals/install_gcc-11.2.0-skylake/python-3.9.7-yj5alh/lib/libpython3.9.so.1.0 [0x69bc9]
=========     Host Frame:/gpfs/bbp.cscs.ch/ssd/apps/bsd/2022-01-10/stage_externals/install_gcc-11.2.0-skylake/python-3.9.7-yj5alh/lib/libpython3.9.so.1.0 (_PyEval_EvalFrameDefault + 0x6928) [0x71448]
=========     Host Frame:/gpfs/bbp.cscs.ch/ssd/apps/bsd/2022-01-10/stage_externals/install_gcc-11.2.0-skylake/python-3.9.7-yj5alh/lib/libpython3.9.so.1.0 [0x19d4ac]
=========     Host Frame:/gpfs/bbp.cscs.ch/ssd/apps/bsd/2022-01-10/stage_externals/install_gcc-11.2.0-skylake/python-3.9.7-yj5alh/lib/libpython3.9.so.1.0 (_PyEval_EvalCodeWithName + 0x4e) [0x19d9be]
=========     Host Frame:/gpfs/bbp.cscs.ch/ssd/apps/bsd/2022-01-10/stage_externals/install_gcc-11.2.0-skylake/python-3.9.7-yj5alh/lib/libpython3.9.so.1.0 (PyEval_EvalCodeEx + 0x3b) [0x19da0b]
=========     Host Frame:/gpfs/bbp.cscs.ch/ssd/apps/bsd/2022-01-10/stage_externals/install_gcc-11.2.0-skylake/python-3.9.7-yj5alh/lib/libpython3.9.so.1.0 (PyEval_EvalCode + 0x1b) [0x19da3b]
=========     Host Frame:/gpfs/bbp.cscs.ch/ssd/apps/bsd/2022-01-10/stage_externals/install_gcc-11.2.0-skylake/python-3.9.7-yj5alh/lib/libpython3.9.so.1.0 [0x1dd59e]
=========     Host Frame:/gpfs/bbp.cscs.ch/ssd/apps/bsd/2022-01-10/stage_externals/install_gcc-11.2.0-skylake/python-3.9.7-yj5alh/lib/libpython3.9.so.1.0 (PyRun_SimpleFileExFlags + 0x191) [0x1defb1]
=========     Host Frame:/gpfs/bbp.cscs.ch/ssd/apps/bsd/2022-01-10/stage_externals/install_gcc-11.2.0-skylake/python-3.9.7-yj5alh/lib/libpython3.9.so.1.0 (Py_RunMain + 0x757) [0x1fc227]
=========     Host Frame:/gpfs/bbp.cscs.ch/ssd/apps/bsd/2022-01-10/stage_externals/install_gcc-11.2.0-skylake/python-3.9.7-yj5alh/lib/libpython3.9.so.1.0 (Py_BytesMain + 0x47) [0x1fc6e7]
=========     Host Frame:/lib64/libc.so.6 (__libc_start_main + 0xf5) [0x22555]
=========     Host Frame:python [0x108e]
...

Need to check why net_buf_receive_ExpSyn would fail with python but not special...

EDIT

The issue comes from the following code:

static Memb_list* copy_ml_to_device(const Memb_list* ml, int type) {
    // As we never run code for artificial cell inside GPU we don't copy it.
    int is_art = corenrn.get_is_artificial()[type];
    if (is_art) {
        return nullptr;
    }

    auto d_ml = cnrn_target_copyin(ml);

    int n = ml->nodecount;
    int szp = corenrn.get_prop_param_size()[type];
    int szdp = corenrn.get_prop_dparam_size()[type];

    double* dptr = cnrn_target_deviceptr(ml->data);

Here dptr is supposed to be the device pointer for ml->data (which is a pointer inside the large nt->_data).

  • In the case of special as the executable, dptr is the expected device pointer
  • In the case of python as the executable, dptr is a host pointer, the same as ml->data 🙄 (to be investigated later...)
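
One way to narrow this down (a debugging sketch; it assumes cnrn_target_deviceptr is essentially a wrapper over the OpenACC runtime's acc_deviceptr(), and check_device_mapping is a hypothetical helper) is to ask the runtime directly whether it knows a device mapping for ml->data at that point:

#include <openacc.h>
#include <cstdio>

// Hypothetical debug helper: does the OpenACC runtime have a device mapping
// for this host pointer, and what does it return?
inline void check_device_mapping(const char* tag, void* host) {
    void* dev = acc_deviceptr(host);
    if (dev == nullptr) {
        std::printf("%s: no device mapping known to the runtime\n", tag);
    } else if (dev == host) {
        // matches the python case observed above: the "device" pointer is
        // just the host pointer handed back
        std::printf("%s: runtime returned the host pointer itself\n", tag);
    } else {
        std::printf("%s: host %p -> device %p\n", tag, host, dev);
    }
}

// usage inside copy_ml_to_device(): check_device_mapping("ml->data", ml->data);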

pramodk avatar Apr 03 '22 00:04 pramodk