Support for shared library build with PGI OpenACC
If we build a shared library with OpenACC, we get:
cuda_init_device(threadid=1, device 0) dindex=1, api_context=0x95d090
pgi_uacc_cuda_static hostptr=0x2aaaabedc990-0x2aaaabedc998 size=8 name=_ZN10coreneuron14_slist1_ExpSynE flags=0x100
Failing in Thread:1
call to cuModuleGetGlobal returned error 500: Not found
This is quite painful! All the details about the issue are in: https://forums.developer.nvidia.com/t/clarification-on-using-openacc-in-a-shared-library/136279
And a second example / question: https://forums.developer.nvidia.com/t/problem-with-openacc-with-variable-initialization/134612/2
Due to the lack of a fix in the PGI compiler, I am closing this issue. For now, we create a static library with nrnivmodl-core, and that gets linked into NEURON. NEURON users are now supposed to use `nrnivmodl -coreneuron mod`, which internally calls nrnivmodl-core and links the CoreNEURON mechanism library. This way we avoid the need for shared library support.
As this topic reappears often for practical use of NEURON GPU support (e.g. see the internal NEURON Slack discussion), I am opening this ticket to discuss & prototype strategies going forward. The overall goal is to enable shared library support by avoiding the current limitation arising from the combination of global variables + shared libraries in OpenACC.
- See this comment on the Nvidia forum. The linked example summarises the key issue, i.e. a shared library can't be loaded/linked if we use the `acc declare` clause (a minimal sketch of the pattern follows this list).
- Even if we separate CPU code vs GPU code completely and build all GPU/OpenACC code into a shared library, it won't help.
- I am wondering if we could get rid of `acc declare` and rely entirely on `enter data` clauses somehow.
- Current usage of global variables from the coreneuron side is minimal:

$ grep -r "acc(declare" coreneuron/coreneuron/*
coreneuron/nrnconf.h:nrn_pragma_acc(declare create(celsius))
coreneuron/nrnconf.h:nrn_pragma_acc(declare create(pi))
coreneuron/nrnconf.h:nrn_pragma_acc(declare create(secondorder))

and mod2c usage is:

✗ grep -r "acc declare" external/mod2c/src/*
external/mod2c/src/mod2c_core/parse1.y: "#pragma acc declare copyin(%s)\n"
external/mod2c/src/mod2c_core/parse1.y: Sprintf(buf, "#pragma acc declare copyin(%s)\n",
external/mod2c/src/mod2c_core/parse1.y: Sprintf(buf, "#pragma acc declare copyin(%s)\n",
external/mod2c/src/mod2c_core/nocpout.c: "#pragma acc declare copyin(_celsius_)\n", suffix);
external/mod2c/src/mod2c_core/nocpout.c: "#pragma acc declare copyin (_mechtype)\n"
external/mod2c/src/mod2c_core/nocpout.c: Sprintf(buf, "#pragma acc declare copyin (%s,%d)\n", s->name, s->araydim);
external/mod2c/src/mod2c_core/nocpout.c: Sprintf(buf, "#pragma acc declare copyin (%s)\n", s->name);
external/mod2c/src/mod2c_core/nocpout.c: "#pragma acc declare create(%s)\n",
external/mod2c/src/mod2c_core/nocpout.c: "#pragma acc declare copyin(%s)\n",
external/mod2c/src/mod2c_core/kinetic.c: "#pragma acc declare create(_slist%d)\n"
external/mod2c/src/mod2c_core/kinetic.c: "#pragma acc declare create(_dlist%d)\n"
external/mod2c/src/mod2c_core/deriv.c: "#pragma acc declare create(_slist%d)\n"
external/mod2c/src/mod2c_core/deriv.c: "#pragma acc declare create(_dlist%d)\n"
external/mod2c/src/mod2c_core/simultan.c: "#pragma acc declare create(_slist%d)\n"

Note that mod2c global variable usage in the generated `.cpp` file is at file scope, so we might be able to get away with just `enter data` copying?
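As a point of reference, here is a minimal sketch (my own illustration, not actual CoreNEURON code) of the problematic pattern: a file-scope global tied to the device with `acc declare`, like `celsius` in nrnconf.h above. When such code is built into a shared library, resolving the global's device symbol at load time is where `cuModuleGetGlobal` fails with error 500:

// declare_global.cpp -- hypothetical reproducer of the `acc declare` pattern
// that is fine in a static build but breaks when built into a .so
static double celsius;
#pragma acc declare create(celsius)

void set_celsius(double value) {
    celsius = value;
    // keep the device copy of the declare'd global in sync
    #pragma acc update device(celsius)
}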
One of the examples I am fiddling with in order to find out how far we have to go:
$ cat main.cpp
#include <dlfcn.h>
#include <iostream>
int main() {
    // vanilla thing without any error checking
    void *handle = dlopen("./test.so", RTLD_LAZY);
    typedef void (*init_t)(double);
    typedef void (*sample_t)();
    init_t init = (init_t)dlsym(handle, "init");
    sample_t sample = (sample_t)dlsym(handle, "sample");
    init(5.5);
    sample();
    dlclose(handle);
    return 0;
}
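Since the loader above deliberately skips error checking, here is a variant with minimal checks (my addition, same dlopen/dlsym calls) for anyone reproducing this:

// main_checked.cpp -- same loader as above, with basic error reporting
#include <dlfcn.h>
#include <cstdio>

int main() {
    void *handle = dlopen("./test.so", RTLD_LAZY);
    if (!handle) {
        std::fprintf(stderr, "dlopen failed: %s\n", dlerror());
        return 1;
    }
    typedef void (*init_t)(double);
    typedef void (*sample_t)();
    init_t init = (init_t)dlsym(handle, "init");
    sample_t sample = (sample_t)dlsym(handle, "sample");
    if (!init || !sample) {
        std::fprintf(stderr, "dlsym failed: %s\n", dlerror());
        dlclose(handle);
        return 1;
    }
    init(5.5);
    sample();
    dlclose(handle);
    return 0;
}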
$ cat test.cpp
#include <stdio.h>
#include <stdlib.h>
// these two global variables are now wrapped into a struct
struct Foo {
    float my_data[3];
    double my_celsius = 1;
};
// we have an additional pointer X because we can't say `acc ... present(global_data)`
Foo global_data;
Foo *X;
// these macros allow easy code generation / redirection to the pointer variable
#define my_data X->my_data
#define my_celsius X->my_celsius
// setup function we need to have
extern "C" void init(double val) {
X = &global_data;
my_data[0] = 1.1;
my_data[1] = 2.1;
my_data[2] = 3.1;
my_celsius = val;
#pragma acc enter data copyin(X [0:1])
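    // note (added comment): this host-side write happens after the copyin,
    // so the device copy of my_celsius should still hold `val` (5.5)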
    my_celsius = 2;
}
// functions now explicitly need to receive Foo *X; this will be a change in the prototype declaration macro
#pragma acc routine seq
inline double foo(Foo *X) {
    return my_celsius + my_data[0] + my_data[1] + my_data[2];
}
// just a sample kernel
extern "C" void sample() {
printf("CPU: my_celsius = %lf \n", my_celsius);
#pragma acc kernels present(X)
{
double val = foo(X);
printf("GPU: val = %lf , my_celsius = %lf \n", val, my_celsius);
}
}
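One consequence of the struct + pointer wrapping above (my note): after the `enter data copyin(X[0:1])`, the device holds a snapshot of `*X`, so later host-side writes such as the `my_celsius = 2;` stay host-only. A sketch of how such changes could be pushed to the device, via a hypothetical `sync_globals()` helper added to test.cpp:

// added sketch: refresh the device snapshot of the wrapped globals
extern "C" void sync_globals() {
    #pragma acc update device(X[0:1])
}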
// with or without nordc (no relocatable device code) doesn't matter now
$ cat build.sh
rm -rf a.out test.so *.o
# Build 1
pgc++ -acc -ta=tesla:nordc -Minfo test.cpp -c -fPIC
pgc++ -acc -ta=tesla:nordc -shared -o test.so test.o
pgc++ -acc -ta=tesla:nordc main.cpp
./a.out
rm -f a.out test.so *.o
# Build 2
pgc++ -acc -Minfo test.cpp -c -fPIC
pgc++ -acc -shared -o test.so test.o
pgc++ -acc main.cpp
./a.out
For future reference: with some ongoing patches to enable the shared library, I end up with a wall of the following errors:
__fatbinwrap_98_tmpxft_00028c08_00000000_8_...
...
And after spending significant time, this NVIDIA forum thread came to the rescue: https://forums.developer.nvidia.com/t/separate-compilation-of-mixed-cuda-openacc-code/192701 !! 🙄
The prototype (very much WIP) looks promising and is able to run ringtest via Python with shared libraries:
$ ls -lrt x86_64/
total 1162
-rw-r--r-- 1 kumbhar bbp 357 Apr 3 00:52 makemod2c_inc
-rw-r--r-- 1 kumbhar bbp 421 Apr 3 00:52 mod_func.cpp
drwxr-xr-x 4 kumbhar bbp 4096 Apr 3 00:52 corenrn
-rwxr-xr-x 1 kumbhar bbp 1005256 Apr 3 00:53 libcorenrnmech.so
-rwxr-xr-x 1 kumbhar bbp 24088 Apr 3 00:53 special-core
-rw-r--r-- 1 kumbhar bbp 9638 Apr 3 00:53 halfgap.c
-rw-r--r-- 1 kumbhar bbp 5512 Apr 3 00:53 mod_func.o
-rw-r--r-- 1 kumbhar bbp 32184 Apr 3 00:53 halfgap.o
-rwxr-xr-x 1 kumbhar bbp 36576 Apr 3 00:53 libnrnmech.so
-rw-r--r-- 1 kumbhar bbp 16272 Apr 3 00:53 nrnmain.o
-rwxr-xr-x 1 kumbhar bbp 34048 Apr 3 00:53 special
kumbhar@ldir01u09:~/workarena/systems/bbpv/repos/bbp/ringtest$ python ringtest.py -coreneuron -gpu -permute 1 -tstop 1
numprocs=1
[10, 20] [1, 1]
nring=16
cell per ring=8
ncell_per_type=8
ntype=16
0.02s created rings
0.00999999s initialized
num_mpi=1
num_omp_thread=1
Info : 4 GPUs shared by 1 ranks per node
Duke, Yale, and the BlueBrain Project -- Copyright 1984-2020
Version : 1.0 b938894 (2022-04-03 00:44:34 +0200)
Additional mechanisms from files
exp2syn.mod expsyn.mod halfgap.mod hh.mod netstim.mod passive.mod pattern.mod stim.mod svclmp.mod
Memory (MBs) : After mk_mech : Max 202.6094, Min 202.6094, Avg 202.6094
GPU Memory (MBs) : Used = 309.000000, Free = 32201.500000, Total = 32510.500000
Memory (MBs) : After MPI_Init : Max 202.6094, Min 202.6094, Avg 202.6094
GPU Memory (MBs) : Used = 309.000000, Free = 32201.500000, Total = 32510.500000
Memory (MBs) : Before nrn_setup : Max 202.6094, Min 202.6094, Avg 202.6094
GPU Memory (MBs) : Used = 309.000000, Free = 32201.500000, Total = 32510.500000
Setup Done : 0.00 seconds
Model size : 1.09 MB
Memory (MBs) : After nrn_setup : Max 202.9258, Min 202.9258, Avg 202.9258
GPU Memory (MBs) : Used = 309.000000, Free = 32201.500000, Total = 32510.500000
GENERAL PARAMETERS
--mpi=true
--mpi-lib=
--gpu=true
--dt=0.025
--tstop=1
GPU
--nwarp=65536
--cell-permute=1
--cuda-interface=false
INPUT PARAMETERS
--voltage=1000
--seed=-1
--datpath=.
--filesdat=files.dat
--pattern=
--report-conf=
--restore=
PARALLEL COMPUTATION PARAMETERS
--threading=false
--skip_mpi_finalize=true
SPIKE EXCHANGE
--ms_phases=2
--ms_subintervals=2
--multisend=false
--spk_compress=0
--binqueue=false
CONFIGURATION
--spikebuf=100000
--prcellgid=-1
--forwardskip=0
--celsius=6.3
--mindelay=10
--report-buffer-size=4
OUTPUT PARAMETERS
--dt_io=0.1
--outpath=.
--checkpoint=
Start time (t) = 0
Memory (MBs) : After mk_spikevec_buffer : Max 202.9258, Min 202.9258, Avg 202.9258
GPU Memory (MBs) : Used = 309.000000, Free = 32201.500000, Total = 32510.500000
Memory (MBs) : After nrn_finitialize : Max 267.3008, Min 267.3008, Avg 267.3008
GPU Memory (MBs) : Used = 311.000000, Free = 32199.500000, Total = 32510.500000
psolve |=========================================================| t: 1.00 ETA: 0h00m00s
Solver Time : 0.0157009
Simulation Statistics
Number of cells: 128
Number of compartments: 4576
Number of presyns: 144
Number of input presyns: 0
Number of synapses: 144
Number of point processes: 272
Number of transfer sources: 0
Number of transfer targets: 0
Number of spikes: 0
Number of spikes with non negative gid-s: 0
3.07s run
0.00999999s wrote 0 spikes
runtime=3.07 load_balance=0.0% avg_comp_time=0
spk_time max=0 min=0
gap_time max=0 min=0
Note that nvprof alone doesn't work:
$ nvprof python ringtest.py -coreneuron -gpu -permute 1 -tstop 1
numprocs=1
[10, 20] [1, 1]
nring=16
cell per ring=8
ncell_per_type=8
ntype=16
0.02s created rings
0.00999999s initialized
==33325== NVPROF is profiling process 33325, command: python ringtest.py -coreneuron -gpu -permute 1 -tstop 1
num_mpi=1
num_omp_thread=1
Missing definition of the OpenACC API routine/s in the OpenACC library linked to the application. To work around this issue either force the inclusion of all the OpenACC symbols in the binary or link the OpenACC library dynamically.
But this seems to be an upstream issue with nvprof, as reported in https://github.com/illuhad/hipSYCL/issues/598 and https://github.com/JuliaGPU/CUDA.jl/issues/1283.
Using `nvprof --openacc-profiling off` works fine:
$ nvprof --openacc-profiling off python ringtest.py -coreneuron -gpu -permute 1 -tstop 1
numprocs=1
[10, 20] [1, 1]
nring=16
cell per ring=8
ncell_per_type=8
ntype=16
0.02s created rings
0.00999999s initialized
==34153== NVPROF is profiling process 34153, command: python ringtest.py -coreneuron -gpu -permute 1 -tstop 1
num_mpi=1
num_omp_thread=1
Info : 4 GPUs shared by 1 ranks per node
Duke, Yale, and the BlueBrain Project -- Copyright 1984-2020
Version : 1.0 b938894 (2022-04-03 00:44:34 +0200)
Additional mechanisms from files
exp2syn.mod expsyn.mod halfgap.mod hh.mod netstim.mod passive.mod pattern.mod stim.mod svclmp.mod
Memory (MBs) : After mk_mech : Max 234.3711, Min 234.3711, Avg 234.3711
GPU Memory (MBs) : Used = 309.000000, Free = 32201.500000, Total = 32510.500000
Memory (MBs) : After MPI_Init : Max 234.4766, Min 234.4766, Avg 234.4766
GPU Memory (MBs) : Used = 309.000000, Free = 32201.500000, Total = 32510.500000
Memory (MBs) : Before nrn_setup : Max 234.5117, Min 234.5117, Avg 234.5117
GPU Memory (MBs) : Used = 309.000000, Free = 32201.500000, Total = 32510.500000
Setup Done : 0.00 seconds
Model size : 1.09 MB
Memory (MBs) : After nrn_setup : Max 234.5117, Min 234.5117, Avg 234.5117
GPU Memory (MBs) : Used = 309.000000, Free = 32201.500000, Total = 32510.500000
GENERAL PARAMETERS
--mpi=true
--mpi-lib=
--gpu=true
--dt=0.025
--tstop=1
GPU
--nwarp=65536
--cell-permute=1
--cuda-interface=false
INPUT PARAMETERS
--voltage=1000
--seed=-1
--datpath=.
--filesdat=files.dat
--pattern=
--report-conf=
--restore=
PARALLEL COMPUTATION PARAMETERS
--threading=false
--skip_mpi_finalize=true
SPIKE EXCHANGE
--ms_phases=2
--ms_subintervals=2
--multisend=false
--spk_compress=0
--binqueue=false
CONFIGURATION
--spikebuf=100000
--prcellgid=-1
--forwardskip=0
--celsius=6.3
--mindelay=10
--report-buffer-size=4
OUTPUT PARAMETERS
--dt_io=0.1
--outpath=.
--checkpoint=
Start time (t) = 0
Memory (MBs) : After mk_spikevec_buffer : Max 234.5117, Min 234.5117, Avg 234.5117
GPU Memory (MBs) : Used = 309.000000, Free = 32201.500000, Total = 32510.500000
Memory (MBs) : After nrn_finitialize : Max 299.1445, Min 299.1445, Avg 299.1445
GPU Memory (MBs) : Used = 311.000000, Free = 32199.500000, Total = 32510.500000
psolve |=========================================================| t: 1.00 ETA: 0h00m00s
Solver Time : 0.0206989
Simulation Statistics
Number of cells: 128
Number of compartments: 4576
Number of presyns: 144
Number of input presyns: 0
Number of synapses: 144
Number of point processes: 272
Number of transfer sources: 0
Number of transfer targets: 0
Number of spikes: 0
Number of spikes with non negative gid-s: 0
2.53s run
0s wrote 0 spikes
runtime=2.53 load_balance=0.0% avg_comp_time=0
spk_time max=0 min=0
gap_time max=0 min=0
==34153== Profiling application: python ringtest.py -coreneuron -gpu -permute 1 -tstop 1
==34153== Profiling result:
Type Time(%) Time Calls Avg Min Max Name
GPU activities: 46.19% 2.2915ms 40 57.288us 56.383us 57.983us coreneuron::solve_interleaved1_653_gpu(int)
6.64% 329.47us 40 8.2360us 8.0950us 10.976us coreneuron::_nrn_state__hh_593_gpu(coreneuron::NrnThread*, coreneuron::Memb_list*, int)
5.56% 275.77us 40 6.8940us 6.7830us 7.3600us coreneuron::_nrn_cur__hh_537_gpu(coreneuron::NrnThread*, coreneuron::Memb_list*, int)
5.20% 258.01us 80 3.2250us 3.1350us 3.7750us coreneuron::nrn_cur_ion_265_gpu(coreneuron::NrnThread*, coreneuron::Memb_list*, int)
3.93% 194.91us 40 4.8720us 4.7990us 6.1120us coreneuron::NetCvode::check_thresh_536_gpu(coreneuron::NrnThread*)
3.63% 180.19us 40 4.5040us 4.4470us 4.9920us coreneuron::_nrn_cur__ExpSyn_507_gpu(coreneuron::NrnThread*, coreneuron::Memb_list*, int)
3.18% 157.63us 40 3.9400us 3.9030us 4.2880us coreneuron::_nrn_cur__pas_320_gpu(coreneuron::NrnThread*, coreneuron::Memb_list*, int)
3.15% 156.25us 40 3.9060us 3.8390us 4.3840us coreneuron::_nrn_state__ExpSyn_584_gpu(coreneuron::NrnThread*, coreneuron::Memb_list*, int)
2.70% 133.82us 40 3.3450us 3.2950us 3.6160us _INTERNAL_119__gpfs_bbp_cscs_ch_home_kumbhar_workarena_systems_bbpv_repos_bbp_nrn_external_coreneuron_coreneuron_sim_treeset_core_cpp_45635c46::coreneuron::nrn_rhs_83_gpu(_INTERNAL_119__gpfs_bbp_cscs_ch_home_kumbhar_workarena_systems_bbpv_repos_bbp_nrn_external_coreneuron_coreneuron_sim_treeset_core_cpp_45635c46::coreneuron::NrnThread*)
2.68% 133.02us 40 3.3250us 3.2310us 3.7760us coreneuron::nrn_jacob_capacitance_75_gpu(coreneuron::NrnThread*, coreneuron::Memb_list*, int)
2.68% 132.93us 40 3.3230us 3.2640us 3.7120us coreneuron::nrn_cur_capacitance_115_gpu(coreneuron::NrnThread*, coreneuron::Memb_list*, int)
2.57% 127.32us 40 3.1830us 3.1350us 3.4880us _INTERNAL_119__gpfs_bbp_cscs_ch_home_kumbhar_workarena_systems_bbpv_repos_bbp_nrn_external_coreneuron_coreneuron_sim_treeset_core_cpp_45635c46::coreneuron::nrn_lhs_160_gpu(_INTERNAL_119__gpfs_bbp_cscs_ch_home_kumbhar_workarena_systems_bbpv_repos_bbp_nrn_external_coreneuron_coreneuron_sim_treeset_core_cpp_45635c46::coreneuron::NrnThread*)
2.49% 123.36us 40 3.0830us 3.0390us 3.5520us coreneuron::update_214_gpu(coreneuron::NrnThread*)
2.33% 115.52us 40 2.8870us 2.8470us 3.1360us coreneuron::_net_buf_receive_ExpSyn_343_gpu(coreneuron::NrnThread*)
2.32% 115.17us 40 2.8790us 2.8470us 3.2960us _INTERNAL_119__gpfs_bbp_cscs_ch_home_kumbhar_workarena_systems_bbpv_repos_bbp_nrn_external_coreneuron_coreneuron_sim_treeset_core_cpp_45635c46::coreneuron::nrn_rhs_33_gpu(_INTERNAL_119__gpfs_bbp_cscs_ch_home_kumbhar_workarena_systems_bbpv_repos_bbp_nrn_external_coreneuron_coreneuron_sim_treeset_core_cpp_45635c46::coreneuron::NrnThread*)
2.31% 114.72us 80 1.4330us 1.4070us 1.9840us [CUDA memcpy HtoD]
1.27% 63.199us 40 1.5790us 1.5350us 2.3680us [CUDA memcpy DtoH]
1.17% 57.948us 40 1.4480us 1.4070us 1.9840us [CUDA memset]
API calls: 68.05% 139.46ms 1 139.46ms 139.46ms 139.46ms cuDevicePrimaryCtxRetain
16.15% 33.095ms 1 33.095ms 33.095ms 33.095ms cuDevicePrimaryCtxRelease
6.00% 12.291ms 1 12.291ms 12.291ms 12.291ms cuMemFreeHost
4.84% 9.9172ms 2120 4.6770us 615ns 64.159us cuStreamSynchronize
1.42% 2.9034ms 4 725.86us 723.46us 730.89us cuDeviceTotalMem
1.29% 2.6414ms 640 4.1270us 3.3240us 39.978us cuLaunchKernel
0.72% 1.4740ms 9 163.78us 63.881us 550.87us cuModuleLoadDataEx
0.59% 1.2178ms 412 2.9550us 121ns 137.04us cuDeviceGetAttribute
0.23% 469.74us 1 469.74us 469.74us 469.74us cuPointerGetAttributes
0.18% 365.32us 280 1.3040us 687ns 5.1330us cuCtxSynchronize
0.16% 327.53us 1 327.53us 327.53us 327.53us cudaFree
0.14% 290.16us 80 3.6270us 2.7900us 31.853us cuMemcpyHtoDAsync
0.08% 162.03us 40 4.0500us 3.5820us 7.9390us cuMemcpyDtoHAsync
0.06% 119.75us 40 2.9930us 2.4350us 14.796us cuMemsetD32Async
0.06% 117.54us 4 29.385us 26.976us 35.111us cuDeviceGetName
0.02% 40.819us 1 40.819us 40.819us 40.819us cudaMemGetInfo
0.01% 13.925us 29 480ns 272ns 890ns cuModuleGetFunction
0.01% 11.820us 1 11.820us 11.820us 11.820us cuMemAlloc
0.01% 10.284us 4 2.5710us 925ns 5.9480us cuDeviceGetPCIBusId
0.00% 4.2480us 6 708ns 149ns 2.1110us cuCtxSetCurrent
0.00% 4.0800us 12 340ns 118ns 1.9800us cuDeviceGet
0.00% 2.9660us 3 988ns 578ns 1.6810us cuCtxGetCurrent
0.00% 1.9650us 1 1.9650us 1.9650us 1.9650us cuInit
0.00% 1.4000us 4 350ns 145ns 559ns cuDeviceGetCount
0.00% 1.1430us 4 285ns 184ns 581ns cuDeviceComputeCapability
0.00% 665ns 4 166ns 136ns 226ns cuDeviceGetUuid
0.00% 644ns 2 322ns 281ns 363ns cuCtxGetDevice
0.00% 208ns 1 208ns 208ns 208ns cuDriverGetVersion
One funny behaviour is that when spike exchange is happening we see a segfault if launched with `python`, but the same library/executable works if launched with `special`:
$ cuda-memcheck x86_64/special -python ringtest.py -coreneuron -gpu -permute 1 -tstop 1.03
....
Solver Time : 1.91957
Simulation Statistics
Number of cells: 128
Number of compartments: 4576
Number of presyns: 144
Number of input presyns: 0
Number of synapses: 144
Number of point processes: 272
Number of transfer sources: 0
Number of transfer targets: 0
Number of spikes: 0
Number of spikes with non negative gid-s: 0
1.97s run
0s wrote 0 spikes
runtime=1.97 load_balance=0.0% avg_comp_time=0
spk_time max=0 min=0
gap_time max=0 min=0
========= ERROR SUMMARY: 0 errors
# BUT
$ cuda-memcheck python ringtest.py -coreneuron -gpu -permute 1 -tstop 1.03
...
========= Host Frame:python [0x108e]
=========
========= Invalid __global__ write of size 8
========= at 0x00000420 in /gpfs/bbp.cscs.ch/home/kumbhar/workarena/systems/bbpv/repos/bbp/ringtest/x86_64/corenrn/mod2c/expsyn.cpp:355:coreneuron::_net_buf_receive_ExpSyn_345_gpu(coreneuron::NrnThread*)
========= by thread (14,0,0) in block (0,0,0)
========= Address 0x05e94f38 is out of bounds
========= Device Frame:/gpfs/bbp.cscs.ch/home/kumbhar/workarena/systems/bbpv/repos/bbp/ringtest/x86_64/corenrn/mod2c/expsyn.cpp:355:coreneuron::_net_buf_receive_ExpSyn_345_gpu(coreneuron::NrnThread*) (coreneuron::_net_buf_receive_ExpSyn_345_gpu(coreneuron::NrnThread*) : 0x420)
========= Saved host backtrace up to driver entry point at kernel launch time
========= Host Frame:/lib64/libcuda.so.1 [0x20d6ea]
========= Host Frame:/gpfs/bbp.cscs.ch/ssd/apps/bsd/2022-01-10/stage_externals/install_gcc-11.2.0-skylake/nvhpc-22.2-gou6qx/Linux_x86_64/22.2/compilers/lib/libacccuda.so (__pgi_uacc_cuda_launch3 + 0x1d96) [0x1e418]
========= Host Frame:/gpfs/bbp.cscs.ch/ssd/apps/bsd/2022-01-10/stage_externals/install_gcc-11.2.0-skylake/nvhpc-22.2-gou6qx/Linux_x86_64/22.2/compilers/lib/libacccuda.so [0x224dc]
========= Host Frame:/gpfs/bbp.cscs.ch/ssd/apps/bsd/2022-01-10/stage_externals/install_gcc-11.2.0-skylake/nvhpc-22.2-gou6qx/Linux_x86_64/22.2/compilers/lib/libacccuda.so (__pgi_uacc_cuda_launch + 0x13d) [0x2261b]
========= Host Frame:/gpfs/bbp.cscs.ch/ssd/apps/bsd/2022-01-10/stage_externals/install_gcc-11.2.0-skylake/nvhpc-22.2-gou6qx/Linux_x86_64/22.2/compilers/lib/libacchost.so (__pgi_uacc_launch + 0x1f8) [0x48418]
========= Host Frame:/gpfs/bbp.cscs.ch/home/kumbhar/workarena/systems/bbpv/repos/bbp/ringtest/x86_64/libcorenrnmech.so (_ZN10coreneuron23_net_buf_receive_ExpSynEPNS_9NrnThreadE + 0x55d) [0xc11d]
========= Host Frame:/gpfs/bbp.cscs.ch/home/kumbhar/workarena/systems/bbpv/repos/bbp/nrn/build_gpu/install/lib/libcoreneuron.so (_ZN10coreneuron18nrn_deliver_eventsEPNS_9NrnThreadE + 0x95) [0x1ee855]
========= Host Frame:/gpfs/bbp.cscs.ch/home/kumbhar/workarena/systems/bbpv/repos/bbp/nrn/build_gpu/install/lib/libcoreneuron.so (_ZN10coreneuron23nrn_fixed_step_lastpartEPNS_9NrnThreadE + 0x188) [0x2b17c8]
========= Host Frame:/gpfs/bbp.cscs.ch/home/kumbhar/workarena/systems/bbpv/repos/bbp/nrn/build_gpu/install/lib/libcoreneuron.so [0x2b1618]
========= Host Frame:/gpfs/bbp.cscs.ch/home/kumbhar/workarena/systems/bbpv/repos/bbp/nrn/build_gpu/install/lib/libcoreneuron.so [0x2af878]
========= Host Frame:/gpfs/bbp.cscs.ch/home/kumbhar/workarena/systems/bbpv/repos/bbp/nrn/build_gpu/install/lib/libcoreneuron.so [0x2af826]
========= Host Frame:/gpfs/bbp.cscs.ch/ssd/apps/bsd/2022-01-10/stage_externals/install_gcc-11.2.0-skylake/nvhpc-22.2-gou6qx/Linux_x86_64/22.2/compilers/lib/libnvomp.so [0x1b8f5]
========= Host Frame:/gpfs/bbp.cscs.ch/ssd/apps/bsd/2022-01-10/stage_externals/install_gcc-11.2.0-skylake/nvhpc-22.2-gou6qx/Linux_x86_64/22.2/compilers/lib/libnvomp.so [0x4bf49]
========= Host Frame:/gpfs/bbp.cscs.ch/ssd/apps/bsd/2022-01-10/stage_externals/install_gcc-11.2.0-skylake/nvhpc-22.2-gou6qx/Linux_x86_64/22.2/compilers/lib/libnvomp.so [0x3e542]
========= Host Frame:/gpfs/bbp.cscs.ch/ssd/apps/bsd/2022-01-10/stage_externals/install_gcc-11.2.0-skylake/nvhpc-22.2-gou6qx/Linux_x86_64/22.2/compilers/lib/libnvomp.so [0x3e453]
========= Host Frame:/gpfs/bbp.cscs.ch/ssd/apps/bsd/2022-01-10/stage_externals/install_gcc-11.2.0-skylake/nvhpc-22.2-gou6qx/Linux_x86_64/22.2/compilers/lib/libnvomp.so [0x1b1b3]
========= Host Frame:/gpfs/bbp.cscs.ch/ssd/apps/bsd/2022-01-10/stage_externals/install_gcc-11.2.0-skylake/nvhpc-22.2-gou6qx/Linux_x86_64/22.2/compilers/lib/libnvomp.so [0x18456]
========= Host Frame:/gpfs/bbp.cscs.ch/ssd/apps/bsd/2022-01-10/stage_externals/install_gcc-11.2.0-skylake/nvhpc-22.2-gou6qx/Linux_x86_64/22.2/compilers/lib/libnvomp.so (__kmpc_fork_call + 0x9e) [0x3291e]
========= Host Frame:/gpfs/bbp.cscs.ch/home/kumbhar/workarena/systems/bbpv/repos/bbp/nrn/build_gpu/install/lib/libcoreneuron.so (_ZN10coreneuron28nrn_fixed_step_group_minimalEi + 0x145) [0x2af685]
========= Host Frame:/gpfs/bbp.cscs.ch/home/kumbhar/workarena/systems/bbpv/repos/bbp/nrn/build_gpu/install/lib/libcoreneuron.so (_ZN10coreneuron17ncs2nrn_integrateEd + 0x41) [0x1f7f41]
========= Host Frame:/gpfs/bbp.cscs.ch/home/kumbhar/workarena/systems/bbpv/repos/bbp/nrn/build_gpu/install/lib/libcoreneuron.so (_ZN10coreneuron16BBS_netpar_solveEd + 0xe9) [0x215929]
========= Host Frame:/gpfs/bbp.cscs.ch/home/kumbhar/workarena/systems/bbpv/repos/bbp/nrn/build_gpu/install/lib/libcoreneuron.so (run_solve_core + 0x671) [0xbb7b1]
========= Host Frame:/gpfs/bbp.cscs.ch/home/kumbhar/workarena/systems/bbpv/repos/bbp/ringtest/x86_64/libcorenrnmech.so (corenrn_embedded_run + 0x8e) [0x75ce]
========= Host Frame:/gpfs/bbp.cscs.ch/home/kumbhar/workarena/systems/bbpv/repos/bbp/nrn/build_gpu/install/lib/libnrniv.so (_Z14nrncore_psolvedi + 0x229) [0x1e6469]
========= Host Frame:/gpfs/bbp.cscs.ch/home/kumbhar/workarena/systems/bbpv/repos/bbp/nrn/build_gpu/install/lib/libnrniv.so [0x20fcd1]
========= Host Frame:/gpfs/bbp.cscs.ch/home/kumbhar/workarena/systems/bbpv/repos/bbp/nrn/build_gpu/install/lib/libnrniv.so (_Z20hoc_object_componentv + 0xc4e) [0x2aa30e]
========= Host Frame:/gpfs/bbp.cscs.ch/home/kumbhar/workarena/systems/bbpv/repos/bbp/nrn/build_gpu/install/lib/libnrniv.so [0x4252fb]
========= Host Frame:/gpfs/bbp.cscs.ch/home/kumbhar/workarena/systems/bbpv/repos/bbp/nrn/build_gpu/install/lib/libnrniv.so (_ZN6OcJump7fpycallEPFPvS0_S0_ES0_S0_ + 0x22f) [0x2140af]
========= Host Frame:/gpfs/bbp.cscs.ch/home/kumbhar/workarena/systems/bbpv/repos/bbp/nrn/build_gpu/install/lib/libnrniv.so [0x4259b2]
========= Host Frame:/gpfs/bbp.cscs.ch/ssd/apps/bsd/2022-01-10/stage_externals/install_gcc-11.2.0-skylake/python-3.9.7-yj5alh/lib/libpython3.9.so.1.0 (_PyObject_MakeTpCall + 0x88) [0xb6788]
========= Host Frame:/gpfs/bbp.cscs.ch/ssd/apps/bsd/2022-01-10/stage_externals/install_gcc-11.2.0-skylake/python-3.9.7-yj5alh/lib/libpython3.9.so.1.0 (_PyEval_EvalFrameDefault + 0x9e38) [0x74958]
========= Host Frame:/gpfs/bbp.cscs.ch/ssd/apps/bsd/2022-01-10/stage_externals/install_gcc-11.2.0-skylake/python-3.9.7-yj5alh/lib/libpython3.9.so.1.0 [0x69bc9]
========= Host Frame:/gpfs/bbp.cscs.ch/ssd/apps/bsd/2022-01-10/stage_externals/install_gcc-11.2.0-skylake/python-3.9.7-yj5alh/lib/libpython3.9.so.1.0 (_PyEval_EvalFrameDefault + 0x6928) [0x71448]
========= Host Frame:/gpfs/bbp.cscs.ch/ssd/apps/bsd/2022-01-10/stage_externals/install_gcc-11.2.0-skylake/python-3.9.7-yj5alh/lib/libpython3.9.so.1.0 [0x19d4ac]
========= Host Frame:/gpfs/bbp.cscs.ch/ssd/apps/bsd/2022-01-10/stage_externals/install_gcc-11.2.0-skylake/python-3.9.7-yj5alh/lib/libpython3.9.so.1.0 (_PyEval_EvalCodeWithName + 0x4e) [0x19d9be]
========= Host Frame:/gpfs/bbp.cscs.ch/ssd/apps/bsd/2022-01-10/stage_externals/install_gcc-11.2.0-skylake/python-3.9.7-yj5alh/lib/libpython3.9.so.1.0 (PyEval_EvalCodeEx + 0x3b) [0x19da0b]
========= Host Frame:/gpfs/bbp.cscs.ch/ssd/apps/bsd/2022-01-10/stage_externals/install_gcc-11.2.0-skylake/python-3.9.7-yj5alh/lib/libpython3.9.so.1.0 (PyEval_EvalCode + 0x1b) [0x19da3b]
========= Host Frame:/gpfs/bbp.cscs.ch/ssd/apps/bsd/2022-01-10/stage_externals/install_gcc-11.2.0-skylake/python-3.9.7-yj5alh/lib/libpython3.9.so.1.0 [0x1dd59e]
========= Host Frame:/gpfs/bbp.cscs.ch/ssd/apps/bsd/2022-01-10/stage_externals/install_gcc-11.2.0-skylake/python-3.9.7-yj5alh/lib/libpython3.9.so.1.0 (PyRun_SimpleFileExFlags + 0x191) [0x1defb1]
========= Host Frame:/gpfs/bbp.cscs.ch/ssd/apps/bsd/2022-01-10/stage_externals/install_gcc-11.2.0-skylake/python-3.9.7-yj5alh/lib/libpython3.9.so.1.0 (Py_RunMain + 0x757) [0x1fc227]
========= Host Frame:/gpfs/bbp.cscs.ch/ssd/apps/bsd/2022-01-10/stage_externals/install_gcc-11.2.0-skylake/python-3.9.7-yj5alh/lib/libpython3.9.so.1.0 (Py_BytesMain + 0x47) [0x1fc6e7]
========= Host Frame:/lib64/libc.so.6 (__libc_start_main + 0xf5) [0x22555]
========= Host Frame:python [0x108e]
...
Need to check why `net_buf_receive_ExpSyn` would fail with `python` but not `special`...
EDIT
The issue comes from the following code:
static Memb_list* copy_ml_to_device(const Memb_list* ml, int type) {
    // As we never run code for artificial cells on the GPU, we don't copy them.
    int is_art = corenrn.get_is_artificial()[type];
    if (is_art) {
        return nullptr;
    }
    auto d_ml = cnrn_target_copyin(ml);
    int n = ml->nodecount;
    int szp = corenrn.get_prop_param_size()[type];
    int szdp = corenrn.get_prop_dparam_size()[type];
    double* dptr = cnrn_target_deviceptr(ml->data);
The `dptr` is supposed to be the device pointer for `ml->data` (which is a pointer inside the large `nt->_data`).

- In case of `special` as the executable, `dptr` is the expected device pointer
- In case of `python` as the executable, `dptr` is a host pointer, the same as `ml->data` 🙄 (to be investigated later....)
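To narrow this down, a small diagnostic sketch (my own, assuming the standard OpenACC runtime API from <openacc.h>; `cnrn_target_deviceptr` is assumed to do a similar present-table lookup) that reports whether a host pointer has a device mapping:

// check_mapping.cpp -- hypothetical helper using the OpenACC runtime API
#include <openacc.h>
#include <cstdio>

// Report whether `host_ptr` has a device-side mapping in the present table.
void check_mapping(const char* label, double* host_ptr) {
    void* dev_ptr = acc_deviceptr(host_ptr);  // returns nullptr if not present
    if (dev_ptr == nullptr) {
        std::printf("%s: host %p has NO device mapping\n", label, (void*)host_ptr);
    } else {
        std::printf("%s: host %p -> device %p\n", label, (void*)host_ptr, dev_ptr);
    }
}

Calling something like `check_mapping("ml->data", ml->data)` just before the `cnrn_target_deviceptr` call above should show whether the present table differs between the `special` and `python` launch paths.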