NEURON integrated tests failing with CUDA Unified Memory enabled
Describe the issue
Some of the NEURON tests are failing on GPU when CUDA Unified Memory is enabled in CoreNEURON. More precisely, the tests that fail are:
The following tests FAILED:
18 - coreneuron_modtests::direct_py (Failed)
19 - coreneuron_modtests::direct_hoc (Failed)
20 - coreneuron_modtests::spikes_py (Failed)
21 - coreneuron_modtests::spikes_file_mode_py (Failed)
22 - coreneuron_modtests::datareturn_py (Failed)
25 - coreneuron_modtests::spikes_mpi_py (Failed)
26 - coreneuron_modtests::spikes_mpi_file_mode_py (Failed)
41 - testcorenrn_patstim::coreneuron_gpu_offline (Failed)
45 - testcorenrn_patstim::compare_results (Failed)
99 - testcorenrn_netstimdirect::direct (Failed)
100 - testcorenrn_netstimdirect::compare_results (Failed)
To Reproduce
Steps to reproduce the behavior:
git clone git@github.com:neuronsimulator/nrn.git
cd nrn
mkdir build_unified && cd build_unified
cmake .. -DCMAKE_INSTALL_PREFIX=./install -DNRN_ENABLE_INTERVIEWS=OFF -DNRN_ENABLE_RX3D=OFF -DNRN_ENABLE_CORENEURON=ON -DNRN_ENABLE_TESTS=ON -DCORENRN_ENABLE_GPU=ON -DCORENRN_ENABLE_CUDA_UNIFIED_MEMORY=ON -DCORENRN_ENABLE_OPENMP=OFF
make -j16
ctest --output-on-failure
Expected behavior
GPU tests should pass with CUDA Unified Memory enabled as well.
Logs
An example of a failing test (coreneuron_modtests::direct_py) when run with cuda-memcheck has the following output:
========= Invalid __global__ read of size 8
========= at 0x00000730 in /gpfs/bbp.cscs.ch/data/scratch/proj16/magkanar/psolve-direct/nrn_gpu/build_unified/test/nrnivmodl/8e220c327f2b8882adcf04884baa4209f37d0bbcef5677f046766f546d969ffd/x86_64/corenrn/mod2c/stim.cpp:410:coreneuron::_nrn_cur__IClamp_370_gpu(coreneuron::NrnThread*, coreneuron::Memb_list*, int)
========= by thread (0,0,0) in block (0,0,0)
========= Address 0x05121860 is out of bounds
========= Device Frame:/gpfs/bbp.cscs.ch/data/scratch/proj16/magkanar/psolve-direct/nrn_gpu/build_unified/test/nrnivmodl/8e220c327f2b8882adcf04884baa4209f37d0bbcef5677f046766f546d969ffd/x86_64/corenrn/mod2c/stim.cpp:410:coreneuron::_nrn_cur__IClamp_370_gpu(coreneuron::NrnThread*, coreneuron::Memb_list*, int) (coreneuron::_nrn_cur__IClamp_370_gpu(coreneuron::NrnThread*, coreneuron::Memb_list*, int) : 0x730)
========= Saved host backtrace up to driver entry point at kernel launch time
========= Host Frame:/lib64/libcuda.so (cuLaunchKernel + 0x34e) [0x2efa6e]
========= Host Frame:/gpfs/bbp.cscs.ch/ssd/apps/hpc/jenkins/deploy/externals/2021-01-06/linux-rhel7-x86_64/gcc-9.3.0/nvhpc-21.2-67d2qp/Linux_x86_64/21.2/compilers/lib/libaccdevice.so (__pgi_uacc_cuda_launch3 + 0x1d94) [0x1ca43]
========= Host Frame:/gpfs/bbp.cscs.ch/ssd/apps/hpc/jenkins/deploy/externals/2021-01-06/linux-rhel7-x86_64/gcc-9.3.0/nvhpc-21.2-67d2qp/Linux_x86_64/21.2/compilers/lib/libaccdevice.so [0x1d7a5]
========= Host Frame:/gpfs/bbp.cscs.ch/ssd/apps/hpc/jenkins/deploy/externals/2021-01-06/linux-rhel7-x86_64/gcc-9.3.0/nvhpc-21.2-67d2qp/Linux_x86_64/21.2/compilers/lib/libaccdevice.so (__pgi_uacc_cuda_launch + 0x13d) [0x1d8e4]
========= Host Frame:/gpfs/bbp.cscs.ch/ssd/apps/hpc/jenkins/deploy/externals/2021-01-06/linux-rhel7-x86_64/gcc-9.3.0/nvhpc-21.2-67d2qp/Linux_x86_64/21.2/compilers/lib/libacchost.so (__pgi_uacc_launch + 0x1f7) [0x463c0]
========= Host Frame:./x86_64/special (_ZN10coreneuron16_nrn_cur__IClampEPNS_9NrnThreadEPNS_9Memb_listEi + 0x89b) [0x5702b]
========= Host Frame:./x86_64/special [0x17f3bb]
========= Host Frame:./x86_64/special (_ZN10coreneuron25setup_tree_matrix_minimalEPNS_9NrnThreadE + 0xe) [0x1814ae]
Failing in Thread:1
call to cuLaunchKernel returned error 719: Launch failed (often invalid pointer dereference)
The corresponding lines in stim.cpp (the crash at line 410 dereferences _nt->nrn_fast_imem inside the OpenACC kernel):
409: #pragma acc atomic update
410: _nt->nrn_fast_imem->nrn_sav_rhs[_nd_idx] += _rhs;
411: #pragma acc atomic update
412: _nt->nrn_fast_imem->nrn_sav_d[_nd_idx] -= _g;
System (please complete the following information)
- OS: RedHat
- Compiler: NVHPC 21.2
- Version: master branch
- Backend: GPU
I think there are a few different issues here, but here are some observations from local testing of something related:
https://github.com/BlueBrain/CoreNeuron/blob/df95ceaf1f0bffd2000b1942ca2ba4211e7a74b0/coreneuron/sim/fast_imem.cpp#L24-L39
pairs ecalloc_align (which wraps cudaMallocManaged when CORENEURON_UNIFIED_MEMORY is set) with plain free -- we should use free_memory instead, which forwards to cudaFree when needed.
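A minimal sketch of what I mean, assuming the usual loop over nrn_threads and the NrnFastImem field names from the failing kernel above (the actual code in fast_imem.cpp may differ slightly):

```cpp
// Sketch only: buffers allocated with ecalloc_align must be released with
// free_memory, which forwards to cudaFree when CORENEURON_UNIFIED_MEMORY is
// enabled; calling plain free() on a cudaMallocManaged pointer is undefined.
void fast_imem_free() {
    for (auto* nt = nrn_threads; nt < nrn_threads + nrn_nthread; ++nt) {
        if (nt->nrn_fast_imem) {
            free_memory(nt->nrn_fast_imem->nrn_sav_rhs);  // was free(...)
            free_memory(nt->nrn_fast_imem->nrn_sav_d);    // was free(...)
            delete nt->nrn_fast_imem;  // see the MemoryManaged note below
            nt->nrn_fast_imem = nullptr;
        }
    }
}
```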
Probably both the NrnFastImem and TrajectoryRequests structs should inherit from MemoryManaged, or we should otherwise make sure they are allocated in unified memory in these builds.
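For illustration, this is roughly the shape I have in mind for NrnFastImem (field names taken from the failing kernel; MemoryManaged overloads operator new/delete so that, in GPU builds, the object itself ends up in unified memory):

```cpp
// Sketch: with MemoryManaged as a base class, operator new routes through
// cudaMallocManaged, so dereferencing _nt->nrn_fast_imem inside a device
// kernel (as stim.cpp:410 does) no longer touches host-only memory.
struct NrnFastImem: public MemoryManaged {
    double* nrn_sav_rhs = nullptr;
    double* nrn_sav_d = nullptr;
};

// Allocation sites can then simply use new/delete:
nt->nrn_fast_imem = new NrnFastImem;
```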
There is another issue with TrajectoryRequests::varrays: it is allocated by NEURON, but the code assumes a device version exists that is writeable from the device: https://github.com/BlueBrain/CoreNeuron/blob/df95ceaf1f0bffd2000b1942ca2ba4211e7a74b0/coreneuron/sim/fadvance_core.cpp#L301
In unified memory builds, we would need to somehow swap in a unified memory buffer here and copy it to NEURON's buffer as needed.
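A rough sketch of that idea, using raw CUDA calls rather than the real helpers, and with hypothetical local names (staging; bsize, vsize and tr only loosely follow the TrajectoryRequests bookkeeping in fadvance_core.cpp):

```cpp
// Sketch: let the device write trajectory samples into a unified-memory
// staging buffer, then copy them into the NEURON-owned varrays buffer on the
// host whenever results are handed back to NEURON.
#include <algorithm>
#include <cuda_runtime.h>

double* staging = nullptr;
cudaMallocManaged(&staging, bsize * sizeof(double));  // what ecalloc_align does in these builds

// ... device kernels write staging[0..vsize) instead of tr->varrays[i] ...

std::copy(staging, staging + vsize, tr->varrays[i]);  // flush into NEURON's buffer on the host
cudaFree(staging);
```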