NEURON integrated tests failing with CUDA Unified Memory enabled
Describe the issue
Some of the NEURON tests are failing on GPU when CUDA Unified Memory is enabled in CoreNEURON. More precisely, the tests that fail are:
The following tests FAILED:
18 - coreneuron_modtests::direct_py (Failed)
19 - coreneuron_modtests::direct_hoc (Failed)
20 - coreneuron_modtests::spikes_py (Failed)
21 - coreneuron_modtests::spikes_file_mode_py (Failed)
22 - coreneuron_modtests::datareturn_py (Failed)
25 - coreneuron_modtests::spikes_mpi_py (Failed)
26 - coreneuron_modtests::spikes_mpi_file_mode_py (Failed)
41 - testcorenrn_patstim::coreneuron_gpu_offline (Failed)
45 - testcorenrn_patstim::compare_results (Failed)
99 - testcorenrn_netstimdirect::direct (Failed)
100 - testcorenrn_netstimdirect::compare_results (Failed)
To Reproduce
Steps to reproduce the behavior:
git clone git@github.com:neuronsimulator/nrn.git
cd nrn
mkdir build_unified && cd build_unified
cmake .. -DCMAKE_INSTALL_PREFIX=./install -DNRN_ENABLE_INTERVIEWS=OFF -DNRN_ENABLE_RX3D=OFF -DNRN_ENABLE_CORENEURON=ON -DNRN_ENABLE_TESTS=ON -DCORENRN_ENABLE_GPU=ON -DCORENRN_ENABLE_CUDA_UNIFIED_MEMORY=ON -DCORENRN_ENABLE_OPENMP=OFF
make -j16
ctest --output-on-failure
Expected behavior
GPU tests should pass with CUDA Unified Memory enabled as well.
Logs
An example of a failing test (coreneuron_modtests::direct_py) when run with cuda-memcheck has the following output:
========= Invalid __global__ read of size 8
========= at 0x00000730 in /gpfs/bbp.cscs.ch/data/scratch/proj16/magkanar/psolve-direct/nrn_gpu/build_unified/test/nrnivmodl/8e220c327f2b8882adcf04884baa4209f37d0bbcef5677f046766f546d969ffd/x86_64/corenrn/mod2c/stim.cpp:410:coreneuron::_nrn_cur__IClamp_370_gpu(coreneuron::NrnThread*, coreneuron::Memb_list*, int)
========= by thread (0,0,0) in block (0,0,0)
========= Address 0x05121860 is out of bounds
========= Device Frame:/gpfs/bbp.cscs.ch/data/scratch/proj16/magkanar/psolve-direct/nrn_gpu/build_unified/test/nrnivmodl/8e220c327f2b8882adcf04884baa4209f37d0bbcef5677f046766f546d969ffd/x86_64/corenrn/mod2c/stim.cpp:410:coreneuron::_nrn_cur__IClamp_370_gpu(coreneuron::NrnThread*, coreneuron::Memb_list*, int) (coreneuron::_nrn_cur__IClamp_370_gpu(coreneuron::NrnThread*, coreneuron::Memb_list*, int) : 0x730)
========= Saved host backtrace up to driver entry point at kernel launch time
========= Host Frame:/lib64/libcuda.so (cuLaunchKernel + 0x34e) [0x2efa6e]
========= Host Frame:/gpfs/bbp.cscs.ch/ssd/apps/hpc/jenkins/deploy/externals/2021-01-06/linux-rhel7-x86_64/gcc-9.3.0/nvhpc-21.2-67d2qp/Linux_x86_64/21.2/compilers/lib/libaccdevice.so (__pgi_uacc_cuda_launch3 + 0x1d94) [0x1ca43]
========= Host Frame:/gpfs/bbp.cscs.ch/ssd/apps/hpc/jenkins/deploy/externals/2021-01-06/linux-rhel7-x86_64/gcc-9.3.0/nvhpc-21.2-67d2qp/Linux_x86_64/21.2/compilers/lib/libaccdevice.so [0x1d7a5]
========= Host Frame:/gpfs/bbp.cscs.ch/ssd/apps/hpc/jenkins/deploy/externals/2021-01-06/linux-rhel7-x86_64/gcc-9.3.0/nvhpc-21.2-67d2qp/Linux_x86_64/21.2/compilers/lib/libaccdevice.so (__pgi_uacc_cuda_launch + 0x13d) [0x1d8e4]
========= Host Frame:/gpfs/bbp.cscs.ch/ssd/apps/hpc/jenkins/deploy/externals/2021-01-06/linux-rhel7-x86_64/gcc-9.3.0/nvhpc-21.2-67d2qp/Linux_x86_64/21.2/compilers/lib/libacchost.so (__pgi_uacc_launch + 0x1f7) [0x463c0]
========= Host Frame:./x86_64/special (_ZN10coreneuron16_nrn_cur__IClampEPNS_9NrnThreadEPNS_9Memb_listEi + 0x89b) [0x5702b]
========= Host Frame:./x86_64/special [0x17f3bb]
========= Host Frame:./x86_64/special (_ZN10coreneuron25setup_tree_matrix_minimalEPNS_9NrnThreadE + 0xe) [0x1814ae]
Failing in Thread:1
call to cuLaunchKernel returned error 719: Launch failed (often invalid pointer dereference)
The corresponding lines in stim.cpp (the crash at line 410 dereferences _nt->nrn_fast_imem inside the OpenACC kernel):
409: #pragma acc atomic update
410: _nt->nrn_fast_imem->nrn_sav_rhs[_nd_idx] += _rhs;
411: #pragma acc atomic update
412: _nt->nrn_fast_imem->nrn_sav_d[_nd_idx] -= _g;
System (please complete the following information)
- OS: RedHat
- Compiler: NVHPC 21.2
- Version: master branch
- Backend: GPU
I think there are a few different issues here, but here are some observations from local testing of something related:
https://github.com/BlueBrain/CoreNeuron/blob/df95ceaf1f0bffd2000b1942ca2ba4211e7a74b0/coreneuron/sim/fast_imem.cpp#L24-L39
pairs ecalloc_align (which wraps cudaMallocManaged when CORENEURON_UNIFIED_MEMORY is set) with plain free -- we should use free_memory instead, which forwards to cudaFree when needed.
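A minimal sketch of what I mean, assuming the usual loop over nrn_threads and the NrnFastImem field names from the failing kernel above (the actual code in fast_imem.cpp may differ slightly):

```cpp
// Sketch only: buffers allocated with ecalloc_align must be released with
// free_memory, which forwards to cudaFree when CORENEURON_UNIFIED_MEMORY is
// enabled; calling plain free() on a cudaMallocManaged pointer is undefined.
void fast_imem_free() {
    for (auto* nt = nrn_threads; nt < nrn_threads + nrn_nthread; ++nt) {
        if (nt->nrn_fast_imem) {
            free_memory(nt->nrn_fast_imem->nrn_sav_rhs);  // was free(...)
            free_memory(nt->nrn_fast_imem->nrn_sav_d);    // was free(...)
            delete nt->nrn_fast_imem;  // see the MemoryManaged note below
            nt->nrn_fast_imem = nullptr;
        }
    }
}
```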
Probably both the NrnFastImem and TrajectoryRequests structs should inherit from MemoryManaged, or we should otherwise make sure they are allocated in unified memory in these builds.
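For illustration, this is roughly the shape I have in mind for NrnFastImem (field names taken from the failing kernel; MemoryManaged overloads operator new/delete so that, in GPU builds, the object itself ends up in unified memory):

```cpp
// Sketch: with MemoryManaged as a base class, operator new routes through
// cudaMallocManaged, so dereferencing _nt->nrn_fast_imem inside a device
// kernel (as stim.cpp:410 does) no longer touches host-only memory.
struct NrnFastImem: public MemoryManaged {
    double* nrn_sav_rhs = nullptr;
    double* nrn_sav_d = nullptr;
};

// Allocation sites can then simply use new/delete:
nt->nrn_fast_imem = new NrnFastImem;
```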
There is another issue with TrajectoryRequests::varrays: it is allocated by NEURON, but the code assumes a device version exists that is writeable from the device: https://github.com/BlueBrain/CoreNeuron/blob/df95ceaf1f0bffd2000b1942ca2ba4211e7a74b0/coreneuron/sim/fadvance_core.cpp#L301
In unified memory builds, we would need to somehow swap in a unified memory buffer here and copy it to NEURON's buffer as needed.
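A rough sketch of that idea, using raw CUDA calls rather than the real helpers, and with hypothetical local names (staging; bsize, vsize and tr only loosely follow the TrajectoryRequests bookkeeping in fadvance_core.cpp):

```cpp
// Sketch: let the device write trajectory samples into a unified-memory
// staging buffer, then copy them into the NEURON-owned varrays buffer on the
// host whenever results are handed back to NEURON.
#include <algorithm>
#include <cuda_runtime.h>

double* staging = nullptr;
cudaMallocManaged(&staging, bsize * sizeof(double));  // what ecalloc_align does in these builds

// ... device kernels write staging[0..vsize) instead of tr->varrays[i] ...

std::copy(staging, staging + vsize, tr->varrays[i]);  // flush into NEURON's buffer on the host
cudaFree(staging);
```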