illegal memory access, SyCL branch, CUDA 9.2, gcc 6.3.0
When I configure the SyCL branch as follows to test on a cluster with NVIDIA P100 GPUs
CXX=nvcc \
MPICXX=mpicxx \
CXXFLAGS="-ccbin=mpicxx -gencode=arch=compute_60,code=compute_60 -std=c++11" \
~/code/grid_sycl/configure \
--enable-precision=double \
--enable-simd=GPU \
--enable-accelerator=cuda \
--enable-comms=mpi \
--prefix=$(pwd)/install_dir
and attempt to run any of the benchmark executables on a P100, I encounter illegal memory accesses. Is that a known issue?
AcceleratorCudaInit: ========================
AcceleratorCudaInit: Device Number : 0
AcceleratorCudaInit: ========================
AcceleratorCudaInit: Device identifier: Tesla P100-PCIE-16GB
AcceleratorCudaInit: totalGlobalMem: 17071734784
AcceleratorCudaInit: managedMemory: 1
AcceleratorCudaInit: isMultiGpuBoard: 0
AcceleratorCudaInit: warpSize: 32
AcceleratorCudaInit: setting device to node rank
AcceleratorCudaInit: ================================================
SharedMemoryMpi: World communicator of size 1
SharedMemoryMpi: Node communicator of size 1
SharedMemoryMpi: SharedMemoryMPI.cc cudaMalloc 1073741824bytes at 0x7f9462000000 for comms buffers
[...]
Grid : Message : ================================================
Grid : Message : MPI is initialised and logging filters activated
Grid : Message : ================================================
Grid : Message : MemoryManager::Init() setting up
Grid : Message : MemoryManager::Init() cache pool for recent allocations: SMALL 32 LARGE 8
Grid : Message : MemoryManager::Init() Unified memory space
Grid : Message : MemoryManager::Init() Using cudaMallocManaged
Grid : Message : Requested 1073741824 byte stencil comms buffers
Grid : Message : 0.808829 s : Grid is setup to use 28 threads
Grid : Message : 0.808843 s : ====================================================================================================
Grid : Message : 0.808845 s : = Benchmarking fused AXPY bandwidth ; sizeof(Real) 8
Grid : Message : 0.808847 s : ====================================================================================================
Grid : Message : 0.808849 s : L bytes GB/s Gflop/s seconds
Grid : Message : 0.808853 s : ----------------------------------------------------------
Cuda error an illegal memory access was encountered
/qbigwork/bartek/code/grid_sycl/Grid/lattice/Lattice_arith.h
Line 230
[...]
Benchmark_memory_bandwidth: /qbigwork/bartek/code/grid_sycl/Grid/allocator/AlignedAllocator.h:101: _Tp* Grid::uvmAllocator<_Tp>::allocate(Grid::uvmAllocator<_Tp>::size_type, const void*) [with _Tp = Grid::iScalar<Grid::Grid_simd<double, Grid::GpuVector<8, double> > >; Grid::uvmAllocator<_Tp>::pointer = Grid::iScalar<Grid::Grid_simd<double, Grid::GpuVector<8, double> > >*; Grid::uvmAllocator<_Tp>::size_type = long unsigned int]: Assertion `( (_Tp*)ptr != (_Tp *)NULL )' failed.
cudaMallocManaged failed for 32768 an illegal memory access was encountered
[lnode15:31429] *** Process received signal ***
[lnode15:31429] Signal: Aborted (6)
[lnode15:31429] Signal code: (-6)
[lnode15:31429] [ 0] /lib/x86_64-linux-gnu/libpthread.so.0(+0x110c0)[0x7f94ed6f70c0]
[lnode15:31429] [ 1] /lib/x86_64-linux-gnu/libc.so.6(gsignal+0xcf)[0x7f94ec52dfcf]
[lnode15:31429] [ 2] /lib/x86_64-linux-gnu/libc.so.6(abort+0x16a)[0x7f94ec52f3fa]
[lnode15:31429] [ 3] /lib/x86_64-linux-gnu/libc.so.6(+0x2be37)[0x7f94ec526e37]
[lnode15:31429] [ 4] /lib/x86_64-linux-gnu/libc.so.6(+0x2bee2)[0x7f94ec526ee2]
[lnode15:31429] [ 5] ./Benchmark_memory_bandwidth(+0x18ae3)[0x55f0d8695ae3]
[lnode15:31429] [ 6] ./Benchmark_memory_bandwidth(+0xd46d)[0x55f0d868a46d]
[lnode15:31429] [ 7] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf1)[0x7f94ec51b2b1]
[lnode15:31429] [ 8] ./Benchmark_memory_bandwidth(+0xf9fa)[0x55f0d868c9fa]
[lnode15:31429] *** End of error message ***
I'm a bit at a loss, as I've been able to compile and test on my local machine, although with a different compiler and CUDA version (gcc 8.4.0 and CUDA 10.2).
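As a sanity check, a minimal standalone program (hypothetical, not from the Grid tree) that only exercises cudaMallocManaged plus a trivial kernel, built with the same nvcc flags, would rule out the CUDA 9.2 / driver setup on the node before blaming the SyCL branch:

// Hypothetical standalone check, not part of Grid: allocate managed memory,
// touch it from a trivial kernel, and report any CUDA error. If this already
// fails on the node, the problem sits below Grid.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void fill(double *p, size_t n) {
  size_t i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) p[i] = 1.0;
}

int main() {
  const size_t n = 1 << 20;
  double *p = nullptr;

  cudaError_t err = cudaMallocManaged(&p, n * sizeof(double));
  if (err != cudaSuccess) {
    printf("cudaMallocManaged failed: %s\n", cudaGetErrorString(err));
    return 1;
  }

  fill<<<(n + 255) / 256, 256>>>(p, n);
  err = cudaDeviceSynchronize();
  printf("kernel on managed memory: %s\n", cudaGetErrorString(err));

  cudaFree(p);
  return err == cudaSuccess ? 0 : 1;
}

Build it with the same architecture flags as the configure line, e.g. nvcc -gencode=arch=compute_60,code=compute_60 -std=c++11 managed_test.cu (filename is just an example).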
What system is this on?
The reason for asking is that IBM Spectrum MPI causes a segfault like this unless it is run under mpirun, even for one node.
Ran on develop tonight on Summit V100s (merged SyCL back into develop); getting poorer performance than expected, but it runs:
Grid : Message : 0.343734 s : ====================================================================================================
Grid : Message : 0.343767 s : = Benchmarking fused AXPY bandwidth ; sizeof(Real) 8
Grid : Message : 0.343796 s : ====================================================================================================
Grid : Message : 0.343825 s : L bytes GB/s Gflop/s seconds
Grid : Message : 0.343869 s : ----------------------------------------------------------
Grid : Message : 2.568905 s : 8 7.86e+05 46 3.84 2.21
Grid : Message : 2.844061 s : 16 1.26e+07 386 32.2 0.264
Grid : Message : 3.605300 s : 24 6.37e+07 683 56.9 0.149
Grid : Message : 3.159188 s : 32 2.01e+08 781 65.1 0.13
Grid : Message : 3.326329 s : 40 4.92e+08 824 68.7 0.123
Grid : Message : 3.529744 s : 48 1.02e+09 837 69.8 0.122
Grid : Message : 3.533171 s : ====================================================================================================
Grid : Message : 3.533187 s : = Benchmarking a*x + y bandwidth
Grid : Message : 3.533199 s : ====================================================================================================
Grid : Message : 3.533210 s : L bytes GB/s Gflop/s seconds
Grid : Message : 3.533226 s : ----------------------------------------------------------
Grid : Message : 6.188100 s : 8 7.86e+05 41.4 3.45 2.46
Grid : Message : 6.456361 s : 16 1.26e+07 231 19.2 0.442
Grid : Message : 6.776436 s : 24 6.37e+07 342 28.5 0.298
Grid : Message : 7.850990 s : 32 2.01e+08 373 31 0.273
Grid : Message : 7.407714 s : 40 4.92e+08 382 31.8 0.267
Grid : Message : 7.780928 s : 48 1.02e+09 386 32.2 0.264
Grid : Message : 7.784479 s : ====================================================================================================
Grid : Message : 7.784493 s : = Benchmarking SCALE bandwidth
Grid : Message : 7.784504 s : ====================================================================================================
Grid : Message : 7.784515 s : L bytes GB/s Gflop/s seconds
Grid : Message : 10.187901 s : 8 5.24e+05 28.3 1.77 2.4
Grid : Message : 10.592888 s : 16 8.39e+06 173 10.8 0.392
Grid : Message : 10.879673 s : 24 4.25e+07 256 16 0.266
Grid : Message : 11.155089 s : 32 1.34e+08 279 17.5 0.243
Grid : Message : 11.440638 s : 40 3.28e+08 287 17.9 0.237
Grid : Message : 11.769505 s : 48 6.79e+08 289 18.1 0.235
Grid : Message : 11.772239 s : ====================================================================================================
Grid : Message : 11.772253 s : = Benchmarking READ bandwidth
Grid : Message : 11.772264 s : ====================================================================================================
Grid : Message : 11.772276 s : L bytes GB/s Gflop/s seconds
Grid : Message : 11.772293 s : ----------------------------------------------------------
Grid : Message : 16.928076 s : 8 2.62e+05 6.6 1.65 5.15
Grid : Message : 17.312399 s : 16 4.19e+06 90.6 22.7 0.375
Grid : Message : 17.450996 s : 24 2.12e+07 267 66.7 0.127
Grid : Message : 17.553941 s : 32 6.71e+07 385 96.3 0.0881
Grid : Message : 17.670115 s : 40 1.64e+08 312 78.1 0.109
Grid : Message : 17.790481 s : 48 3.4e+08 302 75.6 0.112
This is on our local GPU cluster. I have a suspicion: could it be our OpenMPI version?
Open MPI repo revision: v2.0.1-579-gc849b37
Open MPI release date: Feb 20, 2017
Open RTE: 2.0.2a1
Open RTE repo revision: v2.0.1-579-gc849b37
Open RTE release date: Feb 20, 2017
OPAL: 2.0.2a1
OPAL repo revision: v2.0.1-579-gc849b37
OPAL release date: Feb 20, 2017
MPI API: 3.1.0
Ident string: 2.0.2a1
Prefix: /opt/openmpi-2.0.2a1-with-pmi
Different CUDA-aware MPIs can definitely behave differently depending on whether you initialise MPI or CUDA first. Are you sure you've got a CUDA-aware MPI?
Yes:
$ ompi_info --parsable --all | grep mpi_built_with_cuda_support:value
mca:mpi:base:param:mpi_built_with_cuda_support:value:true
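For the init-order question, a minimal sketch (hypothetical, not part of Grid) that mimics what the benchmarks do may help: MPI_Init first, device chosen from the node-local rank, then an MPI_Allreduce directly on a device buffer, which exercises the CUDA-aware path of this OpenMPI 2.0.2a1 build in isolation.

// Hypothetical test, not from the Grid repository: MPI_Init before any CUDA
// call, device selected from the node-local rank (as AcceleratorCudaInit does),
// then an MPI_Allreduce on a device pointer to exercise CUDA-aware MPI.
#include <cstdio>
#include <mpi.h>
#include <cuda_runtime.h>

int main(int argc, char **argv) {
  MPI_Init(&argc, &argv);

  int world_rank;
  MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);

  // Node-local communicator to get the rank on this node.
  MPI_Comm node_comm;
  MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, world_rank,
                      MPI_INFO_NULL, &node_comm);
  int node_rank;
  MPI_Comm_rank(node_comm, &node_rank);

  int ndev = 0;
  cudaGetDeviceCount(&ndev);
  if (ndev == 0) { printf("no CUDA devices visible\n"); MPI_Abort(MPI_COMM_WORLD, 1); }
  cudaSetDevice(node_rank % ndev);

  // Device buffer handed straight to MPI: only works with CUDA-aware MPI.
  double *dbuf = nullptr;
  cudaMalloc(&dbuf, sizeof(double));
  double one = 1.0;
  cudaMemcpy(dbuf, &one, sizeof(double), cudaMemcpyHostToDevice);

  MPI_Allreduce(MPI_IN_PLACE, dbuf, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

  double sum = 0.0;
  cudaMemcpy(&sum, dbuf, sizeof(double), cudaMemcpyDeviceToHost);
  if (world_rank == 0) printf("allreduce over device memory gave %f\n", sum);

  cudaFree(dbuf);
  MPI_Comm_free(&node_comm);
  MPI_Finalize();
  return 0;
}

Built the same way as Grid (nvcc -ccbin=mpicxx -gencode=arch=compute_60,code=compute_60 -std=c++11) and launched under mpirun, even for a single rank, this separates the MPI question from the Grid code path.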