illegal memory access, SyCL branch, CUDA 9.2, gcc 6.3.0
When I configure the SyCL branch as follows to test on a cluster with NVIDIA P100 GPUs
CXX=nvcc \
MPICXX=mpicxx \
CXXFLAGS="-ccbin=mpicxx -gencode=arch=compute_60,code=compute_60 -std=c++11" \
~/code/grid_sycl/configure \
--enable-precision=double \
--enable-simd=GPU \
--enable-accelerator=cuda \
--enable-comms=mpi \
--prefix=$(pwd)/install_dir
and attempt to run any of the benchmark executables on a P100, I encounter illegal memory accesses. Is that a known issue?
AcceleratorCudaInit: ========================
AcceleratorCudaInit: Device Number : 0
AcceleratorCudaInit: ========================
AcceleratorCudaInit: Device identifier: Tesla P100-PCIE-16GB
AcceleratorCudaInit: totalGlobalMem: 17071734784
AcceleratorCudaInit: managedMemory: 1
AcceleratorCudaInit: isMultiGpuBoard: 0
AcceleratorCudaInit: warpSize: 32
AcceleratorCudaInit: setting device to node rank
AcceleratorCudaInit: ================================================
SharedMemoryMpi: World communicator of size 1
SharedMemoryMpi: Node communicator of size 1
SharedMemoryMpi: SharedMemoryMPI.cc cudaMalloc 1073741824bytes at 0x7f9462000000 for comms buffers
[...]
Grid : Message : ================================================
Grid : Message : MPI is initialised and logging filters activated
Grid : Message : ================================================
Grid : Message : MemoryManager::Init() setting up
Grid : Message : MemoryManager::Init() cache pool for recent allocations: SMALL 32 LARGE 8
Grid : Message : MemoryManager::Init() Unified memory space
Grid : Message : MemoryManager::Init() Using cudaMallocManaged
Grid : Message : Requested 1073741824 byte stencil comms buffers
Grid : Message : 0.808829 s : Grid is setup to use 28 threads
Grid : Message : 0.808843 s : ====================================================================================================
Grid : Message : 0.808845 s : = Benchmarking fused AXPY bandwidth ; sizeof(Real) 8
Grid : Message : 0.808847 s : ====================================================================================================
Grid : Message : 0.808849 s : L bytes GB/s Gflop/s seconds
Grid : Message : 0.808853 s : ----------------------------------------------------------
Cuda error an illegal memory access was encountered
/qbigwork/bartek/code/grid_sycl/Grid/lattice/Lattice_arith.h
Line 230
[...]
Benchmark_memory_bandwidth: /qbigwork/bartek/code/grid_sycl/Grid/allocator/AlignedAllocator.h:101: _Tp* Grid::uvmAllocator<_Tp>::allocate(Grid::uvmAllocator<_Tp>::size_type, const void*) [with _Tp = Grid::iScalar<Grid::Grid_simd<double, Grid::GpuVector<8, double> > >; Grid::uvmAllocator<_Tp>::pointer = Grid::iScalar<Grid::Grid_simd<double, Grid::GpuVector<8, double> > >*; Grid::uvmAllocator<_Tp>::size_type = long unsigned int]: Assertion `( (_Tp*)ptr != (_Tp *)NULL )' failed.
cudaMallocManaged failed for 32768 an illegal memory access was encountered
[lnode15:31429] *** Process received signal ***
[lnode15:31429] Signal: Aborted (6)
[lnode15:31429] Signal code: (-6)
[lnode15:31429] [ 0] /lib/x86_64-linux-gnu/libpthread.so.0(+0x110c0)[0x7f94ed6f70c0]
[lnode15:31429] [ 1] /lib/x86_64-linux-gnu/libc.so.6(gsignal+0xcf)[0x7f94ec52dfcf]
[lnode15:31429] [ 2] /lib/x86_64-linux-gnu/libc.so.6(abort+0x16a)[0x7f94ec52f3fa]
[lnode15:31429] [ 3] /lib/x86_64-linux-gnu/libc.so.6(+0x2be37)[0x7f94ec526e37]
[lnode15:31429] [ 4] /lib/x86_64-linux-gnu/libc.so.6(+0x2bee2)[0x7f94ec526ee2]
[lnode15:31429] [ 5] ./Benchmark_memory_bandwidth(+0x18ae3)[0x55f0d8695ae3]
[lnode15:31429] [ 6] ./Benchmark_memory_bandwidth(+0xd46d)[0x55f0d868a46d]
[lnode15:31429] [ 7] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf1)[0x7f94ec51b2b1]
[lnode15:31429] [ 8] ./Benchmark_memory_bandwidth(+0xf9fa)[0x55f0d868c9fa]
[lnode15:31429] *** End of error message ***
I'm a bit at a loss, as I've been able to compile and test on my local machine, although with a different compiler and CUDA version (gcc 8.4.0 and CUDA 10.2).
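As a sanity check, a minimal standalone program (hypothetical, not from the Grid tree) that only exercises cudaMallocManaged plus a trivial kernel, built with the same nvcc flags, would rule out the CUDA 9.2 / driver setup on the node before blaming the SyCL branch:

// Hypothetical standalone check, not part of Grid: allocate managed memory,
// touch it from a trivial kernel, and report any CUDA error. If this already
// fails on the node, the problem sits below Grid.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void fill(double *p, size_t n) {
  size_t i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) p[i] = 1.0;
}

int main() {
  const size_t n = 1 << 20;
  double *p = nullptr;

  cudaError_t err = cudaMallocManaged(&p, n * sizeof(double));
  if (err != cudaSuccess) {
    printf("cudaMallocManaged failed: %s\n", cudaGetErrorString(err));
    return 1;
  }

  fill<<<(n + 255) / 256, 256>>>(p, n);
  err = cudaDeviceSynchronize();
  printf("kernel on managed memory: %s\n", cudaGetErrorString(err));

  cudaFree(p);
  return err == cudaSuccess ? 0 : 1;
}

Build it with the same architecture flags as the configure line, e.g. nvcc -gencode=arch=compute_60,code=compute_60 -std=c++11 managed_test.cu (filename is just an example).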
What system is this on?
The reason for asking is that IBM Spectrum MPI causes a segfault like this unless it is run under mpirun, even for one node.
Ran on develop tonight on Summit V100s (merged SyCL back into develop); getting poorer performance than expected, but it runs:
Grid : Message : 0.343734 s : ====================================================================================================
Grid : Message : 0.343767 s : = Benchmarking fused AXPY bandwidth ; sizeof(Real) 8
Grid : Message : 0.343796 s : ====================================================================================================
Grid : Message : 0.343825 s : L bytes GB/s Gflop/s seconds
Grid : Message : 0.343869 s : ----------------------------------------------------------
Grid : Message : 2.568905 s : 8 7.86e+05 46 3.84 2.21
Grid : Message : 2.844061 s : 16 1.26e+07 386 32.2 0.264
Grid : Message : 3.605300 s : 24 6.37e+07 683 56.9 0.149
Grid : Message : 3.159188 s : 32 2.01e+08 781 65.1 0.13
Grid : Message : 3.326329 s : 40 4.92e+08 824 68.7 0.123
Grid : Message : 3.529744 s : 48 1.02e+09 837 69.8 0.122
Grid : Message : 3.533171 s : ====================================================================================================
Grid : Message : 3.533187 s : = Benchmarking a*x + y bandwidth
Grid : Message : 3.533199 s : ====================================================================================================
Grid : Message : 3.533210 s : L bytes GB/s Gflop/s seconds
Grid : Message : 3.533226 s : ----------------------------------------------------------
Grid : Message : 6.188100 s : 8 7.86e+05 41.4 3.45 2.46
Grid : Message : 6.456361 s : 16 1.26e+07 231 19.2 0.442
Grid : Message : 6.776436 s : 24 6.37e+07 342 28.5 0.298
Grid : Message : 7.850990 s : 32 2.01e+08 373 31 0.273
Grid : Message : 7.407714 s : 40 4.92e+08 382 31.8 0.267
Grid : Message : 7.780928 s : 48 1.02e+09 386 32.2 0.264
Grid : Message : 7.784479 s : ====================================================================================================
Grid : Message : 7.784493 s : = Benchmarking SCALE bandwidth
Grid : Message : 7.784504 s : ====================================================================================================
Grid : Message : 7.784515 s : L bytes GB/s Gflop/s seconds
Grid : Message : 10.187901 s : 8 5.24e+05 28.3 1.77 2.4
Grid : Message : 10.592888 s : 16 8.39e+06 173 10.8 0.392
Grid : Message : 10.879673 s : 24 4.25e+07 256 16 0.266
Grid : Message : 11.155089 s : 32 1.34e+08 279 17.5 0.243
Grid : Message : 11.440638 s : 40 3.28e+08 287 17.9 0.237
Grid : Message : 11.769505 s : 48 6.79e+08 289 18.1 0.235
Grid : Message : 11.772239 s : ====================================================================================================
Grid : Message : 11.772253 s : = Benchmarking READ bandwidth
Grid : Message : 11.772264 s : ====================================================================================================
Grid : Message : 11.772276 s : L bytes GB/s Gflop/s seconds
Grid : Message : 11.772293 s : ----------------------------------------------------------
Grid : Message : 16.928076 s : 8 2.62e+05 6.6 1.65 5.15
Grid : Message : 17.312399 s : 16 4.19e+06 90.6 22.7 0.375
Grid : Message : 17.450996 s : 24 2.12e+07 267 66.7 0.127
Grid : Message : 17.553941 s : 32 6.71e+07 385 96.3 0.0881
Grid : Message : 17.670115 s : 40 1.64e+08 312 78.1 0.109
Grid : Message : 17.790481 s : 48 3.4e+08 302 75.6 0.112
This is on our local GPU cluster. I have a suspicion: could it be our OpenMPI version?
Open MPI repo revision: v2.0.1-579-gc849b37
Open MPI release date: Feb 20, 2017
Open RTE: 2.0.2a1
Open RTE repo revision: v2.0.1-579-gc849b37
Open RTE release date: Feb 20, 2017
OPAL: 2.0.2a1
OPAL repo revision: v2.0.1-579-gc849b37
OPAL release date: Feb 20, 2017
MPI API: 3.1.0
Ident string: 2.0.2a1
Prefix: /opt/openmpi-2.0.2a1-with-pmi
Different CUDA-aware MPIs can definitely behave differently depending on whether you initialise MPI or CUDA first. Are you sure you've got a CUDA-aware MPI?
Yes:
$ ompi_info --parsable --all | grep mpi_built_with_cuda_support:value
mca:mpi:base:param:mpi_built_with_cuda_support:value:true
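For the init-order question, a minimal sketch (hypothetical, not part of Grid) that mimics what the benchmarks do may help: MPI_Init first, device chosen from the node-local rank, then an MPI_Allreduce directly on a device buffer, which exercises the CUDA-aware path of this OpenMPI 2.0.2a1 build in isolation.

// Hypothetical test, not from the Grid repository: MPI_Init before any CUDA
// call, device selected from the node-local rank (as AcceleratorCudaInit does),
// then an MPI_Allreduce on a device pointer to exercise CUDA-aware MPI.
#include <cstdio>
#include <mpi.h>
#include <cuda_runtime.h>

int main(int argc, char **argv) {
  MPI_Init(&argc, &argv);

  int world_rank;
  MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);

  // Node-local communicator to get the rank on this node.
  MPI_Comm node_comm;
  MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, world_rank,
                      MPI_INFO_NULL, &node_comm);
  int node_rank;
  MPI_Comm_rank(node_comm, &node_rank);

  int ndev = 0;
  cudaGetDeviceCount(&ndev);
  if (ndev == 0) { printf("no CUDA devices visible\n"); MPI_Abort(MPI_COMM_WORLD, 1); }
  cudaSetDevice(node_rank % ndev);

  // Device buffer handed straight to MPI: only works with CUDA-aware MPI.
  double *dbuf = nullptr;
  cudaMalloc(&dbuf, sizeof(double));
  double one = 1.0;
  cudaMemcpy(dbuf, &one, sizeof(double), cudaMemcpyHostToDevice);

  MPI_Allreduce(MPI_IN_PLACE, dbuf, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

  double sum = 0.0;
  cudaMemcpy(&sum, dbuf, sizeof(double), cudaMemcpyDeviceToHost);
  if (world_rank == 0) printf("allreduce over device memory gave %f\n", sum);

  cudaFree(dbuf);
  MPI_Comm_free(&node_comm);
  MPI_Finalize();
  return 0;
}

Built the same way as Grid (nvcc -ccbin=mpicxx -gencode=arch=compute_60,code=compute_60 -std=c++11) and launched under mpirun, even for a single rank, this separates the MPI question from the Grid code path.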