Analyse the cma_read_nbytes == size assert in intel_transport_recv.h at line 1160
Update: bug filed in the Intel MPI Jira: https://jira.devtools.intel.com/browse/IMPI-4619
When running on devcloud
ctest -R mhp-sycl-sort-tests-3
on branch https://github.com/lslusarczyk/distributed-ranges/tree/mateusz_sort_expose_mpi_assert,
we hit the following (a condensed reproduction sketch is given after the backtrace):
Assertion failed in file ../../src/mpid/ch4/shm/posix/eager/include/intel_transport_recv.h at line 1160: cma_read_nbytes == size
/opt/intel/oneapi/mpi/2021.10.0//lib/release/libmpi.so.12(MPL_backtrace_show+0x1c) [0x14c1a5a7236c]
/opt/intel/oneapi/mpi/2021.10.0//lib/release/libmpi.so.12(MPIR_Assert_fail+0x21) [0x14c1a5429131]
/opt/intel/oneapi/mpi/2021.10.0//lib/release/libmpi.so.12(+0xb22e38) [0x14c1a5922e38]
/opt/intel/oneapi/mpi/2021.10.0//lib/release/libmpi.so.12(+0xb1fa41) [0x14c1a591fa41]
/opt/intel/oneapi/mpi/2021.10.0//lib/release/libmpi.so.12(+0xb1cd4d) [0x14c1a591cd4d]
/opt/intel/oneapi/mpi/2021.10.0//lib/release/libmpi.so.12(+0x2f58b4) [0x14c1a50f58b4]
/opt/intel/oneapi/mpi/2021.10.0//lib/release/libmpi.so.12(PMPI_Wait+0x41f) [0x14c1a56816af]
./mhp-tests() [0x5c863d]
./mhp-tests() [0x58e124]
./mhp-tests() [0x6cdd0c]
./mhp-tests() [0x75676c]
./mhp-tests() [0x7374c5]
./mhp-tests() [0x738b33]
./mhp-tests() [0x73974f]
./mhp-tests() [0x74df0f]
./mhp-tests() [0x74cfcb]
./mhp-tests() [0x472f7f]
/lib/x86_64-linux-gnu/libc.so.6(+0x29d90) [0x14c1a3ce3d90]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x80) [0x14c1a3ce3e40]
./mhp-tests() [0x46f005]
Abort(1) on node 0: Internal error
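For reference, a condensed reproduction sketch (assuming the branch builds with the repository's usual CMake flow; the exact configure options are an assumption and are omitted here):
git clone -b mateusz_sort_expose_mpi_assert https://github.com/lslusarczyk/distributed-ranges.git
cd distributed-ranges
cmake -B build && cmake --build build -j   # usual CMake configure/build; exact options are an assumption
cd build && ctest -R mhp-sycl-sort-tests-3   # hits the cma_read_nbytes == size assert on devcloud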
Some links to useful Intel MPI documentation, tips, and hacks:
Intel® MPI for GPU Clusters (article): https://www.intel.com/content/www/us/en/docs/oneapi/optimization-guide-gpu/2023-2/intel-mpi-for-gpu-clusters.html
Environment variables influencing the way GPU support works:
https://www.intel.com/content/www/us/en/docs/mpi-library/developer-reference-linux/2021-10/gpu-support.html
https://www.intel.com/content/www/us/en/docs/mpi-library/developer-reference-linux/2021-10/gpu-buffers-support.html
https://www.intel.com/content/www/us/en/docs/mpi-library/developer-reference-linux/2021-10/gpu-pinning.html
Still, I found a tip pointing to a solution here: https://community.intel.com/t5/Intel-oneAPI-HPC-Toolkit/intel-mpi-error-line-1334-cma-read-nbytes-size/m-p/1329220
export I_MPI_SHM_CMA=0 helped in some cases (though the behaviour does not seem fully deterministic; it may depend on which devcloud node is assigned for execution).
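A hedged example of applying the workaround to the failing test (ctest launches mpirun, which inherits the variable from the environment):
export I_MPI_SHM_CMA=0   # disable CMA (Cross Memory Attach) in the shm transport
ctest -R mhp-sycl-sort-tests-3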
People had similar problems in the past: https://community.intel.com/t5/Intel-oneAPI-HPC-Toolkit/Intel-oneAPI-2021-4-SHM-Issue/m-p/1324805
When setting the env vars to:
export I_MPI_FABRICS=shm
export I_MPI_SHM_CMA=0
export I_MPI_OFFLOAD=1
You may also encounter:
Assertion failed in file ../../src/mpid/ch4/shm/posix/eager/include/intel_transport_send.h at line 2012: FALSE
...
Still, the simple solution of copying memory from device to host before communication is counterproductive, as IMPI supports GPU-GPU communication (see https://www.intel.com/content/www/us/en/docs/mpi-library/developer-reference-linux/2021-10/gpu-buffers-support.html#SECTION_3F5D70BDEFF84E3A84325A319BA53536)
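For contrast, a minimal sketch of the GPU-GPU path argued for here: keep buffers in device memory and let IMPI move them directly (this reuses the benchmark invocation that later succeeds with IMPI 2021.11, as described below):
export I_MPI_OFFLOAD=1   # enable IMPI GPU support so device allocations can be passed to MPI calls directly
mpirun -n 2 ./build/benchmarks/gbench/mhp/mhp-bench --sycl --benchmark_filter=Sort_DR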
Blocked: unable to install MPI 2021.11; we don't know where to get it from.
It's here: http://anpfclxlin02.an.intel.com/rscohn1/
The problem with the assert in intel_transport_send.h at line 2012 is solved in IMPI 2021.11 (tested on devcloud, with IMPI 2021.11 installed in the home dir).
2021.11 will be published on 11/17
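A sketch of how the private home-dir install of IMPI 2021.11 can be activated before running (the install path below is an assumption; adjust it to wherever 2021.11 was unpacked):
source ~/intel/oneapi/mpi/2021.11/env/vars.sh   # path is an assumption
which mpirun   # should resolve into the 2021.11 install, not /opt/intel/oneapi/mpi/2021.10.0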
I_MPI_OFFLOAD=0 mpirun -n 2 ./build/benchmarks/gbench/mhp/mhp-bench --sycl --benchmark_filter=Sort_DR -> Assertion failed in file ../../src/mpid/ch4/shm/posix/eager/include/intel_transport_recv.h at line 1175: cma_read_nbytes == size
However, with I_MPI_OFFLOAD=1 (which should be used with IMPI on GPU), the Sort benchmark runs successfully (devcloud, single server, multi-GPU; IMPI 2021.11 private install).
This is how I set I_MPI_OFFLOAD for the device memory tests: https://github.com/oneapi-src/distributed-ranges/blob/6ad80e79a37be37634f9c0b8f71a62a3b2e73862/CMakeLists.txt#L216
I was told that for the 2021.11 release we can set I_MPI_OFFLOAD=1 all the time and it will not cause an error. I will get rid of this function and set I_MPI_OFFLOAD=1 in the CI script.
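A minimal sketch of the planned CI change (the exact CI script and ctest invocation are assumptions):
export I_MPI_OFFLOAD=1   # per the note above, safe to set unconditionally with IMPI 2021.11
ctest --output-on-failure   # run the test suite with GPU support enabled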