oneDAL

MPI GPU interface refactoring

Open ethanglaser opened this issue 2 years ago • 45 comments

Description

Changes proposed in this pull request:

  • Add a virtual get_mpi_offload_support function to the base communicator; it defaults to false in nearly all cases
  • Add logic to get_mpi_offload_support in mpi/communicator.h that checks the MPI libs for the correct symbol to determine whether Level Zero is supported
  • Add a conditional in detail/communicator.cpp that uses the result of get_mpi_offload_support to decide whether to convert data to host (the previous default) or leave it as is (which improves performance when MPI supports GPU offload)
  • Extend the sendrecv_replace arguments with an optional additional buffer, to accommodate the MPICH workaround of calling sendrecv with 2 GPU buffers
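
The dispatch described in the first three bullets can be sketched roughly as below. This is a simplified illustration, not the code in this PR. The class names `communicator_base` and `mpi_communicator`, the `gpu_aware` constructor flag, and the helper `needs_host_copy` are hypothetical stand-ins; only the name `get_mpi_offload_support` comes from the description. The real check inspects the MPI libraries for a symbol, which the sketch replaces with a plain boolean.

```cpp
// Hypothetical stand-in for the base communicator interface.
class communicator_base {
public:
    virtual ~communicator_base() = default;

    // Defaults to false, so a backend must opt in before device
    // buffers are handed to it directly.
    virtual bool get_mpi_offload_support() const {
        return false;
    }
};

// Hypothetical MPI-backed communicator. In the PR the answer comes from
// probing the MPI libraries; here it is an injected flag (an assumption
// made to keep the sketch self-contained).
class mpi_communicator : public communicator_base {
public:
    explicit mpi_communicator(bool gpu_aware) : gpu_aware_(gpu_aware) {}

    bool get_mpi_offload_support() const override {
        return gpu_aware_;
    }

private:
    bool gpu_aware_;
};

// Hypothetical helper mirroring the conditional: device data is staged
// through host memory only when the communicator cannot take GPU buffers.
inline bool needs_host_copy(const communicator_base& comm, bool data_on_device) {
    return data_on_device && !comm.get_mpi_offload_support();
}
```

With this shape, existing communicators keep the previous host-staging behavior without modification, and only a backend that overrides the virtual function skips the copy.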

ethanglaser avatar Nov 14 '23 21:11 ethanglaser

/intelci: run

ethanglaser avatar Nov 14 '23 21:11 ethanglaser

Looks like a great opportunity to get more speedup across all algorithms

Thanks! Yeah, it's pretty ugly right now; working towards functional first, then will clean things up. But good points.

ethanglaser avatar Nov 15 '23 22:11 ethanglaser

/intelci: run

ethanglaser avatar Dec 19 '23 14:12 ethanglaser

/intelci: run

ethanglaser avatar Jan 02 '24 14:01 ethanglaser

/intelci: run

ethanglaser avatar Jan 02 '24 16:01 ethanglaser

/intelci: run

ethanglaser avatar Jan 02 '24 17:01 ethanglaser

/intelci: run

ethanglaser avatar Jan 03 '24 22:01 ethanglaser

/intelci: run

ethanglaser avatar Jan 08 '24 16:01 ethanglaser

/intelci: run

ethanglaser avatar Jan 11 '24 16:01 ethanglaser

/intelci: run

ethanglaser avatar Jan 29 '24 19:01 ethanglaser

/intelci: run

ethanglaser avatar Jan 30 '24 18:01 ethanglaser

/intelci: run

ethanglaser avatar Feb 13 '24 18:02 ethanglaser

/intelci: run

ethanglaser avatar Mar 06 '24 15:03 ethanglaser

/intelci: run

ethanglaser avatar Mar 06 '24 16:03 ethanglaser

/intelci: run

ethanglaser avatar Mar 06 '24 21:03 ethanglaser

/intelci: run

ethanglaser avatar Mar 07 '24 01:03 ethanglaser

Nightly combined with infra branch: http://intel-ci.intel.com/eedcda75-51a4-f11e-8bab-a4bf010d0e2e

ethanglaser avatar Mar 07 '24 23:03 ethanglaser

/intelci: run

Alexandr-Solovev avatar Apr 03 '24 09:04 Alexandr-Solovev

/intelci: run

ethanglaser avatar Apr 09 '24 15:04 ethanglaser

/intelci: run

ethanglaser avatar Apr 25 '24 17:04 ethanglaser

/intelci: run

ethanglaser avatar May 02 '24 19:05 ethanglaser

/intelci: run

ethanglaser avatar May 06 '24 17:05 ethanglaser

Final steps are to check MPICH scalability with the alternative approach, confirm the infra changes, and determine whether it's necessary to add any additional conditions for using offloading.

ethanglaser avatar May 06 '24 22:05 ethanglaser

/intelci: run

ethanglaser avatar May 08 '24 21:05 ethanglaser

/intelci: run

ethanglaser avatar May 08 '24 22:05 ethanglaser

/intelci: run

ethanglaser avatar May 08 '24 23:05 ethanglaser

/intelci: run

ethanglaser avatar May 09 '24 17:05 ethanglaser

/intelci: run

ethanglaser avatar May 09 '24 17:05 ethanglaser

/intelci: run

ethanglaser avatar May 09 '24 18:05 ethanglaser

/intelci: run

ethanglaser avatar May 09 '24 19:05 ethanglaser