[WIP] Boost dash::sort with even more parallelism
We need more parallelism to exploit the power of many-core nodes. The underlying algorithm itself will be rewritten to eliminate barriers. This includes the following major changes:
- Algorithmic improvements:
  - [x] reduce communication overhead (`alltoall` communications)
  - [x] overlap communication and the final merge step as efficiently as possible (@pascalj)
  - [ ] Instead of perfect partitioning we provide another variant where we do not require an in-place sort. Users can provide a larger output buffer with the same pattern, but each unit has a certain threshold of additional local storage. Example: `dash::sort(first, last, out, sort_hash)`.
- Minor changes
  - integrate Intel Parallel STL (based on Intel TBB) into DASH to exploit shared memory parallelism more efficiently.
  - Allow `DART_UNDEFINED_UNIT_ID` as a valid unit for DART communication routines #617
- Further Impacts
  - Thread support is now enabled by default in CI. Other tests fail; this needs some investigation.
Note: This list will grow.
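To illustrate the "final merge step" mentioned above: after the exchange, a unit typically holds several sorted runs (for example one per sending unit) stored back to back in its local buffer. The following is a minimal, generic sketch of a pairwise merge over such a buffer. It is not the DASH implementation; `merge_runs` and the `bounds` layout are hypothetical names chosen for this example.

```cpp
#include <algorithm>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical helper, not part of DASH.
// `buf` holds k sorted runs back to back; run i occupies
// [bounds[i], bounds[i + 1]), so `bounds` has k + 1 entries.
// Adjacent runs are merged pairwise until a single sorted run remains.
template <typename T>
void merge_runs(std::vector<T>& buf, std::vector<std::size_t> bounds) {
  using diff_t = std::ptrdiff_t;
  while (bounds.size() > 2) {
    std::vector<std::size_t> next;
    std::size_t i = 0;
    // Merge runs (i, i + 1) for i = 0, 2, 4, ...
    for (; i + 2 < bounds.size(); i += 2) {
      std::inplace_merge(buf.begin() + static_cast<diff_t>(bounds[i]),
                         buf.begin() + static_cast<diff_t>(bounds[i + 1]),
                         buf.begin() + static_cast<diff_t>(bounds[i + 2]));
      next.push_back(bounds[i]);
    }
    // An odd number of runs leaves one unmerged run at the end.
    if (i + 1 < bounds.size()) {
      next.push_back(bounds[i]);
    }
    next.push_back(bounds.back());
    bounds = std::move(next);
  }
}
```

Each pairwise merge only depends on its two input runs, so in principle a merge can start as soon as both runs have arrived while other transfers are still in flight. How the actual implementation schedules this is not shown here.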
Codecov Report
Merging #611 into development will decrease coverage by `0.38%`. The diff coverage is `81.99%`.
@@ Coverage Diff @@
## development #611 +/- ##
===============================================
- Coverage 84.95% 84.57% -0.39%
===============================================
Files 335 344 +9
Lines 24821 25028 +207
Branches 11497 11285 -212
===============================================
+ Hits 21087 21167 +80
- Misses 3733 3851 +118
- Partials 1 10 +9
| Impacted Files | Coverage Δ | |
|---|---|---|
| dash/include/dash/iterator/internal/GlobPtrBase.h | 91.2% <ø> (-0.1%) | :arrow_down: |
| dash/include/dash/internal/Logging.h | 100% <ø> (ø) | :arrow_up: |
| dash/include/cpp17/monotonic_buffer.h | 0% <0%> (ø) | |
| dash/src/cpp17/monotonic_buffer.cc | 0% <0%> (ø) | |
| dash/include/dash/algorithm/sort/Histogram.h | 100% <100%> (ø) | |
| dash/include/dash/algorithm/sort/Sampling.h | 100% <100%> (ø) | |
| dash/test/algorithm/SortTest.cc | 98.26% <100%> (+0.45%) | :arrow_up: |
| dash/include/dash/algorithm/sort/Communication.h | 100% <100%> (ø) | |
| dash/include/dash/algorithm/sort/Types.h | 100% <100%> (ø) | |
| dash/include/dash/algorithm/sort/Sort-inl.h | 100% <100%> (ø) | |
| ... and 19 more | | |
We definitely need a way to configure the number of threads from the outside at runtime, e.g., through an environment variable. That inevitably leads to the wider question of how we want to handle runtime configuration (config files? env variables? both? who is in charge of the parsing? right now DART and DASH both do their own thing but that is sub-optimal...)
Currently we support three environment variables to configure multi-threading, which is built into our locality component. See the documentation of `UnitLocality.num_domain_threads()`:
- `DASH_DISABLE_THREADS`: If set, disables multi-threading at unit scope and this method returns 1.
- `DASH_MAX_SMT`: If set, virtual SMT CPUs (hyperthreads) instead of physical cores are used to determine available threads.
- `DASH_MAX_UNIT_THREADS`: Specifies the maximum number of threads available to a single unit.
I suppose this is also built into DART somehow since the locality interface is implemented down there.
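
As a rough illustration of how these three variables could interact, here is a minimal sketch. It is not the DASH or DART implementation: `threads_from_env` is a hypothetical helper, the core counts are passed in by the caller instead of being queried from the locality interface, and the precedence (disable first, then SMT, then the cap) is an assumption based only on the descriptions above.

```cpp
#include <algorithm>
#include <cstdlib>

// Hypothetical helper, not part of DASH or DART.
// physical_cores / smt_threads: counts the caller obtained elsewhere
// (in DASH these would come from the locality information).
int threads_from_env(int physical_cores, int smt_threads) {
  // DASH_DISABLE_THREADS: mere presence disables multi-threading.
  if (std::getenv("DASH_DISABLE_THREADS") != nullptr) {
    return 1;
  }
  // DASH_MAX_SMT: count hyperthreads instead of physical cores.
  int n = (std::getenv("DASH_MAX_SMT") != nullptr) ? smt_threads
                                                   : physical_cores;
  // DASH_MAX_UNIT_THREADS: upper bound on threads per unit.
  // Assumption: values that are not positive integers are ignored.
  if (const char* cap = std::getenv("DASH_MAX_UNIT_THREADS")) {
    char* end = nullptr;
    const long v = std::strtol(cap, &end, 10);
    if (end != cap && v > 0) {
      n = static_cast<int>(std::min<long>(n, v));
    }
  }
  return std::max(n, 1);
}
```

For example, on a node with 4 physical cores and 8 hyperthreads, setting `DASH_MAX_SMT` and `DASH_MAX_UNIT_THREADS=2` would yield 2 threads under these assumptions.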