Add MPI speedup to pysteps blending
At RMI we are aiming to run pysteps in an operational environment. One thing that could improve compute time is adding optional MPI support. More info will be provided later in this thread by @mpvginde.
The STEPS blending routine was recently refactored for clarity, but its performance footprint is still unknown. The historical “MPI version”, which lived in a separate steps_mpi.py file, tried to gain speed via mpi4py, while the new code is cleaner but still single-process. Before we invest engineering time in reviving or rewriting a full MPI backend, we need solid evidence that parallelisation is worth the maintenance cost, and that MPI is the right flavour of parallelism.
Below is a suggestion for the next steps:
1 · Candidate approaches
| Option | One-line description |
|---|---|
| mpi4py (manual scaling) | Revive steps_mpi through a clean adapter; launch with mpirun (sketch below). |
| xarray → dask.array → Dask scheduler | Store fields as xarray objects and let Dask manage chunked execution across cores/nodes (sketch below). |
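For reference, a minimal sketch of what the mpi4py adapter could look like: ensemble members are distributed round-robin over MPI ranks and the results gathered on rank 0. Note that `blend_member` and `run_blending_mpi` are hypothetical placeholders, not the actual pysteps API.

```python
# Minimal sketch of an mpi4py adapter; hypothetical, not the pysteps API.
# Each rank computes its share of the ensemble members; rank 0 gathers them.
# Launch with e.g.: mpirun -np 4 python run_blending_mpi.py
import numpy as np
from mpi4py import MPI


def blend_member(member_idx, n_timesteps=12, shape=(200, 200)):
    """Placeholder for the per-member STEPS blending computation."""
    rng = np.random.default_rng(member_idx)
    return rng.standard_normal((n_timesteps, *shape))


def run_blending_mpi(n_members=12):
    comm = MPI.COMM_WORLD
    rank, size = comm.Get_rank(), comm.Get_size()

    # Round-robin assignment of ensemble members to ranks.
    my_results = {m: blend_member(m) for m in range(rank, n_members, size)}

    # Gather the per-rank results on rank 0 and restore member order.
    gathered = comm.gather(my_results, root=0)
    if rank == 0:
        merged = {m: field for part in gathered for m, field in part.items()}
        return [merged[m] for m in sorted(merged)]
    return None


if __name__ == "__main__":
    forecast = run_blending_mpi()
    if forecast is not None:
        print(f"Gathered {len(forecast)} ensemble members on rank 0")
```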
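And a comparable sketch of the xarray/Dask route, with one chunk per ensemble member so the scheduler can blend members in parallel on threads, processes, or a dask.distributed cluster. Here too `blend_member` is a hypothetical stand-in for the real blending kernel.

```python
# Sketch of the xarray + dask.array route; blend_member is a hypothetical
# stand-in for the real blending kernel.
import dask.array as da
import numpy as np
import xarray as xr


def blend_member(block):
    """Placeholder per-member computation, applied to one chunk at a time."""
    return np.tanh(block)  # stands in for the actual blending math


n_members, n_timesteps, ny, nx = 12, 12, 200, 200

# One chunk per ensemble member, so members can run in parallel.
precip = xr.DataArray(
    da.random.random(
        (n_members, n_timesteps, ny, nx),
        chunks=(1, n_timesteps, ny, nx),
    ),
    dims=("ens_member", "time", "y", "x"),
)

# apply_ufunc builds the task graph lazily; the Dask scheduler then
# executes the chunks across cores (or nodes, with dask.distributed).
blended = xr.apply_ufunc(
    blend_member,
    precip,
    dask="parallelized",
    output_dtypes=[precip.dtype],
)

result = blended.compute()  # triggers the parallel execution
print(result.shape)
```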
2 · Benchmark we must run
Measure wall-clock time and peak RAM for an identical 60-min, 12-member forecast on two machines (a laptop with 8 cores and an HPC node with 2 × 20 cores).
| Label | What we test |
|---|---|
| A | Old master branch (pre-refactor) |
| B | New master branch (current, single-process) |
| C | Old steps_mpi branch (pre-refactor) |
| D | Branch B plus prototype MPI adapter (same code paths, split over ranks) |
| E | Branch B plus prototype xarray/Dask code (same code paths, split over chunks/workers) |
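A possible shape for the repro script requested below: time the forecast call, record peak RSS, and append one CSV row per run. `run_forecast` is a placeholder for whichever configuration (A–E) is being exercised; the rest uses only the standard library (note that `resource` is Unix-only, which is fine for the laptop/HPC targets).

```python
# Sketch of a benchmark harness: wall-clock time and peak RAM per run,
# appended to a CSV file. run_forecast() is a placeholder for the actual
# forecast call of whichever configuration (A-E) is under test.
import csv
import resource  # Unix-only; ru_maxrss is in kB on Linux, bytes on macOS
import time
from pathlib import Path


def run_forecast():
    """Placeholder: call the blending routine of the branch under test."""
    time.sleep(0.1)


def benchmark(label, n_repeats=3, csv_path="timings.csv"):
    rows = []
    for repeat in range(n_repeats):
        t0 = time.perf_counter()
        run_forecast()
        wall = time.perf_counter() - t0
        peak_rss = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
        rows.append({"label": label, "repeat": repeat,
                     "wall_s": round(wall, 3), "peak_rss": peak_rss})

    write_header = not Path(csv_path).exists()
    with open(csv_path, "a", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=list(rows[0]))
        if write_header:
            writer.writeheader()
        writer.writerows(rows)


if __name__ == "__main__":
    benchmark("B")
```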
3 · Acceptance & decision rule
Deliverables
- [ ] Repro script(s) that run A, B, C and dump timing results (CSV).
- [ ] Results table posted in this thread.
Decision
If C is ≥ 20 % faster than B on the HPC node and scales close to linearly, we proceed with a full MPI rewrite of STEPS blending.
Otherwise we keep the single-process code and explore the Dask route instead.
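To make the decision rule concrete, here is one way to compute it from the measured timings, reading “≥ 20 % faster” as a speedup of at least 1.2 and “close to linearly” as a parallel efficiency near 1.0. The 0.8 efficiency threshold is illustrative and up for discussion.

```python
# Worked example of the decision rule; the thresholds are illustrative.
def evaluate(t_single, t_parallel, n_ranks,
             min_speedup=1.2, min_efficiency=0.8):
    speedup = t_single / t_parallel   # >= 1.2 means ">= 20 % faster"
    efficiency = speedup / n_ranks    # 1.0 would be perfectly linear
    go_mpi = speedup >= min_speedup and efficiency >= min_efficiency
    return speedup, efficiency, go_mpi


# Hypothetical numbers: B takes 300 s single-process, C takes 90 s on 4 ranks.
speedup, efficiency, go_mpi = evaluate(300.0, 90.0, 4)
print(f"speedup={speedup:.2f}, efficiency={efficiency:.2f}, "
      f"proceed with MPI rewrite: {go_mpi}")
```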