Checklist for "stable" landing point

Open rhc54 opened this issue 1 year ago • 0 comments

With the project winding down, it is time to define a stable landing point where we can leave it for those wanting to use it. This means:

  • removing all stale code, particularly components that aren't actively used
  • collapsing frameworks into single code directories where multiple variations are not required (e.g., rtc)
  • reducing complexity wherever possible

We'll keep a checklist here as we work through the process. It will culminate in a new PRRTE v4 release series.

Code pruning and correction

  • [x] Remove "likwid" mapper - never implemented
  • [x] Remove "slurm" and "mpich" personalities - never fully implemented nor used
  • [ ] Collapse "rtc" framework
  • [ ] Collapse "oob" framework - consolidate the messaging system and refactor it
  • [ ] Remove "psched" tool - being replaced by external "dynasched" Python project
  • [ ] Revamp tool system - replace individual tools (e.g., "pterm") with options to "prte" itself to remove conflicts with other packages. This needs design work, since we must retain "prterun" and "prun" as separate cmds
  • [ ] Find a "permanent" solution to the Slurm plm problem - use the new launcher lib if it becomes available; otherwise we may need to remove envar support for the internal "srun" cmd line options

Enhancements

  • [ ] Add PRRTE-internal resiliency support - recover connections to grandparents when parent connection is lost, restore parent connection if/when parent returns, number collective messages to ensure replay when necessary
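
To make the "number collective messages to ensure replay" idea concrete, here is a minimal sketch in Python. It is not PRRTE code: `ReplayLog`, `Receiver`, and every method name are hypothetical, and it assumes in-order delivery on a single connection. The sender keeps each numbered message until it is acknowledged, and the receiver drops anything it has already delivered, so a replay after reconnecting is safe.

```python
from collections import OrderedDict


class ReplayLog:
    """Sender-side log of numbered messages (hypothetical helper, not a PRRTE API)."""

    def __init__(self):
        self._next_seq = 0
        self._unacked = OrderedDict()  # seq -> payload, kept in send order

    def send(self, payload):
        """Assign the next sequence number and remember the payload until acked."""
        seq = self._next_seq
        self._next_seq += 1
        self._unacked[seq] = payload
        return seq, payload

    def ack(self, upto):
        """Forget every logged message with a sequence number <= upto."""
        for seq in [s for s in self._unacked if s <= upto]:
            del self._unacked[seq]

    def replay(self):
        """Return the messages to resend after a reconnect, oldest first."""
        return list(self._unacked.items())


class Receiver:
    """Receiver that ignores duplicates produced by a replay (assumes in-order delivery)."""

    def __init__(self):
        self.last_seq = -1
        self.delivered = []

    def receive(self, seq, payload):
        if seq <= self.last_seq:
            return False  # already delivered; drop the replayed copy
        self.last_seq = seq
        self.delivered.append(payload)
        return True
```

In use, the sender would call `replay()` once the parent (or grandparent) connection is re-established and resend each returned message; the receiver's sequence check makes the resend idempotent.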

Scheduler integration

  • [ ] Resolve question of moving scheduler integration support into separate branch
  • [ ] Complete node extension support for adding nodes on-the-fly
  • [ ] Complete session directive support - e.g., session/job preemption

rhc54 Oct 01 '24 20:10