prrte
Checklist for "stable" landing point
With the project winding down, it is time to define a stable landing point where we can leave it for those wanting to use it. This means:
- removing all stale code, particularly components that aren't actively used
- collapsing frameworks into single code directories where multiple variations are not required (e.g., rtc)
- reducing complexity wherever possible
We'll keep a checklist here as we work through the process, which will culminate in a new PRRTE v4 release series.
Code pruning and correction
- [x] Remove "likwid" mapper - never implemented
- [x] Remove "slurm" and "mpich" personalities - never fully implemented or used
- [ ] Collapse "rtc" framework
- [ ] Collapse "oob" framework - consolidate the messaging system and refactor it
- [ ] Remove "psched" tool - being replaced by external "dynasched" Python project
- [ ] Revamp tool system - replace individual tools (e.g., "pterm") with options to "prte" itself to remove conflicts with other packages. This needs design work, as we must retain "prterun" and "prun" as separate commands
- [ ] Settle on a "permanent" solution to the Slurm plm problem - use the new launcher library if it becomes available; otherwise we may need to remove environment variable support for the internal "srun" command line options
Enhancements
- [ ] Add PRRTE-internal resiliency support - recover connections to grandparents when the parent connection is lost, restore the parent connection if/when the parent returns, and number collective messages to ensure replay when necessary
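
To make the resiliency item concrete, here is a minimal Python sketch of the two ideas it names: falling back to the nearest live ancestor when a parent is lost, and numbering messages so that unacknowledged ones can be replayed without duplicates. This is an illustration only, not PRRTE's design or code. Every name in it (`ReplayBuffer`, `Receiver`, `fallback_route`, the daemon labels) is hypothetical, and it assumes in-order delivery on each connection.

```python
from collections import OrderedDict


class ReplayBuffer:
    """Sender side (hypothetical): number messages and keep them until acked."""

    def __init__(self):
        self.next_seq = 0
        self.unacked = OrderedDict()  # seq -> payload, in send order

    def send(self, payload):
        """Assign the next sequence number and retain the message."""
        seq = self.next_seq
        self.next_seq += 1
        self.unacked[seq] = payload
        return seq, payload

    def ack(self, seq):
        """Drop every retained message up to and including seq."""
        for s in [s for s in self.unacked if s <= seq]:
            del self.unacked[s]

    def replay(self):
        """Messages to resend once a connection is re-established."""
        return list(self.unacked.items())


class Receiver:
    """Receiver side (hypothetical): deliver each sequence number at most once."""

    def __init__(self):
        self.last_delivered = -1
        self.delivered = []

    def receive(self, seq, payload):
        if seq <= self.last_delivered:
            return False  # duplicate produced by a replay; ignore it
        self.last_delivered = seq
        self.delivered.append(payload)
        return True


def fallback_route(parent_of, node, alive):
    """Return the nearest live ancestor of node (parent, grandparent, ...), or None."""
    cur = parent_of.get(node)
    while cur is not None and cur not in alive:
        cur = parent_of.get(cur)
    return cur


if __name__ == "__main__":
    # Hypothetical three-level daemon tree: d2 -> d1 -> d0
    tree = {"d2": "d1", "d1": "d0", "d0": None}
    print(fallback_route(tree, "d2", alive={"d0"}))  # parent d1 is down
```

In this sketch, restoring the parent connection when the parent returns is simply a matter of calling `fallback_route` again with the updated `alive` set, then resending whatever `replay()` returns over the new route.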
Scheduler integration
- [ ] Resolve question of moving scheduler integration support into a separate branch
- [ ] Complete node extension support for adding nodes on-the-fly
- [ ] Complete session directive support - e.g., session/job preemption