Adding Pathways pause-resume to the default trainer
Pause-resume via checkpoints is a form of elasticity that is much easier to implement and maintain than full elasticity.
Description
Pause-resume wraps the training loop in an outer loop whose retry logic is provided by pathwaysutils. If an error occurs because a slice was lost, training is paused until a replacement (or restarted) slice has rejoined. Once all required slices have rejoined the Pathways cluster, the train loop is restarted. This restart triggers the existing checkpoint restore, so training continues from the most recent checkpoint.
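The control flow described above can be sketched roughly as follows. This is a minimal illustration and not the pathwaysutils implementation: `SliceLostError`, `run_with_pause_resume`, `all_slices_available`, `max_retries`, and `poll_interval_s` are hypothetical names chosen for this example, and the real retry logic lives in pathwaysutils.

```python
import time


class SliceLostError(RuntimeError):
    """Hypothetical stand-in for the error surfaced when a slice is lost."""


def run_with_pause_resume(train_loop, all_slices_available,
                          max_retries=3, poll_interval_s=0.0):
    """Run train_loop, pausing and restarting it whenever a slice is lost.

    train_loop: callable that restores from the latest checkpoint and trains.
    all_slices_available: callable returning True once every slice has rejoined.
    (Both are assumptions for this sketch, not pathwaysutils APIs.)
    """
    retries = 0
    while True:
        try:
            # Each call starts the train loop from scratch, which is what
            # triggers the existing checkpoint restore.
            return train_loop()
        except SliceLostError:
            retries += 1
            if retries > max_retries:
                raise  # give up after too many lost-slice failures
            # Pause: block until all required slices have rejoined.
            while not all_slices_available():
                time.sleep(poll_interval_s)
```

Errors other than the lost-slice error propagate unchanged in this sketch, so unrelated failures are not retried.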
Tests
Existing tests confirm that this change does not break the non-Pathways configuration. The pause-resume logic itself is covered by tests in pathwaysutils. The change has been tested manually with triggered errors, and training recovered successfully. Nightly integration tests are also being monitored by the Pathways on Cloud team.
Checklist
Before submitting this PR, please make sure (put X in square brackets):
- [X] I have performed a self-review of my code.
- [X] I have necessary comments in my code, particularly in hard-to-understand areas.
- [X] I have run end-to-end tests and provided workload links above if applicable.
- [X] I have made or will make corresponding changes to the doc if needed.
This PR has been automatically marked as stale because it has not had recent activity. It will be closed soon if no further activity occurs. Thank you for your contributions.
This PR was closed because it has been inactive for a while. Please reopen it if you are still working on it.