Adding Pathways pause-resume to the default trainer

Open lukebaumann opened this issue 4 months ago • 3 comments

Pause-resume elasticity via checkpoints is a form of elasticity that is much easier to implement and maintain.

Description

Pause-Resume runs the training loop within its own loop with retry logic provided by pathwaysutils. In the case of an error due to a lost slice, the training is paused until a replacement (or restarted) slice has rejoined. After all of the slices needed have rejoined the Pathways cluster, the train loop is restarted. This restart triggers the existing checkpoint restore.
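
The control flow described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the pathwaysutils implementation: `SliceLostError`, `run_with_pause_resume`, `all_slices_ready`, and the `max_retries` and `poll_interval_s` parameters are all hypothetical names invented for this example. The sketch assumes that `train_loop` restores from the latest checkpoint each time it starts, which is what makes a plain restart sufficient.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class SliceLostError(RuntimeError):
    """Hypothetical stand-in for the error raised when a slice is lost."""


def run_with_pause_resume(
    train_loop: Callable[[], T],
    all_slices_ready: Callable[[], bool],
    max_retries: int = 3,
    poll_interval_s: float = 0.0,
) -> T:
    """Run train_loop, pausing and restarting it when a slice is lost.

    train_loop is assumed to restore from the latest checkpoint on entry.
    all_slices_ready is a hypothetical probe that returns True once every
    required slice has rejoined the cluster.
    """
    attempts = 0
    while True:
        try:
            return train_loop()
        except SliceLostError:
            attempts += 1
            if attempts > max_retries:
                # Give up and surface the original error to the caller.
                raise
            # Pause: wait until the lost slice has been replaced or restarted.
            while not all_slices_ready():
                time.sleep(poll_interval_s)
            # Resume: the next iteration restarts the train loop, which
            # triggers the existing checkpoint restore.
```

Keeping the retry logic outside the train loop means the loop itself needs no elasticity-specific code; it only has to be restartable from a checkpoint.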

Tests

Existing tests will confirm that this does not break the non-Pathways configuration. The logic within pause-resume is covered by tests in pathwaysutils. Manual testing with triggered errors showed that training recovers successfully. Nightly integration tests are monitored by the Pathways on Cloud team.

Checklist

Before submitting this PR, please make sure (put X in square brackets):

  • [X] I have performed a self-review of my code.
  • [X] I have necessary comments in my code, particularly in hard-to-understand areas.
  • [X] I have run end-to-end tests and provided workload links above if applicable.
  • [X] I have made or will make corresponding changes to the doc if needed.

lukebaumann avatar Sep 03 '25 18:09 lukebaumann

This PR has been automatically marked as stale because it has not had recent activity. It will be closed soon if no further activity occurs. Thank you for your contributions.

github-actions[bot] avatar Oct 04 '25 16:10 github-actions[bot]

This PR was closed because it has been inactive for a while. Please reopen it if you are still working on it.

github-actions[bot] avatar Oct 11 '25 16:10 github-actions[bot]

This PR has been automatically marked as stale because it has not had recent activity. It will be closed soon if no further activity occurs. Thank you for your contributions.

github-actions[bot] avatar Nov 24 '25 16:11 github-actions[bot]

This PR was closed because it has been inactive for a while. Please reopen it if you are still working on it.

github-actions[bot] avatar Dec 01 '25 16:12 github-actions[bot]