Luke Baumann

Results 8 issues of Luke Baumann

If there is an upgrade available for the cluster, additional text is printed to stderr ``` * - There is an upgrade available for your cluster(s). To upgrade nodes to...

I tried making it `os.path.join(os.path.expanduser("~"), "directory")` in one of my earlier commits, but that doesn't work. Specifically, someone made a mistake earlier on by not expanding `~` with [`os.path.expanduser`](https://docs.python.org/3/library/os.path.html#os.path.expanduser) so...
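A minimal sketch of the difference between the two path constructions mentioned above. The helper names `literal_path` and `expanded_path` are hypothetical and only for illustration; `"directory"` is the placeholder from the snippet.

```python
import os


def literal_path(name: str) -> str:
    # No expansion: "~" is kept as an ordinary path component, so this
    # refers to a directory literally named "~" under the working directory.
    return os.path.join("~", name)


def expanded_path(name: str) -> str:
    # os.path.expanduser replaces a leading "~" with the user's home directory.
    return os.path.join(os.path.expanduser("~"), name)


if __name__ == "__main__":
    print(literal_path("directory"))
    print(expanded_path("directory"))
```

The two results point at different locations, so switching an existing code path from one form to the other changes where files are read and written.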

bug

At the moment, the command is [hard-coded](https://github.com/AI-Hypercomputer/maxtext/blob/1e4d513ad70dd4074d975a9f7936295008d4b900/benchmarks/maxtext_xpk_runner.py#L412C20-L412C33) to execute the `MaxText.train` module. If the module to execute could be configured in the `WorkloadConfig`, that would allow for additional trainers...
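One way the request could look, as a rough sketch only. `TrainerConfig`, `trainer_module`, `extra_args`, and `build_train_command` are hypothetical names invented for this illustration; they are not fields or functions of the real `WorkloadConfig` or runner.

```python
import dataclasses
import shlex


@dataclasses.dataclass
class TrainerConfig:
    # Hypothetical stand-in for a configurable field on WorkloadConfig.
    # The default keeps today's behavior of running MaxText.train.
    trainer_module: str = "MaxText.train"
    # Hypothetical: extra command-line arguments passed to the module.
    extra_args: tuple = ()


def build_train_command(config: TrainerConfig) -> str:
    # Build "python3 -m <module> <args...>" with shell-safe quoting.
    return shlex.join(["python3", "-m", config.trainer_module, *config.extra_args])


if __name__ == "__main__":
    print(build_train_command(TrainerConfig()))
    print(build_train_command(TrainerConfig(trainer_module="my_pkg.custom_train")))
```

Defaulting the field to `MaxText.train` would leave existing workloads unchanged while letting other trainers opt in.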

feature request

Pause-Resume via Checkpoints Elasticity is a version of elasticity that is much easier to implement and maintain.

# Description

Pause-Resume runs the training loop within its own loop with retry logic provided...
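The general idea of a training loop wrapped in an outer retry loop can be sketched as follows. This is not the pathwaysutils API: `pause_resume_sketch` is a hypothetical decorator, an in-memory dict stands in for a real checkpoint, and `RuntimeError` stands in for whatever error signals an interruption.

```python
import functools
import logging
import time


def pause_resume_sketch(max_retries: int = 3, wait_seconds: float = 0.0):
    """Hypothetical decorator: rerun the wrapped function after a failure."""

    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            for attempt in range(max_retries + 1):
                try:
                    return fn(*args, **kwargs)
                except RuntimeError as err:
                    if attempt == max_retries:
                        raise  # Out of retries: surface the failure.
                    logging.warning("Attempt %d failed: %s", attempt + 1, err)
                    time.sleep(wait_seconds)  # Pause before resuming.

        return wrapper

    return decorator


@pause_resume_sketch(max_retries=3)
def train(total_steps: int, checkpoint: dict, fail_at: set) -> int:
    # Resume from the last saved step instead of starting over.
    step = checkpoint["step"]
    while step < total_steps:
        if step in fail_at:
            fail_at.discard(step)  # Simulated interruption, happens once.
            raise RuntimeError(f"simulated interruption at step {step}")
        step += 1
        checkpoint["step"] = step  # "Save" a checkpoint after every step.
    return step


if __name__ == "__main__":
    ckpt = {"step": 0}
    print(train(10, ckpt, {3, 7}))
```

Each retry calls `train` again from the top, so the function itself is responsible for reading the checkpoint and continuing from the saved step.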

stale

The new version of pathwaysutils better supports multiple JAX and Python versions and includes several updates to the orbax handler and elasticity utilities.

* Added the changes to the jobset for elastic training to enable elasticity.
* Added changes to `launch_trainer` so that the `pause_resume` decorator is used.
* Set `logging.raiseExceptions=True` so that...

Changes the container name for Pathways Workers to `pathways-worker` so that the workload container name is not the same. This follows the same convention that `pathways-proxy` and `pathways-rm` follow and...