prerequisite: auto save checkpoint
Motivation
Jobs can fail at any stage. If users do not save their checkpoints and result files properly, they have to retrain their models from scratch. A folder that is saved automatically would make this data easier for users to manage.
Design
- two prerequisites:
  - mount-autosave-folder: mount cluster storage into the task container
  - auto-save: copy the checkpoint folder to the mount point
- cluster requirement:
  - cluster-level storage
- schema
prerequisites:
  - type: data
    name: mount-autosave-folder
    plugin: com.microsoft.pai.runtimeplugin.cmd
    mountPoint: /mnt/auto-save
    callbacks:
      - event: taskStarts
        commands:
          - mount cluster-data-path ${mountPoint}
  - type: autosave
    requires: mount-autosave-folder
    name: autosave-results
    plugin: com.microsoft.pai.runtimeplugin.cmd
    path: /user/experiments/check-points
    callbacks:
      - event: taskStarts
        commands:
          - apt update
          # -y avoids the interactive confirmation prompt inside the container
          - apt install -y rsync
      - event: taskEnds
        commands:
          # -a copies the folder recursively; plain rsync skips directories
          - rsync -a ${path} /mnt/auto-save
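
The copy step behind the taskEnds callback could look roughly like the sketch below. It is only an illustration: `autosave_copy` is a hypothetical helper name, not part of the PAI runtime, and the fallback to `cp` when rsync is missing is an assumption rather than something the schema specifies. The sketch uses `rsync -a` so that the checkpoint folder is copied recursively.

```shell
#!/bin/sh
# Hypothetical helper (not part of the PAI runtime): copy a checkpoint
# folder into the mounted auto-save folder.
autosave_copy() {
    autosave_src="$1"    # checkpoint folder, e.g. the schema's ${path}
    autosave_dest="$2"   # mounted folder, e.g. /mnt/auto-save

    # Nothing to save if the task never created the checkpoint folder.
    if [ ! -d "$autosave_src" ]; then
        echo "autosave: $autosave_src does not exist, skipping" >&2
        return 0
    fi

    mkdir -p "$autosave_dest" || return 1

    if command -v rsync >/dev/null 2>&1; then
        # -a recurses and preserves modes and timestamps. With no trailing
        # slash on the source, the folder itself lands inside the destination.
        rsync -a "$autosave_src" "$autosave_dest"
    else
        # Assumed fallback for images that ship without rsync.
        cp -Rp "$autosave_src" "$autosave_dest"
    fi
}
```

Calling `autosave_copy /user/experiments/check-points /mnt/auto-save` would then leave the data under `/mnt/auto-save/check-points`.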
Some comments:
- Currently only `taskSucceeds` is supported. We should provide a `taskEnds` callback.
- We should figure out a way for users to specify a path for auto-saving.
- What if there are multiple tasks within one taskrole?
- What if there are multiple retries?
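
One possible direction for the last two questions, offered only as a sketch: give every task and every retry its own subfolder under the mount point, so that copies from different tasks or attempts do not overwrite each other. The helper name `autosave_dest` and the `taskrole/task-N/retry-N` layout are assumptions made for illustration; how the runtime would supply the task role name, task index, and retry count is left open here.

```shell
#!/bin/sh
# Hypothetical helper: build a destination folder that is unique per
# task role, task index, and retry count. The layout is an assumption.
autosave_dest() {
    # $1 = mount point, $2 = task role name, $3 = task index, $4 = retry count
    printf '%s/%s/task-%s/retry-%s\n' "$1" "$2" "$3" "$4"
}

# Example: second task (index 1) of the "worker" role, first attempt.
autosave_dest /mnt/auto-save worker 1 0
# prints /mnt/auto-save/worker/task-1/retry-0
```

The resulting folder would need to be created with `mkdir -p` before rsync copies into it, since rsync does not create missing parent directories by default.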