prerequisite: auto save checkpoint

Open suiguoxin opened this issue 4 years ago • 1 comments

Motivation

Jobs can fail at any stage. If users do not save their checkpoints and result files properly, they have to retrain their models from scratch. A folder that is saved automatically would make this data easier for users to manage.

Design

  • two prerequisites
    • mount-autosave-folder: mount cluster storage into the task container
    • auto-save: copy the checkpoint folder to the mount point
  • cluster requirement:
    • cluster level storage
  • schema
prerequisites:
  - type: data
    name: mount-autosave-folder
    plugin: com.microsoft.pai.runtimeplugin.cmd
    mountPoint: /mnt/auto-save
    callbacks:
      - event: taskStarts
        commands:
          - mount cluster-data-path ${mountPoint}
  - type: autosave
    requires: mount-autosave-folder
    name: autosave-results
    plugin: com.microsoft.pai.runtimeplugin.cmd
    path: /user/experiments/check-points
    callbacks:
      - event: taskStarts
        commands:
          - apt update
          - apt install -y rsync
      - event: taskEnds
        commands:
          - rsync -a ${path} /mnt/auto-save
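
As a rough illustration of what the taskEnds step is meant to do, here is a minimal Python sketch of copying the checkpoint folder into the mounted folder. The function name `autosave` and its parameters are hypothetical and not part of the proposal or of any OpenPAI API; the sketch assumes the storage is already mounted and that the folder keeps its own name under the mount point.

```python
import shutil
from pathlib import Path


def autosave(checkpoint_dir: str, mount_point: str) -> Path:
    """Copy checkpoint_dir into mount_point, keeping the folder name.

    Hypothetical helper for illustration only; the proposal itself
    uses an rsync command in the taskEnds callback.
    """
    src = Path(checkpoint_dir)
    dest = Path(mount_point) / src.name
    if not src.is_dir():
        # Nothing to save, e.g. the task failed before writing a checkpoint.
        return dest
    # dirs_exist_ok (Python 3.8+) lets a later copy overwrite an earlier one.
    shutil.copytree(src, dest, dirs_exist_ok=True)
    return dest
```

With the paths from the schema, `autosave("/user/experiments/check-points", "/mnt/auto-save")` would place the files under `/mnt/auto-save/check-points`.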

suiguoxin avatar Mar 02 '21 09:03 suiguoxin

Some comments:

  • Currently, only taskSucceeds is supported. We should provide a taskEnds callback.
  • We should figure out a way for users to specify a path for auto-saving.
  • What if there are multiple tasks within one taskrole?
  • What if there are multiple retries?
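
One possible way to approach the last two questions, sketched under assumptions: give every task instance and every retry its own subfolder under the mount point, so that copies do not overwrite each other. The function `autosave_destination`, its parameters, and the folder layout are all hypothetical; how the runtime would actually expose the task index and retry count is left open here.

```python
from pathlib import PurePosixPath


def autosave_destination(mount_point, job_name, task_role, task_index, retry_count):
    """Build a per-task, per-retry destination path (hypothetical layout)."""
    return (
        PurePosixPath(mount_point)
        / job_name
        / f"{task_role}-{task_index}"   # separates tasks within one taskrole
        / f"retry-{retry_count}"        # separates retries of the same task
    )
```

For example, task 0 of a `worker` taskrole on its second retry in a job named `job1` would be saved to `/mnt/auto-save/job1/worker-0/retry-2`. Whether older retries should be kept or pruned is a separate design decision.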

hzy46 avatar Mar 03 '21 03:03 hzy46