Extend prerequisite field in job protocol
Motivation
OpenPAI protocol supports specifying prerequisites (e.g. dockerimage, data, and script) and then referencing them in a taskrole. There are some limitations in the current version:
- The current solution only supports parameter (e.g. `uri`) definitions. This is enough for the most frequently used `dockerimage`, because docker plays the role of the corresponding runtime executor. However, it is too limited for other types. For example, in the job config below, commands have to be injected into every taskrole to make the data ready.
- It is not well organized (object-oriented). The `wget` command is an action on the data, but it cannot be placed together with the data definition.
- It is hard to reuse. If the data is referenced by more than one taskrole, the `wget` commands must be injected everywhere.
- It is hard to use. Users (or the marketplace plugin) must modify more than one place to enable a data item.
- A taskrole can only reference one data (or script, output) item.
```yaml
prerequisites:
  - name: covid_data
    type: data
    uri:
      - https://x.x.x/yyy.zip # data uri
  - name: default_image
    type: dockerimage
    uri: 'openpai/standard:python_3.6-pytorch_1.2.0-gpu'
taskRoles:
  taskrole:
    dockerImage: default_image
    data: covid_data
    commands:
      - mkdir -p /data/covid19/data/
      - cd /data/covid19/data/
      - 'wget <% $data.uri[0] %>'
      - export DATA_DIR=/data/covid19/data/
```
Goal
- Propose protocol updates and a runtime plugin to make `prerequisites` well organized and object-oriented. Besides defining parameters, it also supports real functions (callbacks on specific events).
- Make reuse of data, script, and other `prerequisites` easy and flexible.
- Better support management of datasets (via marketplace).
- Enable advanced features (e.g. cluster datasets, data-location-aware scheduling) in the future.
- Stay backward compatible (this version should support previous configs).
Proposal
- Support callbacks in `prerequisites`.
- Allow a taskrole to reference a list of `prerequisites`.
- Provide a runtime plugin implementation.
Examples
- Defining actions with data:
  - Different data requires different pre-commands, e.g. wget, nfs mount, azure blob download.
```yaml
prerequisites:
  - name: covid_data
    type: data
    callbacks:
      - event: containerStart
        commands:
          - mkdir -p /data/covid19/data/
          - cd /data/covid19/data/
          - 'wget https://x.x.x/yyy.zip'
          - export DATA_DIR=/data/covid19/data/
taskRoles:
  taskrole:
    dockerImage: default_image
    prerequisites:
      - covid_data
    commands:
      - ls $DATA_DIR
```
- Setting up environment/script prerequisites (see the sketch after this list):
  - Some should run before the script starts, e.g. installing pip packages or the OpenPAI SDK.
  - Some should run after the script completes / succeeds / fails, e.g. log uploading, reports, alerts.
  - Enhanced debuggability, such as starting a Jupyter server (or ssh) for 30 minutes after the user's command fails.
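A minimal sketch of such a script prerequisite under the spec below; the prerequisite name, the installed package, and the `upload_logs.sh` script are placeholders for illustration only:

```yaml
prerequisites:
  - name: setup-and-report        # hypothetical name
    type: script
    callbacks:
      - event: containerStart
        commands:
          - pip install --user requests    # example: install a pip package before the script starts
      - event: containerExit
        commands:
          - bash ./upload_logs.sh          # hypothetical log-upload step after the command finishes
```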
Full Spec:
```yaml
prerequisites:
  - name: string # required, unique name to find the prerequisite (from local or marketplace)
    type: "dockerimage | script | data | output" # for survey purposes (except dockerimage), not used by the backend
    plugin: string # optional, the executor to handle the current prerequisite; default is com.microsoft.pai.runtimeplugin.cmd, or docker (for dockerimage)
    require: [] # optional, other prerequisites on which the current one depends
    callbacks: # optional, commands to run on events
      - event: "containerStart | containerExit"
        commands: # commands translated by the plugin
          - string # shell commands for com.microsoft.pai.runtimeplugin.cmd
          - string # TODO: other commands (e.g. python) for other plugins
    failurePolicy: "ignore | fail" # optional, same default as runtime plugin
    # plugin-specific properties
    uri: string | array # optional, kept for backward compatibility (it was required before)
    key1: value1 # referred to by <% this.parameters.key1 %>
    key2: value2 # TODO: inheritable from required ones
taskRoles:
  taskrole:
    prerequisites: # optional, required items will be automatically parsed and inserted
      - prerequisite-1 # on containerStart, will execute in order
      - prerequisite-2 # on containerExit, will execute in reverse order
```
Each of the prerequisites will be handled in a way like:

```python
for prerequisite in prerequisites:
    plugin(**prerequisite)
```
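A slightly fuller sketch of this dispatch, assuming a hypothetical `plugins` registry that maps a plugin name to a handler returning shell commands for a given event; it illustrates the ordering rule from the spec (containerStart callbacks run in listing order, containerExit callbacks in reverse order):

```python
# Hypothetical dispatch sketch; the real runtime implementation may differ.
DEFAULT_PLUGIN = "com.microsoft.pai.runtimeplugin.cmd"

def collect_commands(prerequisites, plugins):
    """Translate prerequisites into pre/post command lists for a task container."""
    pre_commands, post_commands = [], []
    for prerequisite in prerequisites:  # containerStart: listing order
        handler = plugins[prerequisite.get("plugin", DEFAULT_PLUGIN)]
        pre_commands += handler(prerequisite, "containerStart")
    for prerequisite in reversed(prerequisites):  # containerExit: reverse order
        handler = plugins[prerequisite.get("plugin", DEFAULT_PLUGIN)]
        post_commands += handler(prerequisite, "containerExit")
    return pre_commands, post_commands
```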
Update of this issue:
- Will sync with @mydmdm to determine the detailed schema. This will be a P1 item for the `v1.5.0` release.
- In the OpenPAI runtime, handle `prerequisites` in the following way: (1) use the existing mechanism to inject commands into `preCommands` and `postCommands`; (2) don't show an explicit plugin definition in the user's job protocol; (3) make sure parameters and secrets work in `prerequisites` (see the sketch after this list).
- We can add a retry policy and a failure policy. This can be left to future work.
- Sync with @Binyang2014 about support for cluster data.
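For item (3), here is a minimal sketch of how `<% $parameters.x %>` / `<% $secrets.y %>` placeholders in prerequisite commands could be substituted; the function name and regex are assumptions for illustration, not the actual rest-server/runtime templating code:

```python
import re

# Matches placeholders such as <% $parameters.x %> or <% $secrets.y %>.
PLACEHOLDER = re.compile(r"<%\s*\$(parameters|secrets)\.(\w+)\s*%>")

def render_command(command, parameters, secrets):
    """Replace parameter/secret placeholders in a single prerequisite command."""
    def replace(match):
        source = parameters if match.group(1) == "parameters" else secrets
        return str(source[match.group(2)])
    return PLACEHOLDER.sub(replace, command)

# Example: render_command("echo <% $parameters.x %>", {"x": "111"}, {"y": "222"}) -> "echo 111"
```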
Questions about this:
- Can all runtime plugins be merged into `prerequisites`? If so, we could deprecate the runtime `extras` field and make `prerequisites` the official way.
- Maybe I can treat `prerequisites` as a high-level widget that can use any implementation it wants, such as a runtime plugin, command injection, or communication with other k8s/PAI services. There is also another proposal for a service plugin: https://github.com/microsoft/pai/issues/4254 — is it related?
I think they are used for different scenarios. A prerequisite is a requirement of a job: without it, the job usually fails, and it should be sharable among users. A runtime plugin is used to extend the job protocol's functions: it can be nice-to-have (not necessary) and can be a personal config (not sharable). There are some overlaps; maybe we can move some officially supported runtime plugins into prerequisites.
I have updated the full spec in the main body, and here are some examples, including:
- [ ] P0 execute essential commands
- [ ] P0 configure storage
- [ ] P0 configure data based on storage
- [ ] P1 functional plugin support (e.g. ssh)
```yaml
prerequisites:
  - name: install-pai-copy
    type: script # indicates the purpose; for statistical analysis only, not used by the backend (except dockerimage)
    plugin: com.microsoft.pai.runtimeplugin.cmd # default plugin if not specified
    callbacks:
      - event: containerStart
        commands:
          - xxx # commands to set up nodejs
          - npm install -g @swordfaith/pai_copy
    failurePolicy: ignore/fail
  - name: covid_data
    type: data
    plugin: com.microsoft.pai.runtimeplugin.cmd
    callbacks:
      - event: containerStart
        commands:
          - mkdir -p /data/covid19/data/
          - cd /data/covid19/data/
          - 'wget https://x.x.x/yyy.zip'
          - export DATA_DIR=/data/covid19/data/
  - name: nfs-storage-1
    type: storage # indicates the purpose; for statistical analysis only, not used by the backend
    plugin: com.microsoft.pai.rest.storage # handled by the REST server
    config: nfsconfig # special arguments for the storage plugin only
    mountPoint: /mnt/nfs-storage-1
  - name: mnist-data
    type: data
    plugin: com.microsoft.pai.runtimeplugin.cmd
    require:
      - nfs-storage-1 # also inherits parameters like mountPoint
    callbacks:
      - event: containerStart
        commands:
          - export MNIST_DIR=<% this.mountPoint %>/mnist
  - name: output-dir
    type: output
    plugin: com.microsoft.pai.runtimeplugin.cmd
    require:
      - nfs-storage-1
    callbacks:
      - event: containerStart
        commands:
          - export OUTPUT_DIR=/tmp/output
      - event: containerExit
        commands:
          - 'if [ -z ${OUTPUT_DIR+x} ]; then'
          - echo "Not found OUTPUT_DIR environ"
          - else
          - pai_copy upload paiuploadtest //
          - fi
  - name: enable-ssh
    type: script
    plugin: com.microsoft.pai.runtimeplugin.ssh
    jobssh: true
    publicKeys: # optional; if not specified, only public keys in user.extensions.sshKeys will be added
      - ... # public keys
taskRoles:
  taskrole:
    dockerImage: default_image
    prerequisites:
      - mnist-data # required items will be automatically parsed and added in the backend
      - output-dir
```
(TBD) Test Cases for v1.5.0 release
- test cmd prerequisites
```yaml
protocolVersion: 2
name: pre1
type: job
jobRetryCount: 0
prerequisites:
  - type: script
    name: justecho
    plugin: com.microsoft.pai.runtimeplugin.cmd
    callbacks:
      - event: taskStarts
        commands:
          - echo 111
          - echo 222
      - event: taskSucceeds
        commands:
          - echo 333
          - echo 444
  - type: dockerimage
    uri: 'openpai/standard:python_3.6-pytorch_1.2.0-gpu'
    name: docker_image_0
taskRoles:
  taskrole:
    instances: 1
    completion:
      minFailedInstances: 1
    taskRetryCount: 0
    prerequisites:
      - justecho
    dockerImage: docker_image_0
    resourcePerInstance:
      gpu: 1
      cpu: 3
      memoryMB: 29065
    commands:
      - sleep 0s
defaults:
  virtualCluster: default
extras:
  com.microsoft.pai.runtimeplugin:
    - plugin: ssh
      parameters:
        jobssh: true
```
expected runtime.log:
```
[Mon Feb 1 02:54:48 UTC 2021] [openpai-runtime] [INFO] Starting to exec precommands
[Mon Feb 1 02:54:48 UTC 2021] [openpai-runtime] 111
[Mon Feb 1 02:54:48 UTC 2021] [openpai-runtime] 222
[Mon Feb 1 02:54:48 UTC 2021] [openpai-runtime] [package_cache] Skip installation of group ssh.
[Mon Feb 1 02:54:48 UTC 2021] [openpai-runtime] [INFO] start ssh service
[Mon Feb 1 02:54:48 UTC 2021] [openpai-runtime] * Restarting OpenBSD Secure Shell server sshd
[Mon Feb 1 02:54:48 UTC 2021] [openpai-runtime] ...done.
[Mon Feb 1 02:54:48 UTC 2021] [openpai-runtime] [INFO] Precommands finished
[Mon Feb 1 02:54:48 UTC 2021] [openpai-runtime] [INFO] USER COMMAND START
[Mon Feb 1 02:54:58 UTC 2021] [openpai-runtime] [INFO] USER COMMAND END
[Mon Feb 1 02:54:58 UTC 2021] [openpai-runtime] 333
[Mon Feb 1 02:54:58 UTC 2021] [openpai-runtime] 444
```
- test multiple prerequisites
```yaml
protocolVersion: 2
name: pre2
type: job
jobRetryCount: 0
prerequisites:
  - type: script
    name: justecho_first
    plugin: com.microsoft.pai.runtimeplugin.cmd
    callbacks:
      - event: taskStarts
        commands:
          - echo 111
      - event: taskSucceeds
        commands:
          - echo 222
  - type: script
    name: justecho_later
    plugin: com.microsoft.pai.runtimeplugin.cmd
    callbacks:
      - event: taskStarts
        commands:
          - echo aaa
      - event: taskSucceeds
        commands:
          - echo bbb
  - type: dockerimage
    uri: 'openpai/standard:python_3.6-pytorch_1.2.0-gpu'
    name: docker_image_0
taskRoles:
  taskrole:
    instances: 1
    completion:
      minFailedInstances: 1
    taskRetryCount: 0
    prerequisites:
      - justecho_first
      - justecho_later
    dockerImage: docker_image_0
    resourcePerInstance:
      gpu: 1
      cpu: 3
      memoryMB: 29065
    commands:
      - sleep 0s
defaults:
  virtualCluster: default
extras:
  com.microsoft.pai.runtimeplugin:
    - plugin: ssh
      parameters:
        jobssh: true
```
expected runtime.log:
```
[Mon Feb 1 02:55:16 UTC 2021] [openpai-runtime] [INFO] Starting to exec precommands
[Mon Feb 1 02:55:16 UTC 2021] [openpai-runtime] 111
[Mon Feb 1 02:55:16 UTC 2021] [openpai-runtime] aaa
[Mon Feb 1 02:55:16 UTC 2021] [openpai-runtime] [package_cache] Skip installation of group ssh.
[Mon Feb 1 02:55:16 UTC 2021] [openpai-runtime] [INFO] start ssh service
[Mon Feb 1 02:55:16 UTC 2021] [openpai-runtime] * Restarting OpenBSD Secure Shell server sshd
[Mon Feb 1 02:55:16 UTC 2021] [openpai-runtime] ...done.
[Mon Feb 1 02:55:16 UTC 2021] [openpai-runtime] [INFO] Precommands finished
[Mon Feb 1 02:55:16 UTC 2021] [openpai-runtime] [INFO] USER COMMAND START
[Mon Feb 1 02:55:27 UTC 2021] [openpai-runtime] [INFO] USER COMMAND END
[Mon Feb 1 02:55:27 UTC 2021] [openpai-runtime] bbb
[Mon Feb 1 02:55:27 UTC 2021] [openpai-runtime] 222
```
- test wrong config 1 (an error is expected):
```yaml
protocolVersion: 2
name: pre3
type: job
jobRetryCount: 0
prerequisites:
  - type: script
    name: justecho
    plugin: com.microsoft.pai.runtimeplugin.cmd
    callbacks:
      - event: taskStarts
        commands:
          - echo 111
          - echo 222
      - event: taskSucceeds
        commands:
          - echo 333
          - echo 444
  - type: dockerimage
    uri: 'openpai/standard:python_3.6-pytorch_1.2.0-gpu'
    name: docker_image_0
taskRoles:
  taskrole:
    instances: 1
    completion:
      minFailedInstances: 1
    taskRetryCount: 0
    prerequisites:
      - justecho_wrong
    dockerImage: docker_image_0
    resourcePerInstance:
      gpu: 0
      cpu: 1
      memoryMB: 9672
    commands:
      - sleep 0s
defaults:
  virtualCluster: default
extras:
  com.microsoft.pai.runtimeplugin:
    - plugin: ssh
      parameters:
        jobssh: true
```
- test backward-compatibility
This job should work:
```yaml
protocolVersion: 2
name: covid-chestxray-dataset_88170423
description: >
  COVID-19 chest X-ray image data collection
  It is to build a public open dataset of chest X-ray and CT images of patients
  which are positive or suspected of COVID-19 or other viral and bacterial
  pneumonias
  ([MERS](https://en.wikipedia.org/wiki/Middle_East_respiratory_syndrome),
  [SARS](https://en.wikipedia.org/wiki/Severe_acute_respiratory_syndrome), and
  [ARDS](https://en.wikipedia.org/wiki/Acute_respiratory_distress_syndrome).).
contributor: OpenPAI
type: job
jobRetryCount: 0
prerequisites:
  - name: covid-chestxray-dataset
    type: data
    uri:
      - 'https://github.com/ieee8023/covid-chestxray-dataset.git'
  - name: default_image
    type: dockerimage
    uri: 'openpai/standard:python_3.6-pytorch_1.4.0-gpu'
taskRoles:
  taskrole:
    instances: 1
    completion:
      minFailedInstances: 1
    taskRetryCount: 0
    dockerImage: default_image
    data: covid-chestxray-dataset
    resourcePerInstance:
      cpu: 3
      memoryMB: 29065
      gpu: 1
    commands:
      - 'git clone <% $data.uri[0] %>'
defaults:
  virtualCluster: default
```
- test data prerequisite
```yaml
protocolVersion: 2
name: pre1_f7a15a5c
type: job
jobRetryCount: 0
prerequisites:
  - type: script
    name: install-git
    plugin: com.microsoft.pai.runtimeplugin.cmd
    callbacks:
      - event: taskStarts
        commands:
          - apt update
          - apt install -y git
  - type: data
    name: covid-19-data
    plugin: com.microsoft.pai.runtimeplugin.cmd
    callbacks:
      - event: taskStarts
        commands:
          - mkdir -p /dataset/covid-19
          - >-
            git clone https://github.com/ieee8023/covid-chestxray-dataset.git
            /dataset/covid-19
  - type: dockerimage
    uri: 'ubuntu:18.04'
    name: docker_image_0
taskRoles:
  taskrole:
    instances: 1
    completion:
      minFailedInstances: 1
    taskRetryCount: 0
    prerequisites:
      - install-git
      - covid-19-data
    dockerImage: docker_image_0
    resourcePerInstance:
      gpu: 1
      cpu: 3
      memoryMB: 29065
    commands:
      - ls -la /dataset/covid-19
defaults:
  virtualCluster: default
extras:
  com.microsoft.pai.runtimeplugin:
    - plugin: ssh
      parameters:
        jobssh: true
```
expected: the data is successfully listed
- test parameters and secrets
```yaml
protocolVersion: 2
name: pre_secret_parameters
type: job
jobRetryCount: 0
prerequisites:
  - type: script
    name: justecho
    plugin: com.microsoft.pai.runtimeplugin.cmd
    callbacks:
      - event: taskStarts
        commands:
          - echo <% $parameters.x %>
          - echo <% $secrets.y %>
  - type: dockerimage
    uri: 'openpai/standard:python_3.6-pytorch_1.2.0-gpu'
    name: docker_image_0
parameters:
  x: '111'
taskRoles:
  taskrole:
    prerequisites: [justecho]
    instances: 1
    completion:
      minFailedInstances: 1
    taskRetryCount: 0
    dockerImage: docker_image_0
    resourcePerInstance:
      gpu: 1
      cpu: 3
      memoryMB: 29065
    commands:
      - sleep 0s
secrets:
  'y': '222'
defaults:
  virtualCluster: default
extras:
  com.microsoft.pai.runtimeplugin:
    - plugin: ssh
      parameters:
        jobssh: true
```
expected: 111 and 222 appear in runtime.log
After discussion, the interaction between prerequisites and marketplace could be:
In a taskrole, prerequisites referenced from the marketplace can be used directly; there is no need to include them in the job-level prerequisites.
Use `marketplace://data/xxx` and `marketplace://script/xxx` to indicate data and script:
```yaml
taskRoles:
  taskrole:
    prerequisites: ["marketplace://data/mnist"]
```
In the job protocol, prerequisites can use `require` to indicate required items. The required items can come from the job protocol or from the marketplace.
```yaml
prerequisites:
  - type: script
    name: copy_data
    require: ["marketplace://script/pai_copy"]
    plugin: com.microsoft.pai.runtimeplugin.cmd
    callbacks:
      - event: containerStarts
        commands:
          - pai_copy data
```
The rest-server reads all items in a taskrole's prerequisites and converts them to their real definitions by calling the marketplace's API, as sketched below. They can be treated as job add-ons and saved in the database.
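A minimal sketch of that resolution step, assuming a hypothetical marketplace endpoint (`/api/v2/items/<type>/<name>`) and response schema; the actual rest-server and marketplace APIs may differ:

```python
import requests

# Hypothetical marketplace endpoint; the real API path and response schema may differ.
MARKETPLACE_API = "https://marketplace.example.com/api/v2/items"

def resolve_prerequisite(reference, job_prerequisites):
    """Resolve a taskrole prerequisite reference to a full prerequisite definition."""
    if reference.startswith("marketplace://"):
        # e.g. "marketplace://data/mnist" -> type "data", name "mnist"
        item_type, name = reference[len("marketplace://"):].split("/", 1)
        response = requests.get(f"{MARKETPLACE_API}/{item_type}/{name}")
        response.raise_for_status()
        return response.json()  # assumed to already be a prerequisite definition
    # Otherwise the reference points to a prerequisite defined in the job protocol itself.
    return next(p for p in job_prerequisites if p["name"] == reference)
```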