[User Story] Dataset: integrate data prerequisite into marketplace and job submission page
Motivation
#5145 extended the prerequisite field, but users can currently only use and share prerequisites in job YAML. We can add UI support for prerequisites, especially for data prerequisites. This issue explains how users create and use a data prerequisite in the cluster. With this feature, cluster users can easily share datasets with each other, and it may also benefit future features such as dataset caching and optimization.
Explanation
How do users create a dataset in the cluster?
Dataset item that doesn't need a PVC storage
The user should create a dataset item in the marketplace. A dataset item in the marketplace consists of a prerequisite spec plus other miscellaneous info (e.g. title, usage).
If the dataset is simply downloaded from the Internet, it should have the following spec:
```yaml
name: mnist
type: data
plugin: com.microsoft.pai.runtimeplugin.cmd
callbacks:
  - event: taskStarts
    commands:
      - wget "<.....>" -O /dataset/mnist/<...>
```
Dataset item that needs a PVC storage
If the dataset is already saved in a PVC, it should have the following spec:
```yaml
name: imagenet
type: data
plugin: com.microsoft.pai.runtimeplugin.cmd
requireStorages:
  - name: confignfs
    mountPath: /mnt/confignfs
callbacks:
  - event: taskStarts
    commands:
      - mkdir -p /dataset
      - ln -s "/mnt/confignfs/users/mine/presaved-imagenet" "/dataset/imagenet"
```
Here we define a new field, `requireStorages`. It shares the same spec as the current storage implementation. If this prerequisite is included in a job, we should merge the storages listed here with the job's other PVC storages.
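As a rough sketch of that merge (not the actual rest-server code; `StorageEntry`, `mergeStorages`, and the sample values are made up for illustration), de-duplicating by storage name keeps a storage requested both by the job and by a prerequisite from being mounted twice:

```typescript
// Hypothetical merge of a prerequisite's requireStorages with the storages
// already requested by the job, de-duplicated by storage name; the job's own
// entries take precedence.
interface StorageEntry {
  name: string;        // PVC / storage name, e.g. "confignfs"
  mountPath?: string;  // optional mount path
}

function mergeStorages(
  jobStorages: StorageEntry[],
  prerequisiteStorages: StorageEntry[],
): StorageEntry[] {
  const merged = new Map<string, StorageEntry>();
  for (const s of jobStorages) {
    merged.set(s.name, s);
  }
  for (const s of prerequisiteStorages) {
    if (!merged.has(s.name)) {
      merged.set(s.name, s);
    }
  }
  return [...merged.values()];
}

// Example: the job and the imagenet prerequisite both ask for "confignfs",
// so the merged list contains a single confignfs entry.
console.log(
  mergeStorages(
    [{ name: 'confignfs', mountPath: '/mnt/confignfs' }],
    [{ name: 'confignfs', mountPath: '/mnt/confignfs' }],
  ),
);
```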
How do users use a dataset in the cluster?
On marketplace pages
On the marketplace pages, users can click Use to create an empty job with the corresponding dataset.

On job submission page
On the job submission page, users can select their datasets via a field under the task role section.

How to represent a marketplace prerequisite in job YAML?
A dataset prerequisite from the marketplace will be expressed as `marketplace://prerequisites/itemId/<item-id>`. One example is as follows:
```yaml
taskRoles:
  taskrole:
    prerequisites: ["marketplace://prerequisites/itemId/1"]
    instances: 1
    completion:
      minFailedInstances: 1
    taskRetryCount: 0
    dockerImage: docker_image_0
    resourcePerInstance:
      gpu: 1
      cpu: 3
      memoryMB: 29065
    commands:
      - echo 1
```
The webportal page should provide a link to the marketplace for the user.
After submission, the rest-server will parse these marketplace items and pass them to the database controller and the runtime. The rest-server should also take care of `requireStorages` and merge it carefully with the job's other storage spec (a rough resolution sketch follows the error list below).
The following errors can happen in the rest-server:
- The user does not have permission to one of the `requireStorages`. (Do we need to hide such datasets from users? Currently this is hard to implement; maybe leave it to future work.)
- The corresponding prerequisite cannot be found.
- The call to the marketplace API fails.
- The prerequisite item fails to download.
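For illustration, here is a hedged sketch of how the rest-server might resolve a `marketplace://prerequisites/itemId/<item-id>` reference and map failures onto the errors above. `resolvePrerequisite`, `PrerequisiteError`, and the injected `fetchItem` call are hypothetical names, not the real rest-server or marketplace API:

```typescript
// Hypothetical resolution of a marketplace prerequisite reference; every
// failure is mapped to one of the error cases listed above.
class PrerequisiteError extends Error {
  constructor(
    public readonly code:
      | 'NO_STORAGE_PERMISSION'
      | 'ITEM_NOT_FOUND'
      | 'MARKETPLACE_UNAVAILABLE'
      | 'DOWNLOAD_FAILED', // download failures would be mapped the same way
    message: string,
  ) {
    super(message);
  }
}

const MARKETPLACE_URI = /^marketplace:\/\/prerequisites\/itemId\/(.+)$/;

interface MarketplaceItem {
  spec: string; // prerequisite spec in YAML
  requireStorages?: { name: string }[];
}

async function resolvePrerequisite(
  uri: string,
  // injected call to the marketplace service; may throw on network errors
  fetchItem: (itemId: string) => Promise<MarketplaceItem | null>,
  userStorages: Set<string>, // storages the submitting user may access
): Promise<string> {
  const match = MARKETPLACE_URI.exec(uri);
  if (!match) {
    throw new PrerequisiteError('ITEM_NOT_FOUND', `Not a marketplace prerequisite URI: ${uri}`);
  }
  let item: MarketplaceItem | null;
  try {
    item = await fetchItem(match[1]);
  } catch (err) {
    throw new PrerequisiteError('MARKETPLACE_UNAVAILABLE', `Failed to call marketplace API: ${err}`);
  }
  if (!item) {
    throw new PrerequisiteError('ITEM_NOT_FOUND', `Prerequisite item ${match[1]} not found`);
  }
  // the user must have permission to every storage the prerequisite requires
  for (const storage of item.requireStorages ?? []) {
    if (!userStorages.has(storage.name)) {
      throw new PrerequisiteError('NO_STORAGE_PERMISSION', `No permission to storage ${storage.name}`);
    }
  }
  return item.spec;
}
```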
Other features
We can enable URLs like `http(s)://` in addition to `marketplace://`. This brings a lot of convenience and is easy to implement; see the sketch after the example below.
```yaml
taskRoles:
  taskrole:
    prerequisites: ["https://raw.githubusercontent.com/microsoft/pai/master/contrib/xxxx.yml"]
    instances: 1
    completion:
      minFailedInstances: 1
    taskRetryCount: 0
    dockerImage: docker_image_0
    resourcePerInstance:
      gpu: 1
      cpu: 3
      memoryMB: 29065
    commands:
      - echo 1
```
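A minimal sketch of the http(s) case, assuming Node 18+ (global `fetch`) and the `js-yaml` package; `fetchHttpPrerequisite` is a made-up name and validation of the parsed spec is omitted:

```typescript
import * as yaml from 'js-yaml';

// Hypothetical: download a prerequisite spec from an http(s) URL and parse it
// as YAML; the caller is responsible for validating the resulting object.
async function fetchHttpPrerequisite(url: string): Promise<unknown> {
  const response = await fetch(url); // global fetch, available since Node 18
  if (!response.ok) {
    throw new Error(`Failed to download prerequisite ${url}: HTTP ${response.status}`);
  }
  return yaml.load(await response.text());
}
```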
Implementation
- [ ] marketplace: provide backend and ui
- [ ] add field `requireStorages`
- [ ] webportal: ui change and validation
- [ ] rest-server: validation, and parse prerequisites
- [ ] database-controller
- [ ] runtime: run prerequisites
Main Design Ideas
- One prerequisite is mainly made up of `name`, `plugin`, `plugin_params`, `type`, and `require`.
- To extend the usage, we introduce `template_variables` in `plugin_params`. However, when users `require` a prerequisite, they must specify all of its `template_variables`. This is a simplification of the mechanism, which ensures that we never require a prerequisite with unfulfilled template variables (a rendering sketch follows this list).
- Put marketplace prerequisites into the `extras` field to keep the other parts of the job config cluster-agnostic.
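To make the `template_variables` rule concrete, here is a hedged sketch (hypothetical helper, naive mustache-style substitution over the serialized `plugin_params`; not the actual implementation) of validating and rendering a required prerequisite:

```typescript
// Hypothetical rendering of a prerequisite: every declared template variable
// must be fulfilled, then {{ var }} placeholders in plugin_params are replaced.
interface PluginParams {
  template_variables?: { name: string }[];
  [key: string]: unknown; // callbacks, storage config, etc.
}

interface PrerequisiteSpec {
  name: string;
  plugin: string;
  plugin_params: PluginParams;
}

function renderPrerequisite(
  spec: PrerequisiteSpec,
  values: Record<string, string>,
): PrerequisiteSpec {
  // reject a require that leaves any declared variable unfulfilled
  for (const v of spec.plugin_params.template_variables ?? []) {
    if (!(v.name in values)) {
      throw new Error(`template variable "${v.name}" of ${spec.name} is not fulfilled`);
    }
  }
  // naive substitution over the serialized params (a sketch, not robust to
  // values that would need JSON escaping)
  const params = { ...spec.plugin_params };
  delete params.template_variables;
  const rendered = JSON.stringify(params).replace(
    /\{\{\s*(\w+)\s*\}\}/g,
    (placeholder, name: string) => values[name] ?? placeholder,
  );
  return { ...spec, plugin_params: JSON.parse(rendered) };
}

// Example: fulfilling the mnist prerequisite's dataPath variable.
const mnist: PrerequisiteSpec = {
  name: 'mnist',
  plugin: 'cmd',
  plugin_params: {
    callbacks: [{ event: 'taskStarts', commands: ['mkdir -p {{ dataPath }}'] }],
    template_variables: [{ name: 'dataPath' }],
  },
};
console.log(JSON.stringify(renderPrerequisite(mnist, { dataPath: '/dataset/mnist' })));
```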
Examples
Set up an mnist dataset
```yaml
# in marketplace
- name: install_wget
  plugin: cmd
  plugin_params:
    callbacks:
      - event: taskStarts
        commands:
          - "apt update"
          - "apt install -y wget"
```
```yaml
# in marketplace
- name: mnist
  require:
    - name: marketplace://name/install_wget
  plugin: cmd
  plugin_params:
    callbacks:
      - event: taskStarts
        commands:
          - mkdir -p {{ dataPath }}
          - wget http://1.2.3.4/mnist.zip -O {{ dataPath }}/mnist.zip
          - cd {{ dataPath }}
          - unzip mnist.zip
    template_variables:
      - name: dataPath
```
```yaml
# in job
prerequisites:
  - type: dockerimage
    uri: 'openpai/standard:python_3.6-pytorch_1.2.0-gpu'
    name: docker_image_0
taskRoles:
  taskrole:
    instances: 1
    completion:
      minFailedInstances: 1
    taskRetryCount: 0
    prerequisites:
      - mnist
    dockerImage: docker_image_0
    resourcePerInstance:
      gpu: 1
      cpu: 3
      memoryMB: 29065
    commands:
      - sleep 0s
defaults:
  virtualCluster: default
extras:
  reference_prerequisites:
    - name: mnist
      require:
        - name: marketplace://name/mnist
          template_variables:
            dataPath: /dataset/mnist
```
Set up an imagenet dataset
```yaml
# set up an imagenet dataset
# in marketplace
- name: confignfs_pvc
  plugin: pvc_storage
  plugin_params:
    name: confignfs
    mountPath: '{{ mountPath }}'
    template_variables:
      - name: mountPath

# in marketplace
- name: imagenet
  require: # if the required prerequisite has template_variables, all the template_variables MUST be fulfilled.
    - name: marketplace://name/confignfs_pvc
      template_variables:
        mountPath: /mnt/confignfs_pvc
  plugin: cmd
  plugin_params:
    callbacks:
      - event: taskStarts
        commands:
          - mkdir -p {{ dataPath }}
          - cp -r /mnt/confignfs_pvc/imagenet/* {{ dataPath }}
    template_variables:
      - name: dataPath

# in marketplace
- name: imagenet_only_validation
  require: # if the required prerequisite has template_variables, all the template_variables MUST be fulfilled.
    - name: marketplace://name/confignfs_pvc
      template_variables:
        mountPath: /mnt/confignfs_pvc
  plugin: cmd
  plugin_params:
    callbacks:
      - event: taskStarts
        commands:
          - mkdir -p {{ dataPath }}
          - cp -r /mnt/confignfs_pvc/imagenet/validation/* {{ dataPath }}
    template_variables:
      - name: dataPath
```
```yaml
# in job
prerequisites:
  - type: dockerimage
    uri: 'openpai/standard:python_3.6-pytorch_1.2.0-gpu'
    name: docker_image_0
taskRoles:
  taskrole:
    instances: 1
    completion:
      minFailedInstances: 1
    taskRetryCount: 0
    prerequisites:
      - imagenet
    dockerImage: docker_image_0
    resourcePerInstance:
      gpu: 1
      cpu: 3
      memoryMB: 29065
    commands:
      - sleep 0s
defaults:
  virtualCluster: default
extras:
  reference_prerequisites:
    - name: imagenet
      require:
        - name: marketplace://name/imagenet
          template_variables:
            dataPath: /dataset/imagenet
```
Set up a debug hook
```yaml
# set up a debug hook
# in marketplace
- name: debug_hook
  plugin: cmd
  plugin_params:
    callbacks:
      - event: taskFails
        commands:
          - echo "will sleep for {{ min }} minutes for debugging..."
          - sleep {{ min }}m
    template_variables:
      - name: min
```
```yaml
# in job
prerequisites:
  - type: dockerimage
    uri: 'openpai/standard:python_3.6-pytorch_1.2.0-gpu'
    name: docker_image_0
taskRoles:
  taskrole:
    instances: 1
    completion:
      minFailedInstances: 1
    taskRetryCount: 0
    prerequisites:
      - debug_hook
    dockerImage: docker_image_0
    resourcePerInstance:
      gpu: 1
      cpu: 3
      memoryMB: 29065
    commands:
      - sleep 0s
defaults:
  virtualCluster: default
extras:
  reference_prerequisites:
    - name: debug_hook
      require:
        - name: marketplace://name/debug_hook
          template_variables:
            min: 30
```