Inform the user when job status changes
Motivation
Some jobs may fail unexpectedly. If users are informed when a job fails, they can handle the issue in time instead of checking their job status all the time.
The same applies to other status changes.
Background:
- This feature should be configured per job rather than per user
- The trigger event can be
  - Start Running
  - Failed
  - Succeeded
  - WaitingTooLong
- Notifications can be sent to users by email / webportal, and this should be configurable
  - some notification methods may be unavailable if the admin has not enabled them
Design
Workflow:
- Part 1: Job configuration
- Part 2: monitor & trigger corresponding alerts
- Part 3: alerts handling
Part 1: Job / User configuration
- What alerts to send is configured per job:
  - enable this feature in the job protocol, in the field extras->jobStatusChangeNotification
  - support further modification after jobs get submitted

      extras:
        com.microsoft.pai.runtimeplugin:
          - plugin: ssh
            parameters:
              jobssh: true
        hivedScheduler:
          taskRoles:
            taskrole:
              skuNum: 1
              skuType: GENERIC-WORKER
        jobStatusChangeNotification:
          running: false
          succeeded: true
          stopped: false
          failed: true
          retried: false
- How to send alerts is configured per user: set in the user-profile page, where the user can select from these available actions:
  - [ ] webportal notification
  - [ ] email notification: this action is only available when 1) the user email is not empty; 2) the email-user action is available in alert-handler

      {
        "username": "gusui",
        "email": "[email protected]",
        "extension": {
          "sshKeys": [],
          "getJobStatusChangeNotificationBy": {
            "email": true,
            "webportal": true
          }
        }
      }
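The two configuration sources in Part 1 could be combined roughly as follows. This is a minimal sketch, not the rest-server implementation: the function names `enabled_events` and `available_channels` and the `email_action_enabled` flag are hypothetical, and treating a missing flag as "do not notify" is an assumption.

```python
from typing import Any, Dict, List

# Event names taken from the jobStatusChangeNotification example above.
EVENTS = ("running", "succeeded", "stopped", "failed", "retried")


def enabled_events(job_config: Dict[str, Any]) -> List[str]:
    """Return the events whose notification flag is true in the job's extras."""
    extras = job_config.get("extras") or {}
    flags = extras.get("jobStatusChangeNotification") or {}
    # Assumption: a missing or non-boolean flag means "do not notify".
    return [event for event in EVENTS if flags.get(event) is True]


def available_channels(user: Dict[str, Any], email_action_enabled: bool) -> List[str]:
    """Return the channels the user selected that are actually usable."""
    extension = user.get("extension") or {}
    prefs = extension.get("getJobStatusChangeNotificationBy") or {}
    channels = []
    if prefs.get("webportal"):
        channels.append("webportal")
    # Email requires a non-empty address and the email-user action
    # being enabled in alert-handler (passed in here as a plain flag).
    if prefs.get("email") and user.get("email") and email_action_enabled:
        channels.append("email")
    return channels
```

With the example job above, `enabled_events` would select only `succeeded` and `failed`; a user without an email address would be left with the webportal channel even if email is ticked.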
Part 2: monitor & trigger corresponding alerts
Design with DB
- add the following columns to the framework table in DB:
  notificationAtRunning | BOOLEAN
  notifiedAtRunning | BOOLEAN
  notificationAtSucceeded | BOOLEAN
  notifiedAtSucceeded | BOOLEAN
  notificationAtFailed | BOOLEAN
  notifiedAtFailed | BOOLEAN
  notificationAtRetried | BOOLEAN
  notifiedAtRetried | INTEGER (the Nth retry has been notified)
  these columns are used to save job config & alerts state
- add a container framework-status-notification-poller in alert-manager, which
  - watches the framework table in DB
  - sends the alert when the config is enabled & the alert has not been sent
  - updates the framework table after successfully sending alerts to alert-manager
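The poller's decision rule ("config is enabled & the alert has not been sent") can be sketched as below. This only illustrates the column semantics listed above: the row is modeled as a plain dict, and the `state` and `retries` inputs, as well as both function names, are hypothetical stand-ins for whatever the framework table actually exposes.

```python
from typing import Any, Dict, List


def pending_notifications(row: Dict[str, Any], state: str, retries: int = 0) -> List[str]:
    """Return the events that still need an alert for one framework row."""
    pending = []
    for event in ("Running", "Succeeded", "Failed"):
        wanted = row.get(f"notificationAt{event}")
        already_sent = row.get(f"notifiedAt{event}")
        if state == event and wanted and not already_sent:
            pending.append(event)
    # notifiedAtRetried is an integer: the last retry number already notified.
    if row.get("notificationAtRetried") and retries > (row.get("notifiedAtRetried") or 0):
        pending.append("Retried")
    return pending


def mark_notified(row: Dict[str, Any], event: str, retries: int = 0) -> Dict[str, Any]:
    """Return a copy of the row updated after an alert was sent successfully."""
    updated = dict(row)
    if event == "Retried":
        updated["notifiedAtRetried"] = retries
    else:
        updated[f"notifiedAt{event}"] = True
    return updated
```

Updating the row only after the alert has been handed over means a crash between the two steps can cause a duplicate notification but not a lost one, which seems the safer failure mode for this feature.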
Part 3: alerts handling
- src/alert-manager/deploy/alert-manager-configmap.yaml: add a new receiver and a new route
- alert-handler: add an email template inform-user-job-status-change
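For the new route to match, the poller has to send an alert whose labels identify it as a job status change. A possible payload is sketched below, assuming the alert is posted as one element of the JSON list accepted by Alertmanager's v2 alerts endpoint. The alert name `PAIJobStatusChange`, the label set, and the `build_alert` helper are illustrative assumptions, not names defined in this proposal.

```python
import json
from datetime import datetime, timezone
from typing import Any, Dict, Optional


def build_alert(job_name: str, username: str, state: str,
                now: Optional[datetime] = None) -> Dict[str, Any]:
    """Build one alert object in the shape Alertmanager expects."""
    now = now or datetime.now(timezone.utc)
    return {
        "labels": {
            "alertname": "PAIJobStatusChange",  # hypothetical; the route would match on it
            "job_name": job_name,
            "username": username,
            "state": state,
            "severity": "warn",
        },
        "annotations": {
            "summary": f"Job {job_name} changed status to {state}",
        },
        # Timezone-aware ISO 8601 timestamp, e.g. 2021-01-01T00:00:00+00:00
        "startsAt": now.isoformat(),
    }


if __name__ == "__main__":
    # The alerts endpoint takes a JSON list of alert objects.
    print(json.dumps([build_alert("demo_job", "demo_user", "failed")], indent=2))
```

Carrying `username` and `state` as labels would let the receiver pick the recipient and let the email template render the status without another lookup.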
Archive
Problem with watching the k8s Framework object: it is not stable and may miss certain status changes
Proposal 1
- add a container framework-status-notification-poller in alert-manager, which
  - watches frameworks through the k8s API
  - sends the alert when a framework fails & this feature is enabled
Proposal 2
- Job Exporter:
  - add a container, which monitors Framework status & exports the following metric:
    - job_status(job_name="demo_job", username="demo_user", virtual_cluster="nni", status="running", pai_service_name="job-exporter", notification_status=["succeed", "failed"])
    - value: 0/1/2/3 (waiting/running/succeed/failed)
    - export the value only at job status changes instead of exporting with a fixed frequency
- Benefits: useful for averageWaitingTime, failingRate, & other statistics
- Prometheus:

      - alert: PAIJobFSucceed
        expr: max by (job_name) job_status{notification_status.includes("succeed")}[1m] == 2
        labels:
          severity: warn
      # - alert: PAIJobFailed
      #   expr: changes(job_status{failureNotification="true"}[1m]) > 0 and job_status == 3
      #   labels:
      #     severity: warn
The notification is also useful when the job succeeds. Maybe the feature could be rephrased as: notifying the user when a job completes.
related to https://github.com/microsoft/pai/issues/2235 https://github.com/microsoft/pai/issues/3640
Work Items
Part 1: Job / User configuration
- [ ] Job configuration
  - [x] add a field in protocol: extra -> informAtFailure P0
    - [x] no need to update protocol since extra is a free object
      - (webportal & rest-server schema validation: refer to #5277)
      - (protocol: refer to https://github.com/microsoft/openpai-protocol/pull/9/files)
    - [x] parse the field when submitting a job #5491
  - [ ] webportal job submission page design P2
- [ ] User Configuration
  - [ ] add a field in user info extra -> getJobStatusChangeNotificationBy P1
  - [ ] webportal user profile page P2
- [ ] Filter alerts by users P1
  - [ ] add a REST API #5407
  - [ ] webportal change: use REST API instead of alert-manager API
Part 2: monitor & trigger corresponding alerts P0
- [x] add the following columns to the framework table in DB to save job config & alerts state: P0 #5277
- [ ] add a container job-status-change-notification in alert-manager, which P0 #5493
  - watches the framework table in DB
  - sends the alert when the config is enabled & the alert has not been sent
  - updates the framework table after successfully sending alerts to alert-manager
Part 3: alerts handling P0 #5492
- [x] src/alert-manager/deploy/alert-manager-configmap.yaml: add a new receiver and a new route
- [x] alert-handler: add an email template job-status-change-alert
Doc
- [ ] admin manual & user manual P0 #5494