
Inform the user when a job's status changes

Open suiguoxin opened this issue 4 years ago • 3 comments

Motivation

Some jobs may fail unexpectedly. If the users can be informed when the jobs fail, the users will be able to handle the issue in time. This will save the users from checking their job status all the time.

The same applies to other status changes.

Background:

  • This feature should be configured per job rather than per user
  • The trigger event can be
    • Start Running
    • Failed
    • Succeeded
    • WaitingTooLong
  • Notifications can be sent to users by email or webportal, and the method should be configurable
    • some notification methods may be unavailable if the admin has not enabled them

Design

Workflow:

  • Part 1: Job / User configuration
  • Part 2: monitor & trigger corresponding alerts
  • Part 3: alerts handling

Part 1: Job / User configuration

  • Which alerts to send is configured per job:
    • enable this feature in the job protocol, in the field extras -> jobStatusChangeNotification
    • support further modification after the job has been submitted
extras:
  com.microsoft.pai.runtimeplugin:
    - plugin: ssh
      parameters:
        jobssh: true
  hivedScheduler:
    taskRoles:
      taskrole:
        skuNum: 1
        skuType: GENERIC-WORKER
  jobStatusChangeNotification: 
    running: false
    succeeded: true
    stopped: false
    failed: true
    retried: false
  • How alerts are sent is configured per user, on the user-profile page. The user can select from these available actions:
    • [ ] webportal notification
    • [ ] email notification: this action is only available when 1) the user's email is not empty and 2) the email-user action is available in alert-handler
{
    "username": "gusui",
    "email": "[email protected]",
    "extension": {
        "sshKeys": [],
        "getJobStatusChangeNotificationBy": {
            "email": true,
            "webportal": true
        }
    }
}
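
The two configuration levels above can be combined as in the following minimal sketch. It assumes the job's extras and the user info are already parsed into dictionaries shaped like the examples above. The function name `resolve_channels` and the `admin_enabled` set are hypothetical and only illustrate the rule that a channel is used when the job asks for the event, the user selected the channel, and the admin has enabled it.

```python
from typing import Dict, List, Set


def resolve_channels(
    job_extras: Dict,
    user_info: Dict,
    event: str,
    admin_enabled: Set[str],
) -> List[str]:
    """Return the channels on which `event` should be notified (hypothetical helper)."""
    # Per-job switch: which status changes trigger a notification.
    wanted = job_extras.get("jobStatusChangeNotification", {})
    if not wanted.get(event, False):
        return []

    # Per-user preference: how notifications are delivered.
    prefs = user_info.get("extension", {}).get("getJobStatusChangeNotificationBy", {})
    channels = []
    for channel in ("webportal", "email"):
        if not prefs.get(channel, False):
            continue  # the user did not select this channel
        if channel not in admin_enabled:
            continue  # the admin has not enabled this channel
        if channel == "email" and not user_info.get("email"):
            continue  # email needs a non-empty address
        channels.append(channel)
    return channels
```

With `failed: true` in the job and both channels selected by the user, a failure would be reported on both channels. If the user's email is empty, only the webportal notification remains.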

Part 2: monitor & trigger corresponding alerts

design with DB

  • add the following columns to the framework table in DB:
notificationAtRunning | BOOLEAN
notifiedAtRunning | BOOLEAN
notificationAtSucceeded | BOOLEAN
notifiedAtSucceeded | BOOLEAN
notificationAtFailed | BOOLEAN
notifiedAtFailed | BOOLEAN
notificationAtRetried | BOOLEAN
notifiedAtRetried | INTEGER (the Nth retry has been notified)

These columns store the job config (notificationAt*) and the alert state (notifiedAt*).

  • add a container framework-status-notification-poller in alert-manager, which
    • watches the framework table in the DB
    • sends the alert when the config is enabled & the alert has not been sent yet
    • updates the framework table after successfully sending the alert to alert-manager
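
The poller's decision logic can be sketched as follows. This is a minimal illustration, not the actual implementation. The notificationAt* / notifiedAt* column names come from the list above, while the `state` and `retries` fields, the state names, and the `send_alert` callback are assumptions made for the example. Rows are plain dictionaries so the sketch runs without a database.

```python
from typing import Callable, Dict, List

# Maps a framework state to its (config column, state column) pair.
# Column names follow the design above; the state names are assumed.
BOOLEAN_EVENTS = {
    "Running": ("notificationAtRunning", "notifiedAtRunning"),
    "Succeeded": ("notificationAtSucceeded", "notifiedAtSucceeded"),
    "Failed": ("notificationAtFailed", "notifiedAtFailed"),
}


def pending_events(row: Dict) -> List[str]:
    """Return the events that are enabled for this row but not yet notified."""
    pending = []
    cols = BOOLEAN_EVENTS.get(row["state"])  # `state` is a hypothetical column
    if cols is not None:
        enabled_col, notified_col = cols
        if row[enabled_col] and not row[notified_col]:
            pending.append(row["state"])
    # notifiedAtRetried holds the number of the last retry already notified.
    if row["notificationAtRetried"] and row["retries"] > row["notifiedAtRetried"]:
        pending.append("Retried")
    return pending


def mark_notified(row: Dict, event: str) -> None:
    """Record that `event` was delivered so the next poll skips it."""
    if event == "Retried":
        row["notifiedAtRetried"] = row["retries"]
    else:
        row[BOOLEAN_EVENTS[event][1]] = True


def poll_once(rows: List[Dict], send_alert: Callable[[Dict, str], bool]) -> int:
    """Send every pending alert once; update state only after a successful send."""
    sent = 0
    for row in rows:
        for event in pending_events(row):
            if send_alert(row, event):
                mark_notified(row, event)
                sent += 1
    return sent
```

Because the state columns are only updated after a successful send, a failed delivery is retried on the next poll, and a delivered alert is never sent twice.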

Part 3: alerts handling

  • src/alert-manager/deploy/alert-manager-configmap.yaml: add a new receiver and a new route
  • alert-handler: add an email template inform-user-job-status-change
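
For the new route to pick up these alerts, the poller has to attach labels that the route can match on. The sketch below builds a request body in the shape accepted by Alertmanager's v2 API (a JSON list of alerts posted to /api/v2/alerts). The alert name `PAIJobStatusChange` and the label names `job_name`, `username`, and `state` are hypothetical. The issue does not specify them, and no network call is made here.

```python
import json
from datetime import datetime, timezone
from typing import Dict, List


def build_alert(job_name: str, username: str, state: str) -> List[Dict]:
    """Build the JSON body for one job-status-change alert (illustrative only)."""
    return [
        {
            "labels": {
                # Hypothetical label names; the new route would match on them.
                "alertname": "PAIJobStatusChange",
                "job_name": job_name,
                "username": username,
                "state": state,
            },
            "annotations": {
                "summary": f"Job {job_name} changed status to {state}",
            },
            # RFC 3339 timestamp in UTC.
            "startsAt": datetime.now(timezone.utc).isoformat(),
        }
    ]


if __name__ == "__main__":
    print(json.dumps(build_alert("demo_job", "demo_user", "Failed"), indent=2))
```

A route matching on the alert name could then forward the alert to the new receiver, and the email template could use the labels to fill in the job name and state.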

Archive

Problem with watching the k8s Framework object: it is not stable and may miss certain status changes

Proposal 1

  • add a container framework-status-notification-poller in alert-manager, which
    • watches the framework through the k8s API
    • sends the alert when a framework fails & this feature is enabled

Proposal 2

  • Job Exporter:

    • add a container, which monitors the Framework status & exports the following metric:
      • job_status(job_name="demo_job", username="demo_user",virtual_cluster="nni", status="running", pai_service_name="job-exporter", notification_status=["succeed", "failed"])
      • value: 0/1/2/3 (waiting/running/succeed/failed)
      • export value only at job status changes instead of exporting with a fixed frequency
  • Benefits: useful for averageWaitingTime, failingRate, & other statistics

  • Prometheus:

- alert: PAIJobSucceed
  expr: max by (job_name) job_status{notification_status.includes("succeed")}[1m] == 2
  labels: 
    severity: warn
# - alert: PAIJobFailed
#   expr: changes(job_status{failureNotification="true"}[1m]) > 0 and job_status == 3
#   labels: 
#     severity: warn

suiguoxin, Mar 02 '21 08:03

The notification is also useful when the job succeeds. Maybe the feature could be rephrased as: notifying the user when a job completes.

fanyangCS, Mar 11 '21 08:03

Related to https://github.com/microsoft/pai/issues/2235 and https://github.com/microsoft/pai/issues/3640

fanyangCS, Mar 11 '21 08:03

Work Items

Part 1: Job / User configuration

  • [ ] Job configuration

    • [x] add a field in the protocol: extra -> informAtFailure P0
      • [x] no need to update the protocol since extra is a free-form object
        • (webportal & rest-server schema validation: refer to #5277)
        • (protocol: refer to https://github.com/microsoft/openpai-protocol/pull/9/files)
      • [x] parse the field when submitting a job #5491
    • [ ] webportal job submission page design P2
  • [ ] User Configuration

    • [ ] add a field in user info extra -> getJobStatusChangeNotificationBy P1
    • [ ] webportal user profile page P2
  • [ ] Filter alerts by users P1

    • [ ] add a Rest API #5407
    • [ ] webportal change: use REST API instead of alert-manager API

Part 2: monitor & trigger corresponding alerts P0

  • [x] add columns to the framework table in DB to save job config & alert state P0 #5277
  • [ ] add a container job-status-change-notification in alert-manager, which P0 #5493
    • watches the framework table in the DB
    • sends the alert when the config is enabled & the alert has not been sent yet
    • updates the framework table after successfully sending the alert to alert-manager

Part 3: alerts handling P0 #5492

  • [x] src/alert-manager/deploy/alert-manager-configmap.yaml: add a new receiver and a new route
  • [x] alert-handler: add an email template job-status-change-alert

Doc

  • [ ] admin manual & user manual P0 #5494

suiguoxin, May 12 '21 07:05