Monitoring of Cron Jobs

Open nutrina opened this issue 2 years ago • 0 comments

User Story:

As a developer, I want to ensure that our cron jobs are running successfully, so that all data processing and export operations are performed as expected and on schedule.

Acceptance Criteria

GIVEN a cron job is scheduled to run on a recurring basis, WHEN a run fails, THEN a PagerDuty (PD) alarm is triggered to alert the team.
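To make the criterion concrete, here is a minimal sketch of what "trigger a PD alarm" could look like from inside a job wrapper, using the PagerDuty Events API v2 `enqueue` endpoint. The function names, the `dedup_key` format, and the default `source` value are assumptions for illustration, not existing Passport code; the routing key would come from the PD service integration.

```python
import json
import urllib.request

# PagerDuty Events API v2 endpoint for trigger/acknowledge/resolve events.
PAGERDUTY_EVENTS_URL = "https://events.pagerduty.com/v2/enqueue"


def build_trigger_event(routing_key, job_name, error, source="cron"):
    """Build a 'trigger' event for a failed cron run (hypothetical helper)."""
    return {
        "routing_key": routing_key,
        "event_action": "trigger",
        # Assumed dedup scheme: repeated failures of one job update one alert.
        "dedup_key": f"cron-failure-{job_name}",
        "payload": {
            # PagerDuty limits the summary to 1024 characters.
            "summary": f"Cron job '{job_name}' failed: {error}"[:1024],
            "source": source,
            "severity": "error",
        },
    }


def send_event(event, timeout=10):
    """POST the event to PagerDuty and return the HTTP status code."""
    request = urllib.request.Request(
        PAGERDUTY_EVENTS_URL,
        data=json.dumps(event).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status
```

A job wrapper could call `send_event(build_trigger_event(key, "data_export", str(exc)))` in its exception handler, where `data_export` is a placeholder job name. Note that this only covers runs that start and then fail; a job that never starts needs a heartbeat-style check.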

Documentation and Monitoring Overview: Keep a current, complete record of all monitoring configurations and their statuses. With each task or update, update the Notion page on Passport Monitors & PD Alarms so that it reflects the latest state and gives an overview of the monitoring topic. This keeps monitoring practices transparent and ensures continuity.

Product & Design Links:

Tech Details:

  • Monitoring Options:
    • Uptime Robot for Cron Job Monitoring: Consider using Uptime Robot's cron job monitoring feature, which can check the heartbeat of cron jobs by receiving pings. For details, visit Uptime Robot Cron Job Monitoring.
    • Custom Scheduled Lambda: Create a Lambda function that runs at scheduled intervals to check the status of cron jobs and reports failures. This function could query logs or job status in a database to determine job health.
    • Enhanced Current Monitoring: Continue with the current method of monitoring scheduled task starts and completions but improve the reliability and accuracy of the setup to ensure all tasks are captured and errors are reported.
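Whichever option is chosen, the core check is the same: a job is unhealthy if it has not reported success within its expected interval. The sketch below shows that check as it might run inside the custom scheduled Lambda. The job names, the grace period, and the shape of the inputs are hypothetical; in practice the last-success timestamps would be read from logs or a job-status table.

```python
from datetime import datetime, timedelta, timezone


def find_overdue_jobs(last_success, expected_interval, now,
                      grace=timedelta(minutes=5)):
    """Return the sorted names of jobs whose last success is too old.

    last_success: job name -> datetime of the last successful run
    expected_interval: job name -> timedelta between scheduled runs
    A job with no recorded success at all is treated as overdue.
    """
    overdue = []
    for job, interval in expected_interval.items():
        last = last_success.get(job)
        if last is None or now - last > interval + grace:
            overdue.append(job)
    return sorted(overdue)


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    # Placeholder job names and timestamps, for illustration only.
    last_success = {
        "data_export": now - timedelta(hours=2),
        "score_refresh": now - timedelta(minutes=10),
    }
    expected_interval = {
        "data_export": timedelta(hours=1),
        "score_refresh": timedelta(hours=1),
    }
    print(find_overdue_jobs(last_success, expected_interval, now))
```

Checking for a missing success, instead of waiting for an explicit failure report, also catches jobs that never started, which a failure-only alert would miss.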

Open Questions:

  • [ ] Which monitoring solution will provide the best balance between reliability, cost, and ease of implementation?
  • [ ] Are there any specific metrics or additional data points we should capture about the cron jobs to aid in troubleshooting and performance monitoring?

Notes/Assumptions:

  • Assume that the current infrastructure supports integrating new monitoring solutions with minimal changes.
  • Assume that the reliability issues with the current setup can be identified and corrected through an audit of the existing monitoring configurations.

nutrina · Apr 19 '24 06:04