O&M 2024-02-19
As part of day-to-day operation of Data.gov, there are many Operation and Maintenance (O&M) responsibilities. Instead of having the entire team watching notifications and risking some notifications slipping through the cracks, we have created an O&M Triage role. One person on the team is assigned the Triage role, which rotates each sprint. This is not meant to be a 24/7 responsibility; it covers East Coast business hours only. If you are unavailable, please note in Slack when you will be unavailable and ask for someone to take on the role for that time.
Check the O&M Rotation Schedule for future planning.
Acceptance criteria
You are responsible for all O&M tasks this week. We've highlighted a few so they're not forgotten. You can copy each checklist into your daily report.
Daily Checklist
Note: For Catalog Auto Tasks, you will need to update the chart values manually. Click the Action link in each issue and grab the values from `monitor task output` and `check runtime` (a sketch for pulling the run links follows below).
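If clicking through each Action link gets tedious, the same run links can be pulled from the GitHub Actions REST API. A minimal sketch, assuming a `GITHUB_TOKEN` environment variable is set and that the repo and workflow file name below are placeholders to be replaced with the actual monitor workflow:

```python
import os
import requests

REPO = "GSA/data.gov"     # placeholder; point at the repo that runs the monitor workflow
WORKFLOW = "monitor.yml"  # placeholder workflow file name
headers = {"Authorization": f"Bearer {os.environ['GITHUB_TOKEN']}"}

# List the most recent runs of the workflow via the GitHub Actions REST API.
url = f"https://api.github.com/repos/{REPO}/actions/workflows/{WORKFLOW}/runs"
runs = requests.get(url, headers=headers, params={"per_page": 5}).json()

for run in runs["workflow_runs"]:
    # run["html_url"] is the same Action link referenced in each O&M issue;
    # open it to read the `monitor task output` and `check runtime` steps.
    print(run["created_at"], run["conclusion"], run["html_url"])
```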
- [ ] Check auto-generated O&M tickets in the No Status column
- [ ] Check Harvesting Emails
- [ ] New Relic Alerts Triaged
- [ ] Triage DMARC Report from Google (see the parsing sketch after this list)
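Google delivers DMARC aggregate reports as compressed XML attachments. A minimal triage sketch, assuming the XML has already been extracted to disk; the file name below is a placeholder:

```python
import xml.etree.ElementTree as ET

# Placeholder path to an extracted DMARC aggregate (rua) report.
REPORT = "google.com!data.gov!1708300800!1708387200.xml"

root = ET.parse(REPORT).getroot()
for record in root.iter("record"):
    row = record.find("row")
    ip = row.findtext("source_ip")
    count = row.findtext("count")
    # policy_evaluated holds the pass/fail verdicts that matter for triage.
    dkim = row.findtext("policy_evaluated/dkim")
    spf = row.findtext("policy_evaluated/spf")
    if dkim == "fail" or spf == "fail":
        print(f"{ip}: {count} messages, dkim={dkim}, spf={spf}")
```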
Weekly Checklist
- [ ] DB-Solr Sync (a count-comparison sketch follows this list)
- [ ] Audit Log (more info on AU-3 and AU-6 Log auditing)
- [ ] Tracking Update
  - NOTE: This job will consistently time out, but it is still processing results ([more details](https://github.com/GSA/data.gov/issues/4345))
- [ ] Check Catalog Solr
- [ ] Catalog Dupe Check
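One quick DB-Solr sync check is comparing the dataset count CKAN serves from the database with the document count in the Solr index. A minimal sketch, assuming the public catalog API, and with a placeholder Solr URL since the real index is only reachable internally:

```python
import requests

CKAN = "https://catalog.data.gov/api/3/action"
SOLR_SELECT = "http://localhost:8983/solr/ckan/select"  # placeholder

# package_list is served from the database, so its length approximates the
# DB-side dataset count (this is a large response on catalog).
db_count = len(requests.get(f"{CKAN}/package_list").json()["result"])

# numFound for a match-all query gives the Solr-side count.
resp = requests.get(SOLR_SELECT, params={"q": "*:*", "rows": 0, "wt": "json"})
solr_count = resp.json()["response"]["numFound"]

print(f"DB: {db_count}  Solr: {solr_count}  drift: {db_count - solr_count}")
```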
Monthly Checklist
- [ ] Invicti Scan
Ad-hoc Checklist
- [ ] Audit/review applications on Cloud Foundry and determine what can be stopped and/or deleted (see the listing sketch below).
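A minimal sketch for gathering the raw list to review, assuming the cf CLI is installed and `cf login`/`cf target` were already run against the org and space under audit:

```python
import subprocess

# `cf apps` lists every app in the targeted space, one row per app.
out = subprocess.run(["cf", "apps"], capture_output=True, text=True, check=True)

for line in out.stdout.splitlines():
    # After the header block, the second column is the requested state;
    # anything already stopped is a candidate for deletion.
    parts = line.split()
    if len(parts) >= 2 and parts[1] == "stopped":
        print("candidate for deletion:", parts[0])
```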
Reference
- Watch for user email requests
- Watch #datagov-alerts in Slack and the vulnerable dependency notifications (daily email reports) for critical alerts.
- Monitor and improve Data.gov O&M Dashboard
- Update and revise Data.gov O&M Tasks
Recent bot activity caused Catalog and nginx to crash. That activity has stopped for now, and both services are back up. We've identified the IP ranges and will monitor future activity. The request URLs are often malicious: they look for .env files or bypass our cache by querying random text. We have two options to prevent this from happening in the future.
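A minimal sketch of the kind of log pass used to identify those IP ranges, assuming nginx access logs in the default combined format; the log path is a placeholder:

```python
import re
from collections import Counter

LOG = "access.log"  # placeholder path to an nginx access log

# The nginx combined format starts with the client IP; the request path is
# the second field inside the quoted request line.
LINE = re.compile(r'^(\S+) \S+ \S+ \[[^\]]+\] "\S+ (\S+)')

hits = Counter()
with open(LOG) as f:
    for line in f:
        m = LINE.match(line)
        if not m:
            continue
        ip, path = m.groups()
        # Flag the two patterns seen in this incident: probes for .env files,
        # and requests carrying a query string (the cache-bypass probes used
        # random query text), for manual review.
        if ".env" in path or "?" in path:
            hits[ip] += 1

# The heaviest offenders suggest which IP ranges to watch or block.
for ip, count in hits.most_common(10):
    print(ip, count)
```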