O&M 2024-02-19
As part of day-to-day operation of Data.gov, there are many Operation and Maintenance (O&M) responsibilities. Instead of having the entire team watching notifications and risking some notifications slipping through the cracks, we have created an O&M Triage role. One person on the team is assigned the Triage role, which rotates each sprint. This is not meant to be a 24/7 responsibility; it covers East Coast business hours only. If you are unavailable, please note in Slack when you will be unavailable and ask for someone to take on the role for that time.
Check the O&M Rotation Schedule for future planning.
Acceptance criteria
You are responsible for all O&M tasks this week. We've highlighted a few so they're not forgotten. You can copy each checklist into your daily report.
Daily Checklist
Note: For Catalog Auto Tasks, you will need to update the chart values manually. Click the Action link in each issue and grab the values from `monitor task output` and `check runtime` (a sketch for pulling the run links follows below).
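If clicking through each Action link gets tedious, the same run links can be pulled from the GitHub Actions REST API. A minimal sketch, assuming a `GITHUB_TOKEN` environment variable is set and that the repo and workflow file name below are placeholders to be replaced with the actual monitor workflow:

```python
import os
import requests

REPO = "GSA/data.gov"     # placeholder; point at the repo that runs the monitor workflow
WORKFLOW = "monitor.yml"  # placeholder workflow file name
headers = {"Authorization": f"Bearer {os.environ['GITHUB_TOKEN']}"}

# List the most recent runs of the workflow via the GitHub Actions REST API.
url = f"https://api.github.com/repos/{REPO}/actions/workflows/{WORKFLOW}/runs"
runs = requests.get(url, headers=headers, params={"per_page": 5}).json()

for run in runs["workflow_runs"]:
    # run["html_url"] is the same Action link referenced in each O&M issue;
    # open it to read the `monitor task output` and `check runtime` steps.
    print(run["created_at"], run["conclusion"], run["html_url"])
```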
- [ ] Check auto-generated O&M tickets in the No Status column
- [ ] Check Harvesting Emails
- [ ] New Relic Alerts Triaged
- [ ] Triage DMARC Report from Google (see the parsing sketch after this list)
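Google delivers DMARC aggregate reports as compressed XML attachments. A minimal triage sketch, assuming the XML has already been extracted to disk; the file name below is a placeholder:

```python
import xml.etree.ElementTree as ET

# Placeholder path to an extracted DMARC aggregate (rua) report.
REPORT = "google.com!data.gov!1708300800!1708387200.xml"

root = ET.parse(REPORT).getroot()
for record in root.iter("record"):
    row = record.find("row")
    ip = row.findtext("source_ip")
    count = row.findtext("count")
    # policy_evaluated holds the pass/fail verdicts that matter for triage.
    dkim = row.findtext("policy_evaluated/dkim")
    spf = row.findtext("policy_evaluated/spf")
    if dkim == "fail" or spf == "fail":
        print(f"{ip}: {count} messages, dkim={dkim}, spf={spf}")
```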
Weekly Checklist
- [ ] DB-Solr Sync (a count-comparison sketch follows this list)
- [ ] Audit Log (more info on AU-3 and AU-6 Log auditing)
- [ ] Tracking Update
  - NOTE: This job will consistently time out, but it is still processing results ([more details](https://github.com/GSA/data.gov/issues/4345))
- [ ] Check Catalog Solr
- [ ] Catalog Dupe Check
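One quick DB-Solr sync check is comparing the dataset count CKAN serves from the database with the document count in the Solr index. A minimal sketch, assuming the public catalog API, and with a placeholder Solr URL since the real index is only reachable internally:

```python
import requests

CKAN = "https://catalog.data.gov/api/3/action"
SOLR_SELECT = "http://localhost:8983/solr/ckan/select"  # placeholder

# package_list is served from the database, so its length approximates the
# DB-side dataset count (this is a large response on catalog).
db_count = len(requests.get(f"{CKAN}/package_list").json()["result"])

# numFound for a match-all query gives the Solr-side count.
resp = requests.get(SOLR_SELECT, params={"q": "*:*", "rows": 0, "wt": "json"})
solr_count = resp.json()["response"]["numFound"]

print(f"DB: {db_count}  Solr: {solr_count}  drift: {db_count - solr_count}")
```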
Monthly Checklist
- [ ] Invicti Scan
Ad-hoc Checklist
- [ ] Audit/review applications on Cloud Foundry and determine what can be stopped and/or deleted (see the listing sketch below).
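A minimal sketch for gathering the raw list to review, assuming the cf CLI is installed and `cf login`/`cf target` were already run against the org and space under audit:

```python
import subprocess

# `cf apps` lists every app in the targeted space, one row per app.
out = subprocess.run(["cf", "apps"], capture_output=True, text=True, check=True)

for line in out.stdout.splitlines():
    # After the header block, the second column is the requested state;
    # anything already stopped is a candidate for deletion.
    parts = line.split()
    if len(parts) >= 2 and parts[1] == "stopped":
        print("candidate for deletion:", parts[0])
```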
Reference
- Watch for user email requests
- Watch #datagov-alerts in Slack and the vulnerable dependency notifications (daily email reports) for critical alerts.
- Monitor and improve Data.gov O&M Dashboard
- Update and revise Data.gov O&M Tasks
Recent bot activity caused Catalog and nginx to crash. That activity has stopped for now, and both services are back up. We've identified the IP ranges and will monitor future activity. The request URLs are often malicious: they look for .env files or bypass our cache by querying random text. We have two options to prevent this from happening in the future.
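A minimal sketch of the kind of log pass used to identify those IP ranges, assuming nginx access logs in the default combined format; the log path is a placeholder:

```python
import re
from collections import Counter

LOG = "access.log"  # placeholder path to an nginx access log

# The nginx combined format starts with the client IP; the request path is
# the second field inside the quoted request line.
LINE = re.compile(r'^(\S+) \S+ \S+ \[[^\]]+\] "\S+ (\S+)')

hits = Counter()
with open(LOG) as f:
    for line in f:
        m = LINE.match(line)
        if not m:
            continue
        ip, path = m.groups()
        # Flag the two patterns seen in this incident: probes for .env files,
        # and requests carrying a query string (the cache-bypass probes used
        # random query text), for manual review.
        if ".env" in path or "?" in path:
            hits[ip] += 1

# The heaviest offenders suggest which IP ranges to watch or block.
for ip, count in hits.most_common(10):
    print(ip, count)
```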