[Epic] Monitoring
Details
We require a metrics and log aggregation solution with notifications in order to better understand our platform. This will allow us to address issues before our users report them and to improve our resource usage efficiency.
Tasks
- Getting up and running:
- [x] #884
- [ ] MVP
- [x] #844
- [x] #971
- [x] move to authenticated gateway
- [x] useruc nodes cannot communicate with elasticsearch-http service
- [x] #908
- [x] #845
- [ ] #914
- [x] #915
- [x] #916
- [x] #1462
- [x] #1463
- [x] #919
- [ ] Evaluate which logs and metrics require alerting
- [ ] Create escalation plan
- [ ] Define metric MVP
- [ ] Define app log MVP
- [ ] Create/modify specific app repo
- [ ] Implement in Dev
- [ ] Testing
- [ ] Implement in Prod
- [ ] Status/uptime
- [ ] Service status user page
- [x] #366
Cluster stability and cost management monitoring enhancements
- [ ] #1365
- [ ] #1364
- [ ] #1363
- [ ] #1366
- [x] #1369
Previous issues opened, but will no longer do
- [ ] ~Metrics~
- [x] #841
- [x] #842
Improvements / Future features
- [ ] Notification solution implementation
- [ ] Uptime Kuma PoC
- [ ] External monitoring
- [ ] Grafana integration w/ Kubeflow user dashboards
- [ ] Cluster quota notifications
- [ ] Elastic monitoring of elastic
- [ ] Elastic application performance PoC
Wish list
- [ ] node pools
Idea for a future feature, authentication without paying for the integrated OIDC capability: https://robrankin.github.io/posts/kibaba-oauth-kubernetes/
@Jose-Matsuda this was a grouped list of items that we wanted to implement during our original go at monitoring. Have a look at this epic and use it for inspiration if you'd like. If there are things on here that we have already done, update their statuses. :-)