machine-controller-manager
Better logging for MCM
Story
As an operator and user, I want to see an event log showing who or what did what to my machine objects, so that I can analyse issues retrospectively.
Motivation
- It is useful to understand what a human user or automated process did to the machines to explain (and next time prevent) issues.
- MCM logs should be easier for humans to read and carry proper timestamps.
Acceptance Criteria
- [ ] MCM log lines don't carry a proper timestamp (only the time right now); we need a better logging style.
- [ ] Remove unwanted error messages that flood the MCM logs.
- [x] Log when a machine was added, removed, or replaced, and why (e.g. detected DiskPressure, instructed by cluster autoscaler). Such logs exist today, but they need to be more human-readable.
- [x] Update machine health check status with the exact reason on machine objects.
- [ ] Centralize this logging to provide a mechanism for reading the important logs from a higher-level controller like Gardener #236.
- [ ] Log machine creation/drain/deletion times (possibly also export them as metrics?)
Release Notes
- You can now see who modified machine objects and what was modified as well as see what the machine controller did to the worker pools and why.
Definition of Done
- [ ] Knowledge is distributed: Have you spread your knowledge in pair programming/code review?
- [ ] Unit tests are provided: Have you written automated unit tests?
- [ ] Integration tests are provided: Have you written automated integration tests?
- [ ] Minimum API exposure: If you have added/changed public API, was it really necessary/is it minimal?
- [ ] Operations guide: Have you updated the operations guide about ops-relevant changes?
- [ ] User documentation: Have you updated the READMEs/documentation about user-relevant changes?
We also need to figure out a way to make sense of the roll-outs, monitor them, and alert on any unusual behaviour.