Better logging for MCM

Open prashanth26 opened this issue 7 years ago • 1 comments

Story

As an operator and user, I want to see an event log showing who/what did what to my machine objects, so that I can retrospectively analyse issues.

It is useful to understand what a human user or automated process did to the machines to explain (and next time prevent) issues.
MCM logs should be more human-readable and carry proper timestamps.

[ ] The MCM logging style doesn't specify the timestamp properly (only the time right now); we need a better logging style.
[ ] Remove unwanted error messages that flood the MCM logs.
[x] Log when a machine was added, removed, or replaced, and why (e.g. detected DiskPressure, instructed by cluster autoscaler). Such logs exist today, but we need to make them more human-readable.
[x] Update machine health check status with the exact reason on machine objects.
[ ] Centralize this logging, so that a higher-level controller like Gardener has a mechanism to read the important logs #236.
[ ] Log machine creation/drain/deletion times (possibly also export them as metrics?)

You can now see who modified machine objects and what was modified as well as see what the machine controller did to the worker pools and why.

[ ] Knowledge is distributed: Have you spread your knowledge in pair programming/code review?
[ ] Unit tests are provided: Have you written automated unit tests?
[ ] Integration tests are provided: Have you written automated integration tests?
[ ] Minimum API exposure: If you have added/changed public API, was it really necessary/is it minimal?
[ ] Operations guide: Have you updated the operations guide about ops-relevant changes?
[ ] User documentation: Have you updated the READMEs/documentation about user-relevant changes?

Aug 20 '18 08:08 prashanth26

We need to figure out a way to make sense of the roll-outs and to alert on/monitor any unusual behaviour.

Mar 27 '19 09:03 prashanth26