machine-controller-manager

Better logging for MCM

Open prashanth26 opened this issue 7 years ago • 1 comments

Story

As an operator and user, I want to see an event log showing who or what did what to my machine objects, so that I can retrospectively analyse issues.

Motivation

  • It is useful to understand what a human user or automated process did to the machines to explain (and next time prevent) issues.
  • MCM logs should be easier for humans to read and should carry proper timestamps.

Acceptance Criteria

  • [ ] The MCM logging style doesn't specify the timestamp properly (only the time right now); we need a better logging style.
  • [ ] Remove unwanted error messages that flood the MCM logs.
  • [x] Log when a machine was added, removed, or replaced, and why (e.g. detected DiskPressure, instructed by cluster autoscaler). Although such logs exist today, we need to make them more human-readable.
  • [x] Update machine health check status with the exact reason on machine objects.
  • [ ] Centralize this logging, so that a higher-level controller like Gardener has a mechanism to read the important logs #236.
  • [ ] Log machine creation/drain/deletion times (possibly also export them as metrics?)
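
To make the criteria above more concrete, here is a minimal sketch in Go of what a human-readable machine event line could look like: a full date-and-time timestamp (RFC 3339), the machine, the action, who or what triggered it, the reason, and how long the operation took. The `MachineEvent` type and the `FormatMachineEvent` function are hypothetical names invented for this illustration; they are not part of MCM's code, and the line format is only an assumption about what "more human readable" could mean here.

```go
package main

import (
	"fmt"
	"time"
)

// MachineEvent is a hypothetical record of one action taken on a machine
// object. It is an illustration only, not an MCM type.
type MachineEvent struct {
	At      time.Time     // when the action finished
	Machine string        // name of the machine object
	Action  string        // e.g. "create", "drain", "delete"
	Actor   string        // who or what triggered the action
	Reason  string        // why the action was taken
	Took    time.Duration // how long the action took
}

// FormatMachineEvent renders the event as a single log line that starts
// with a full RFC 3339 timestamp in UTC (date and time, not time only).
func FormatMachineEvent(e MachineEvent) string {
	return fmt.Sprintf("%s machine=%s action=%s actor=%s reason=%q took=%s",
		e.At.UTC().Format(time.RFC3339),
		e.Machine, e.Action, e.Actor, e.Reason, e.Took)
}

func main() {
	// Example values are made up for the sketch.
	fmt.Println(FormatMachineEvent(MachineEvent{
		At:      time.Now(),
		Machine: "worker-1",
		Action:  "delete",
		Actor:   "cluster-autoscaler",
		Reason:  "scale-down requested",
		Took:    90 * time.Second,
	}))
}
```

Keeping the duration as a field of the event means the same value could later feed a metric as well as the log line, which is one way to approach the last criterion.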

Release Notes

  • You can now see who modified machine objects and what was modified as well as see what the machine controller did to the worker pools and why.

Definition of Done

  • [ ] Knowledge is distributed: Have you spread your knowledge in pair programming/code review?
  • [ ] Unit tests are provided: Have you written automated unit tests?
  • [ ] Integration tests are provided: Have you written automated integration tests?
  • [ ] Minimum API exposure: If you have added/changed public API, was it really necessary/is it minimal?
  • [ ] Operations guide: Have you updated the operations guide about ops-relevant changes?
  • [ ] User documentation: Have you updated the READMEs/documentation about user-relevant changes?

prashanth26, Aug 20 '18 08:08

We need to figure out a way to make sense of the roll-outs and to monitor and alert on any usual/unusual behaviour.

prashanth26, Mar 27 '19 09:03