MCM marking Unknown machines Running on Unfreezing
How to categorize this issue?
/area robustness /kind bug /priority 3
What would you like to be added:
Currently the machine controller gets frozen if we can't reach the KAPI using the domain provided in the kubeconfig. It gets unfrozen again once the KAPI is reachable, but at that point we mark all the Unknown machines as Running again.
That logic needs to be removed, or at least enhanced.
Why is this needed:
Changing the state of these machines without knowing why or when they were marked Unknown is not good practice. The logic is there to prevent a scenario where we start marking machines as Failed because they had been Unknown for too long, even though the reason for being Unknown was that the KAPI was unreachable, not that the kubelet was down or the node unhealthy.
NOTE
If this is done, the MCM FAQs and docs need to be updated as well, and the unit tests need to be adapted. If possible, the FAQ should also be improved to explain the difference between freezing of the machine controller due to an unreachable API server and freezing of a machineset/machinedeployment due to overshooting.
Machines go to the Unknown state only from the Running state, as also mentioned here: https://github.com/gardener/machine-controller-manager/pull/180#discussion_r228443386
So marking Unknown machines back to Running is consistent in itself, but doing it just to reset the health check is not. The health check could instead be reset by keeping the phase as Unknown and only changing the LastUpdateTime to time.Now().
That way no harm would be done to the Unknown machine.
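For illustration, a minimal Go sketch of this idea, using simplified stand-in types rather than the actual MCM Machine API:

```go
// Minimal sketch (not the actual MCM code): instead of flipping an Unknown
// machine back to Running, the health-check timer could be reset by bumping
// LastUpdateTime while leaving the phase untouched.
package main

import (
	"fmt"
	"time"
)

type MachinePhase string

const (
	PhaseRunning MachinePhase = "Running"
	PhaseUnknown MachinePhase = "Unknown"
)

type MachineStatus struct {
	Phase          MachinePhase
	LastUpdateTime time.Time
}

// resetHealthCheckTimer restarts the health-check timeout for a machine in
// Unknown phase without pretending the machine is Running again.
func resetHealthCheckTimer(status *MachineStatus) {
	if status.Phase == PhaseUnknown {
		status.LastUpdateTime = time.Now() // only the timer is reset
	}
}

func main() {
	s := &MachineStatus{Phase: PhaseUnknown, LastUpdateTime: time.Now().Add(-9 * time.Minute)}
	resetHealthCheckTimer(s)
	fmt.Println(s.Phase, s.LastUpdateTime) // still Unknown, timer restarted
}
```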
We discussed this in the grooming and noted our observations and a proposed solution below. The design will need to be refined when we pick this up for implementation.
Justification for Current Behaviour
This code to mark Unknown Machines back to Running was introduced in the machine safety controller reconcile function to address a specific corner case:
- In the regular health check reconciliation of the machine controller, say that a node was detected as unhealthy (node ready condition false, or one of the problem node conditions true). The machine controller then moves the Machine phase to Unknown. Note that the Machine phase is moved back to Running if the node becomes healthy again. However, if the Machine phase remains Unknown for the duration of the configured health-check-timeout, it is transitioned to phase Failed. (Failed machines subsequently go through the node drain, VM-deletion and node-deletion flows, which are not described here.) Expression: if time.Now()-machineStatusLastUpdateTime > timeoutDuration && unhealthy then mark phase as Failed (a simplified sketch of this check follows the list).
- Let us say that the API server went down just after a machine M1 was transitioned to the Unknown phase.
- The safety controller freezes the MC once the API server (either control or target) is detected as down. This disables all the regular reconcile activities: lifecycle/drain/health-check etc. The only thing that keeps running is the safety controller itself.
- The API server came back up after some time.
- The safety controller unfreezes the MC after a time that exceeds the health-check-timeout.
- The health check now runs. The machine M1 is seen to be in Unknown state, and the node associated with it still looks unhealthy because the KCM has not yet had the chance to update the latest node conditions. The backing VM instance is, however, actually healthy.
- The MC sees that the machine has been in Unknown state for a time exceeding the configured health-check-timeout.
- The MC transitions the Machine phase to Failed. The drain and deletion flows are subsequently initiated.
- Unfortunately, a healthy machine was shut down.
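As referenced in the first item above, here is a minimal Go sketch of the current timeout rule, using simplified stand-in types rather than the real MCM health check:

```go
// Minimal sketch (simplified types, not the actual reconcileHealth): an
// unhealthy machine that has stayed Unknown longer than the configured
// health-check-timeout is moved to Failed.
package main

import (
	"fmt"
	"time"
)

type MachinePhase string

const (
	PhaseRunning MachinePhase = "Running"
	PhaseUnknown MachinePhase = "Unknown"
	PhaseFailed  MachinePhase = "Failed"
)

type Machine struct {
	Name           string
	Phase          MachinePhase
	LastUpdateTime time.Time
	NodeHealthy    bool
}

// reconcileHealth applies: if now - LastUpdateTime > timeout && unhealthy,
// then mark the machine as Failed.
func reconcileHealth(m *Machine, healthCheckTimeout time.Duration, now time.Time) {
	if !m.NodeHealthy && now.Sub(m.LastUpdateTime) > healthCheckTimeout {
		m.Phase = PhaseFailed
	}
}

func main() {
	m := &Machine{Name: "M1", Phase: PhaseUnknown, NodeHealthy: false,
		LastUpdateTime: time.Now().Add(-11 * time.Minute)}
	reconcileHealth(m, 10*time.Minute, time.Now())
	fmt.Println(m.Name, m.Phase) // Failed: Unknown for longer than the 10 min timeout
}
```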
To address this, the current machine safety controller also moves machines in phase Unknown back to phase Running when it unfreezes the MC. So, effectively, the health check is reset. When the next health check runs, any machines that are healthy remain Running; any machines that are not healthy are transitioned back to Unknown.
Our default health-check-timeout is 10 minutes.
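A minimal Go sketch of this reset-on-unfreeze behaviour, again with simplified stand-in types rather than the actual safety controller code:

```go
// Minimal sketch (simplified, not the real safety controller): when the MC is
// unfrozen after the API server comes back, every machine in Unknown phase is
// flipped back to Running, which effectively restarts the health check for
// all of them.
package main

import "fmt"

type MachinePhase string

const (
	PhaseRunning MachinePhase = "Running"
	PhaseUnknown MachinePhase = "Unknown"
)

type Machine struct {
	Name  string
	Phase MachinePhase
}

// onUnfreeze mimics the current behaviour: all Unknown machines are reset to
// Running, regardless of why or when they became Unknown.
func onUnfreeze(machines []*Machine) {
	for _, m := range machines {
		if m.Phase == PhaseUnknown {
			m.Phase = PhaseRunning
		}
	}
}

func main() {
	machines := []*Machine{
		{Name: "M1", Phase: PhaseUnknown},
		{Name: "M2", Phase: PhaseRunning},
	}
	onUnfreeze(machines)
	for _, m := range machines {
		fmt.Println(m.Name, m.Phase) // M1 is now Running even if its node is still unhealthy
	}
}
```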
Issue
The problem with the above hack is that if a machine went unhealthy at time T1, had been unhealthy for 9 minutes (T2 = T1 + 9), and the API server went down and came back up after 3 minutes (T3 = T2 + 3), then we again wait the full 10 minutes (T4 = T3 + 10) before moving the machine to Failed (in case it still doesn't pass the health check). The machine is thus only failed at T4 = T1 + 22 minutes instead of T1 + 10 minutes: we effectively double the wait for a truly unhealthy machine before draining and deleting it. This causes a lot of delay.
Solution Proposal
- reconcileMachineSafetyAPIserver should only toggle freeze/unfreeze; it should not set the Unknown machines to Running. Phase transition is the job of the health check. Individual actors should have segregated responsibilities.
- reconcileHealth, instead of blindly moving all Unknown and unhealthy machines to the Failed phase, also checks whether the MC was frozen and unfrozen since its last run. If so, it grants an unfrozenGracePeriod so that the KCM gets the opportunity to update the node conditions (see the sketch after this list).
  OLD: if time.Now()-machineStatusLastUpdateTime > timeoutDuration && unhealthy then mark phase as Failed
  NEW: if time.Now()-machineStatusLastUpdateTime-unfrozenGracePeriod > timeoutDuration && unhealthy then mark phase as Failed
- The above ensures that we do not needlessly delay moving a truly failed machine to the Failed phase. It also ensures that we do not spuriously move a truly Running machine to the Failed phase.
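As referenced above, a minimal Go sketch of the proposed check; the types and the names unfrozenGracePeriod / wasUnfrozenSinceLastRun are placeholders, not the final design:

```go
// Minimal sketch of the proposed change (simplified stand-in types): the
// health check grants an extra grace period only when the MC was frozen and
// unfrozen since its last run, so a truly unhealthy machine is not delayed
// twice.
package main

import (
	"fmt"
	"time"
)

type MachinePhase string

const (
	PhaseUnknown MachinePhase = "Unknown"
	PhaseFailed  MachinePhase = "Failed"
)

type Machine struct {
	Name           string
	Phase          MachinePhase
	LastUpdateTime time.Time
	NodeHealthy    bool
}

// reconcileHealth applies the proposed rule:
//   if now - LastUpdateTime - grace > timeout && unhealthy -> Failed
// where grace is unfrozenGracePeriod if the MC was unfrozen since the last
// run, and zero otherwise.
func reconcileHealth(m *Machine, timeout, unfrozenGracePeriod time.Duration,
	wasUnfrozenSinceLastRun bool, now time.Time) {
	grace := time.Duration(0)
	if wasUnfrozenSinceLastRun {
		grace = unfrozenGracePeriod // give KCM time to refresh node conditions
	}
	if !m.NodeHealthy && now.Sub(m.LastUpdateTime)-grace > timeout {
		m.Phase = PhaseFailed
	}
}

func main() {
	// Machine has been Unknown for 12 min; MC was just unfrozen, grace = 3 min.
	m := &Machine{Name: "M1", Phase: PhaseUnknown, NodeHealthy: false,
		LastUpdateTime: time.Now().Add(-12 * time.Minute)}
	reconcileHealth(m, 10*time.Minute, 3*time.Minute, true, time.Now())
	fmt.Println(m.Name, m.Phase) // still Unknown: 12 - 3 = 9 min < 10 min timeout
}
```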