MCM does not reset the failed_machines gauge once the machine is deleted
What happened:
When a machine is no longer reachable, MCM correctly reports the machine as unhealthy and creates the corresponding failed_machines gauge metric. However, when the machine is actually deleted, that metric is not cleared and keeps being reported.
This is a problem for alerting rules based on the failed_machines metric, as it never gets reset to 0.
It may be easier to change this metric to a counter, so you don't have to worry about resetting its value, as keeping track of the failed machines may not be trivial or wanted.
I0616 23:12:14.072380 1 event.go:255] Event(v1.ObjectReference{Kind:"MachineSet", Namespace:"machine-controller-manager", Name:"mcm-immutable-node-az-b-6cb7bb5d9", UID:"31b3fc43-a668-11ea-b0fa-0a1dc5f423b0", APIVersion:"machine.sapcloud.io/v1alpha1", ResourceVersion:"371959992", FieldPath:""}): type: 'Normal' reason: 'SuccessfulDelete' Deleted machine: mcm-immutable-node-az-b-6cb7bb5d9-7zgk9
E0616 23:12:12.968331 1 machine.go:931] Machine mcm-immutable-node-az-b-6cb7bb5d9-7zgk9 is not healthy since 10m0s minutes. Changing status to failed. Node Conditions: [...] {Type:Ready Status:Unknown LastHeartbeatTime:2020-06-16 22:59:55 +0000 UTC LastTransitionTime:2020-06-16 23:01:18 +0000 UTC Reason:NodeStatusUnknown Message:Kubelet stopped posting node status.}
W0616 23:01:38.218468 1 machine.go:729] Machine mcm-immutable-node-az-b-6cb7bb5d9-7zgk9 is unhealthy - changing MachineState to Unknown
What you expected to happen:
The failed_machines gauge metric should be reset once the machine is deleted (or recovers).
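The expected behaviour can be sketched with a small stdlib-only model of a labelled gauge. This is not MCM's actual code: the type `failedMachineGauges` and its methods are hypothetical names used for illustration. The idea is that the series for a machine is dropped as soon as the machine is deleted or recovers, instead of being left at its last value. With prometheus/client_golang, the equivalent step would presumably be a call to `DeleteLabelValues` on the `GaugeVec`.

```go
package main

import (
	"fmt"
	"sync"
)

// failedMachineGauges is a simplified, hypothetical stand-in for a labelled
// gauge: one series per machine name, with value 1 while the machine is failed.
type failedMachineGauges struct {
	mu     sync.Mutex
	series map[string]float64
}

func newFailedMachineGauges() *failedMachineGauges {
	return &failedMachineGauges{series: make(map[string]float64)}
}

// MarkFailed creates or updates the series for a failed machine.
func (g *failedMachineGauges) MarkFailed(machine string) {
	g.mu.Lock()
	defer g.mu.Unlock()
	g.series[machine] = 1
}

// Clear drops the series for a machine. It is meant to be called when the
// machine is deleted or becomes healthy again, so the series is no longer exported.
func (g *failedMachineGauges) Clear(machine string) {
	g.mu.Lock()
	defer g.mu.Unlock()
	delete(g.series, machine)
}

// Count returns the number of series currently exported.
func (g *failedMachineGauges) Count() int {
	g.mu.Lock()
	defer g.mu.Unlock()
	return len(g.series)
}

func main() {
	g := newFailedMachineGauges()
	name := "mcm-immutable-node-az-b-6cb7bb5d9-7zgk9"

	g.MarkFailed(name) // health check timed out
	fmt.Println("failed series after health check:", g.Count())

	g.Clear(name) // machine object deleted
	fmt.Println("failed series after deletion:", g.Count())
}
```

Dropping the series, rather than setting it to 0, keeps `count(...)` style queries accurate, because a deleted machine no longer contributes a time series at all.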
How to reproduce it (as minimally and precisely as possible):
- Try creating a machine that doesn't bootstrap and watch the number of failed machines keep going up: count(mcm_machine_deployment_failed_machines)
- Create a machine, then delete the underlying VM and observe the value of failed_machines
Anything else we need to know:
Environment:
- kubernetes 1.12
- image: eu.gcr.io/gardener-project/gardener/machine-controller-manager:v0.26.3
Interpretation/Solutions
- We need a new gauge metric named num_failed_machines, which reports the number of machines that are currently in the Failed phase.
- The mcm_machine_deployment_operation_failed gauge metric covers machines whose last operation failed. Gauge here means that the metric won't repeat; the slice needs to be cleared once the machine is deleted.
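One way to read the first proposal is to derive the gauge value from the current set of machines on every update, so that deleted machines drop out of the count without any explicit reset. The sketch below assumes a trimmed-down, hypothetical `machine` type and a helper `countFailed`; neither is taken from the MCM code base.

```go
package main

import "fmt"

// machine is a hypothetical, trimmed-down view of a Machine object.
type machine struct {
	Name  string
	Phase string // for example "Pending", "Running" or "Failed"
}

// countFailed returns how many of the given machines are currently in the
// Failed phase. The result is what a num_failed_machines gauge would be set to.
func countFailed(machines []machine) int {
	n := 0
	for _, m := range machines {
		if m.Phase == "Failed" {
			n++
		}
	}
	return n
}

func main() {
	machines := []machine{
		{Name: "node-a", Phase: "Running"},
		{Name: "node-b", Phase: "Failed"},
	}
	fmt.Println("failed machines:", countFailed(machines))

	// After node-b is deleted it is no longer listed, so the value drops.
	machines = machines[:1]
	fmt.Println("failed machines:", countFailed(machines))
}
```

Because the value is recomputed from the live list rather than incremented on failure events, there is no stale state to clean up.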
@sebbonnet Thank you for your contribution.
cc @ggaurav10
@rfranzke commented on Apr 29
What happened:
MCM does not update the .status.failedMachines of the MachineDeployment after the .status.lastOperation of the Machine changes (e.g., from Failed to Processing after the credentials have been fixed):
status:
  availableReplicas: 2
  conditions:
  - lastTransitionTime: "2020-04-29T06:51:38Z"
    lastUpdateTime: "2020-04-29T06:51:38Z"
    message: Deployment does not have minimum availability.
    reason: MinimumReplicasUnavailable
    status: "False"
    type: Available
  failedMachines:
  - lastOperation:
      description: 'Failed to list VMs while deleting the machine "shoot--foo--bar-cpu-worker-z1-5cdcb46f64-pxzp5"
        AuthFailure: AWS was not able to validate the provided access credentials
        status code: 401, request id: 6e99231c-654e-4b05-8801-310e3532b4e9'
      lastUpdateTime: "2020-04-29T06:53:33Z"
      state: Failed
      type: Delete
    name: shoot--foo--bar-cpu-worker-z1-5cdcb46f64-pxzp5
    ownerRef: shoot--foo--bar-cpu-worker-z1-5cdcb46f64
  observedGeneration: 2

spec:
  class:
    kind: AWSMachineClass
    name: shoot--foo--bar-cpu-worker-z1-ff76e
  nodeTemplate:
    metadata:
      creationTimestamp: null
      labels:
        node.kubernetes.io/role: node
        worker.garden.sapcloud.io/group: cpu-worker
        worker.gardener.cloud/pool: cpu-worker
    spec: {}
  providerID: aws:///eu-west-1/i-05f4737c3ef646f89
status:
  currentStatus:
    lastUpdateTime: "2020-04-29T07:41:44Z"
    phase: Pending
    timeoutActive: true
  lastOperation:
    description: Creating machine on cloud provider
    lastUpdateTime: "2020-04-29T07:41:44Z"
    state: Processing
    type: Create
  node: ip-10-250-9-55.eu-west-1.compute.internal
(compare the timestamps)
What you expected to happen:
The .status.failedMachines of the MachineDeployment is properly updated whenever the .status.lastOperation of a Machine object changes.
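A minimal sketch of that expectation: rebuild the failed-machine list from the current Machine objects on every reconciliation, so an entry disappears once the machine's last operation is no longer Failed. The types `lastOperation`, `machine` and `machineSummary` and the function `failedMachineSummaries` are simplified, hypothetical names for illustration and do not mirror the real MCM API types.

```go
package main

import "fmt"

// lastOperation is a hypothetical, reduced form of .status.lastOperation.
type lastOperation struct {
	Description string
	State       string // for example "Processing", "Successful" or "Failed"
	Type        string // for example "Create" or "Delete"
}

// machine is a hypothetical, reduced form of a Machine object.
type machine struct {
	Name          string
	LastOperation lastOperation
}

// machineSummary is a hypothetical entry of .status.failedMachines.
type machineSummary struct {
	Name          string
	LastOperation lastOperation
}

// failedMachineSummaries rebuilds the list from scratch and keeps only the
// machines whose last operation is currently in the Failed state.
func failedMachineSummaries(machines []machine) []machineSummary {
	summaries := make([]machineSummary, 0, len(machines))
	for _, m := range machines {
		if m.LastOperation.State != "Failed" {
			continue
		}
		summaries = append(summaries, machineSummary{Name: m.Name, LastOperation: m.LastOperation})
	}
	return summaries
}

func main() {
	m := machine{
		Name:          "shoot--foo--bar-cpu-worker-z1-5cdcb46f64-pxzp5",
		LastOperation: lastOperation{State: "Failed", Type: "Delete"},
	}
	fmt.Println("entries while failed:", len(failedMachineSummaries([]machine{m})))

	// Credentials fixed: the last operation moves on to Processing.
	m.LastOperation = lastOperation{
		Description: "Creating machine on cloud provider",
		State:       "Processing",
		Type:        "Create",
	}
	fmt.Println("entries after recovery:", len(failedMachineSummaries([]machine{m})))
}
```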