machine-controller-manager MCM does not reset the failed_machines gauge once the machine is deleted

What happened: When a machine is no longer reachable, MCM correctly reports the machine as unhealthy and creates the corresponding failed_machines gauge metric. However, when the machine is actually deleted, that metric is not cleared and keeps being reported. This is a problem for alerting rules based on the failed_machines metric, as it never gets reset to 0. It may be easier to change this metric to a counter, so you don't have to worry about resetting its value, as keeping track of the failed machines may not be trivial or wanted.

I0616 23:12:14.072380       1 event.go:255] Event(v1.ObjectReference{Kind:"MachineSet", Namespace:"machine-controller-manager", Name:"mcm-immutable-node-az-b-6cb7bb5d9", UID:"31b3fc43-a668-11ea-b0fa-0a1dc5f423b0", APIVersion:"machine.sapcloud.io/v1alpha1", ResourceVersion:"371959992", FieldPath:""}): type: 'Normal' reason: 'SuccessfulDelete' Deleted machine: mcm-immutable-node-az-b-6cb7bb5d9-7zgk9
E0616 23:12:12.968331       1 machine.go:931] Machine mcm-immutable-node-az-b-6cb7bb5d9-7zgk9 is not healthy since 10m0s minutes. Changing status to failed. Node Conditions: [...]  {Type:Ready Status:Unknown LastHeartbeatTime:2020-06-16 22:59:55 +0000 UTC LastTransitionTime:2020-06-16 23:01:18 +0000 UTC Reason:NodeStatusUnknown Message:Kubelet stopped posting node status.}
W0616 23:01:38.218468       1 machine.go:729] Machine mcm-immutable-node-az-b-6cb7bb5d9-7zgk9 is unhealthy - changing MachineState to Unknown

What you expected to happen: The failed_machines gauge metric should be reset once the machine is deleted (or recovers).

How to reproduce it (as minimally and precisely as possible):

Try creating a machine that doesn't bootstrap and watch the number of failed machines keep going up: count(mcm_machine_deployment_failed_machines)
Create a machine, then delete the underlying VM and observe the value of failed_machines

Anything else we need to know:

Environment:

kubernetes 1.12
image: eu.gcr.io/gardener-project/gardener/machine-controller-manager:v0.26.3

Interpretation/Solutions

We need a new gauge metric named num_failed_machines that reports the number of machines currently in the Failed phase.
mcm_machine_deployment_operation_failed gauge metric -> for machines whose last operation failed.
Gauge here means that the metric won't repeat; the slice needs to be cleared once the machine is deleted.
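
A minimal sketch of the idea above, assuming the gauge is rebuilt from the current list of machines on every collection instead of being updated incrementally. The `Machine` type, its fields, and the `deployment` label are hypothetical simplifications, not MCM's actual types, and a plain map stands in for the metric to keep the example dependency-free. With prometheus/client_golang, stale series on a `GaugeVec` could instead be dropped with `Reset` or `DeleteLabelValues`.

```go
package main

import (
	"fmt"
	"sort"
)

// Machine is a simplified, hypothetical stand-in for an MCM Machine object.
type Machine struct {
	Name       string
	Deployment string // owning machine deployment (hypothetical label)
	Phase      string // e.g. "Running", "Pending", "Failed"
}

// FailedMachinesByDeployment rebuilds the gauge values from scratch.
// Nothing is carried over from the previous collection, so a deleted or
// recovered machine stops contributing as soon as it leaves the list.
func FailedMachinesByDeployment(machines []Machine) map[string]float64 {
	gauge := make(map[string]float64)
	for _, m := range machines {
		if _, seen := gauge[m.Deployment]; !seen {
			gauge[m.Deployment] = 0 // explicit 0 so alerts can resolve
		}
		if m.Phase == "Failed" {
			gauge[m.Deployment]++
		}
	}
	return gauge
}

// printGauge writes the series in a stable order for readability.
func printGauge(gauge map[string]float64) {
	names := make([]string, 0, len(gauge))
	for name := range gauge {
		names = append(names, name)
	}
	sort.Strings(names)
	for _, name := range names {
		fmt.Printf("num_failed_machines{deployment=%q} %v\n", name, gauge[name])
	}
}

func main() {
	machines := []Machine{
		{Name: "node-az-a-1", Deployment: "node-az-a", Phase: "Running"},
		{Name: "node-az-b-1", Deployment: "node-az-b", Phase: "Failed"},
		{Name: "node-az-b-2", Deployment: "node-az-b", Phase: "Running"},
	}
	printGauge(FailedMachinesByDeployment(machines))

	// The failed machine is deleted: drop it and collect again.
	machines = append(machines[:1], machines[2:]...)
	printGauge(FailedMachinesByDeployment(machines))
}
```

Recomputing the value trades a little work per collection for not having to track which machines were previously counted as failed, which is the bookkeeping the report calls out as non-trivial.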

Jun 18 '20 18:06 sebbonnet

@sebbonnet Thank you for your contribution.

Jun 18 '20 18:06 gardener-robot

cc @ggaurav10

Aug 14 '20 03:08 hardikdr

@rfranzke commented on Apr 29

What happened:
MCM does not update the .status.failedMachines of the MachineDeployment after the .status.lastOperation of the Machine changes, e.g., from Failed to Processing after the credentials have been fixed:

  status:
    availableReplicas: 2
    conditions:
    - lastTransitionTime: "2020-04-29T06:51:38Z"
      lastUpdateTime: "2020-04-29T06:51:38Z"
      message: Deployment does not have minimum availability.
      reason: MinimumReplicasUnavailable
      status: "False"
      type: Available
    failedMachines:
    - lastOperation:
        description: 'Failed to list VMs while deleting the machine "shoot--foo--bar-cpu-worker-z1-5cdcb46f64-pxzp5"
          AuthFailure: AWS was not able to validate the provided access credentials
          status code: 401, request id: 6e99231c-654e-4b05-8801-310e3532b4e9'
        lastUpdateTime: "2020-04-29T06:53:33Z"
        state: Failed
        type: Delete
      name: shoot--foo--bar-cpu-worker-z1-5cdcb46f64-pxzp5
      ownerRef: shoot--foo--bar-cpu-worker-z1-5cdcb46f64
    observedGeneration: 2
  spec:
    class:
      kind: AWSMachineClass
      name: shoot--foo--bar-cpu-worker-z1-ff76e
    nodeTemplate:
      metadata:
        creationTimestamp: null
        labels:
          node.kubernetes.io/role: node
          worker.garden.sapcloud.io/group: cpu-worker
          worker.gardener.cloud/pool: cpu-worker
      spec: {}
    providerID: aws:///eu-west-1/i-05f4737c3ef646f89
  status:
    currentStatus:
      lastUpdateTime: "2020-04-29T07:41:44Z"
      phase: Pending
      timeoutActive: true
    lastOperation:
      description: Creating machine on cloud provider
      lastUpdateTime: "2020-04-29T07:41:44Z"
      state: Processing
      type: Create
    node: ip-10-250-9-55.eu-west-1.compute.internal
(compare the timestamps)

What you expected to happen:
The .status.failedMachines of the MachineDeployment is properly updated whenever the .status.lastOperation of its Machine objects changes.
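
The expected behaviour could be sketched as follows, assuming the failed-machine list is recomputed from each Machine's current last operation on every reconciliation rather than only appended to. All types here (`Machine`, `LastOperation`, `FailedMachine`) and the function `RecomputeFailedMachines` are hypothetical simplifications modelled on the fields shown in the status output above; they are not MCM's actual API.

```go
package main

import "fmt"

// LastOperation mirrors the fields visible in the status output above.
type LastOperation struct {
	Description string
	State       string // e.g. "Processing", "Failed"
	Type        string // e.g. "Create", "Delete"
}

// Machine is a simplified, hypothetical stand-in for an MCM Machine object.
type Machine struct {
	Name          string
	OwnerRef      string
	LastOperation LastOperation
}

// FailedMachine is a hypothetical entry of .status.failedMachines.
type FailedMachine struct {
	Name          string
	OwnerRef      string
	LastOperation LastOperation
}

// RecomputeFailedMachines derives the list from the machines' current
// state, so an entry disappears once the last operation is no longer Failed.
func RecomputeFailedMachines(machines []Machine) []FailedMachine {
	failed := []FailedMachine{}
	for _, m := range machines {
		if m.LastOperation.State != "Failed" {
			continue
		}
		failed = append(failed, FailedMachine{
			Name:          m.Name,
			OwnerRef:      m.OwnerRef,
			LastOperation: m.LastOperation,
		})
	}
	return failed
}

func main() {
	machines := []Machine{{
		Name:          "shoot--foo--bar-cpu-worker-z1-5cdcb46f64-pxzp5",
		OwnerRef:      "shoot--foo--bar-cpu-worker-z1-5cdcb46f64",
		LastOperation: LastOperation{State: "Failed", Type: "Delete"},
	}}
	fmt.Println(len(RecomputeFailedMachines(machines))) // 1

	// Credentials are fixed and the machine moves on to a new operation.
	machines[0].LastOperation = LastOperation{
		Description: "Creating machine on cloud provider",
		State:       "Processing",
		Type:        "Create",
	}
	fmt.Println(len(RecomputeFailedMachines(machines))) // 0
}
```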

Aug 14 '20 03:08 hardikdr