containers-roadmap: CoreDNS add-on fails with InsufficientNumberOfReplicas - The add-on is unhealthy because it doesn't have the desired number of replicas.

Problem Description

The CoreDNS add-on fails with InsufficientNumberOfReplicas - The add-on is unhealthy because it doesn't have the desired number of replicas. The cluster is created with the following Terraform configuration:

module "eks" {
  source  = "terraform-aws-modules/eks/aws"
  version = "~> 20.0"

  cluster_name                   = local.name
  cluster_version                = local.cluster_version

  cluster_endpoint_private_access = true
  cluster_endpoint_public_access  = true

  enable_irsa = true

  enable_cluster_creator_admin_permissions = true

  vpc_id     = module.vpc.vpc_id
  subnet_ids = module.vpc.private_subnets

  eks_managed_node_groups = {
    cis_ami = {
      instance_types = ["m5.large"]

      ami_id = data.aws_ami.image.id

      # This will ensure the bootstrap user data is used to join the node
      enable_bootstrap_user_data = true

      iam_role_attach_cni_policy = true

      min_size     = 1
      max_size     = 6
      desired_size = 4
    }
  }

  # EKS Addons
  cluster_addons = {
    coredns    = {
      most_recent = true
    }
    kube-proxy = {
      most_recent = true
    }
    # aws-ebs-csi-driver   = {
    #   service_account_role_arn = module.ebs_csi_driver_irsa.iam_role_arn
    # }
    vpc-cni = {
      before_compute = true
      most_recent    = true
      configuration_values = jsonencode({
        env = {
          ENABLE_PREFIX_DELEGATION = "true"
          WARM_PREFIX_TARGET       = "1"
        }
      })
    }
  }

  tags = local.tags
}

Security disclosures

I noticed that the CoreDNS pods fail to become ready (0/1) specifically when the nodes use the CIS EKS-Optimized Amazon Linux 2 Level 1 image as a custom AMI:

❯ k get po -n kube-system
NAME                                                         READY   STATUS             RESTARTS         AGE
aws-load-balancer-controller-54f58989fd-hj848                0/1     CrashLoopBackOff   17 (3m49s ago)   70m
aws-load-balancer-controller-54f58989fd-k2qzn                0/1     CrashLoopBackOff   17 (4m1s ago)    70m
aws-node-9bkt8                                               2/2     Running            0                71m
aws-node-psdvq                                               2/2     Running            0                71m
aws-node-qmhxg                                               2/2     Running            0                71m
aws-node-xl99d                                               2/2     Running            0                71m
cluster-autoscaler-aws-cluster-autoscaler-848fbf899c-8nxls   0/1     CrashLoopBackOff   16 (90s ago)     66m
coredns-557586b4b9-hnlg5                                     0/1     Running            0                64m
coredns-6f99ddbc54-pkltm                                     0/1     Running            0                56m
coredns-6f99ddbc54-xw65l                                     0/1     Running            0                56m
ebs-csi-controller-576c8d5c58-4q6vc                          6/6     Running            0                69m
ebs-csi-controller-576c8d5c58-qk6m9                          6/6     Running            0                70m
ebs-csi-node-5fztg                                           1/3     CrashLoopBackOff   40 (3m12s ago)   71m
ebs-csi-node-7tnrt                                           1/3     CrashLoopBackOff   41 (2m47s ago)   71m
ebs-csi-node-bmpqh                                           2/3     CrashLoopBackOff   41 (3m1s ago)    71m
ebs-csi-node-splqt                                           1/3     CrashLoopBackOff   39 (3m48s ago)   71m
kube-proxy-76gkn                                             1/1     Running            0                71m
kube-proxy-gkhcn                                             1/1     Running            0                71m
kube-proxy-hqxw7                                             1/1     Running            0                71m
kube-proxy-kxfds                                             1/1     Running
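
In the listing above, the CoreDNS pods are Running with zero restarts but 0/1 READY, which suggests they are failing readiness checks rather than crash-looping like the other pods. To make that distinction easier to spot, here is a minimal sketch that filters `kubectl get po` output down to pods whose READY count is short. The function name `not_ready` is hypothetical, and it assumes the default table layout shown above (NAME, READY, STATUS as the first three columns).

```shell
# Print pods whose READY column shows fewer ready containers than total.
# Reads `kubectl get po` table output on stdin and skips the header row.
not_ready() {
  awk 'NR > 1 && NF >= 3 {
    # READY looks like "0/1": ready containers / total containers.
    n = split($2, ready, "/")
    if (n == 2 && ready[1] != ready[2]) print $1, $2, $3
  }'
}

# Example usage against a live cluster (requires kubectl access):
#   kubectl get po -n kube-system | not_ready
```

Running `kubectl describe pod` and `kubectl logs` on the pods it lists should then show the readiness probe failures or container errors behind each one.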


Which service(s) is this request for? EKS

Attachments

Snip20240409_29 Snip20240409_30

Apr 10 '24 04:04 eravindar12

Is there a fix for this?

Aug 12 '24 13:08 caquinomrge

Event History in CloudTrail may provide more information about why your nodes are crashing. In our case, it was actually related to not having all the required resource tags in place.

Aug 12 '24 14:08 eddievb-moodys