[Bug] Degraded managedNodeGroups when using a pathed instanceRoleARN

Open matschaffer-roblox opened this issue 1 year ago • 6 comments

What were you trying to accomplish?

We launch EKS clusters using instanceRoleARN to attach managed policies (AmazonEKSWorkerNodePolicy, AmazonEKS_CNI_Policy, AmazonEC2ContainerRegistryReadOnly) to our node group instances.

We provided a path on these roles of "/eks/" for organizational purposes. We'd like to be able to manage these node groups, but the pathing seems to cause a degradation in node group health.

What happened?

The cluster creates as expected, but after about an hour the node group shows up as degraded.

Screenshot_2024-06-23_at_9_16_37 PM

Screenshot_2024-06-23_at_9_17_41 PM

It's a little tough to tell with the redactions, but the ARN shown in the "Affected resources" column lacks the /eks/ path prefix.

Removing the path parameter from the role seems to avoid the issue.
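Since the screenshots are redacted, the same comparison can be made from the command line. The following is a minimal sketch, assuming an AWS CLI with credentials for the account; `compare_node_role` is a hypothetical helper name, and the cluster, node group, and role names are placeholders to fill in.

```shell
#!/usr/bin/env bash
# Hypothetical helper: compare the role ARN recorded on a managed node group
# with the ARN that IAM reports for the same role (which includes any path).
compare_node_role() {
  local cluster="$1" nodegroup="$2" role_name="$3"
  local recorded actual

  # ARN that EKS stored for the node group.
  recorded="$(aws eks describe-nodegroup \
    --cluster-name "$cluster" \
    --nodegroup-name "$nodegroup" \
    --query 'nodegroup.nodeRole' --output text)"

  # ARN that IAM reports for the role; get-role takes the name without the path.
  actual="$(aws iam get-role \
    --role-name "$role_name" \
    --query 'Role.Arn' --output text)"

  echo "node group: $recorded"
  echo "IAM:        $actual"

  # Succeeds only when both ARNs are identical.
  [ "$recorded" = "$actual" ]
}
```

Calling `compare_node_role CLUSTER NODEGROUP ROLE_NAME` prints both ARNs and returns a non-zero status when they differ, which is what we would expect to see for a role created under "/eks/".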

How to reproduce it?

We use an eksctl config template like this:

managedNodeGroups:
  - name: stable-{{ .CLUSTER_NAME_WITH_HYPHENS }}
    instanceType: r5.8xlarge
    desiredCapacity: 2
    minSize: 2
    maxSize: 2
    privateNetworking: true
    volumeSize: 40
    volumeType: gp3
    volumeEncrypted: true
    labels:
      stable: "true"
    tags:
      <<: *tags
    iam:
      instanceRoleARN: {{ .STABLE_NODES_ROLE_ARN }}

Where the instance role ARN is "arn:aws:iam::ACCOUNT:role/eks/ROLE_NAME"

Logs

Output from eksctl during creation is normal.

Anything else we need to know?

What OS are you using? macOS
Are you using a downloaded binary or did you compile eksctl? Downloaded via asdf
What type of AWS credentials are you using (i.e. default/named profile, MFA)? SSO

Versions

❯ eksctl info   
eksctl version: 0.183.0
kubectl version: v1.30.2
OS: darwin

matschaffer-roblox avatar Jun 24 '24 04:06 matschaffer-roblox

Hello matschaffer-roblox :wave: Thank you for opening an issue in eksctl project. The team will review the issue and aim to respond within 1-5 business days. Meanwhile, please read about the Contribution and Code of Conduct guidelines here. You can find out more information about eksctl on our website

github-actions[bot] avatar Jun 24 '24 04:06 github-actions[bot]

Removing the /eks/ path from the role seems to be a viable workaround (arn:aws:iam::ACCOUNT:role/ROLE_NAME)

AWS support provided some steps for their reproduction of the issue:


Step 1 => I created a trust policy with the below mentioned content:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["sts:AssumeRole"],
      "Principal": {
        "Service": ["ec2.amazonaws.com"]
      }
    }
  ]
}

Step 2 => I created a role with path using the below mentioned command:

aws iam create-role --role-name test-node-role --assume-role-policy-document file://assume-role-doc.json --path /eks/

Step 3 => I created an EKS cluster and nodegroup by running "eksctl create cluster -f test.yaml" with the below mentioned config file:

apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig

metadata:
  name: my-cluster2
  region: ap-south-1
  version: "1.29"

accessConfig:
  bootstrapClusterCreatorAdminPermissions: true
  authenticationMode: API

managedNodeGroups:
  - name: ng-2
    instanceType: t3.large
    desiredCapacity: 2
    volumeSize: 20
    iam:
      instanceRoleARN: "arn:aws:iam::55555555555:role/eks/test-node-role"

Step 4 => The nodegroup that was created shows its IAM role as "arn:aws:iam::55555555555:role/test-node-role" on the EKS console. The access entry that is created automatically has the complete path "/eks/" included, but the path is stripped from the node group. The CreateNodegroup API call and CloudFormation stack show the below mentioned configuration for the node role passed:

CFN:
"NodeRole": "arn:aws:iam::55555555555:role/test-node-role",
"NodegroupName": "ng-2",

Cloudtrail:
"nodeRole": "arn:aws:iam::55555555555:role/test-node-role",
"name": "my-cluster2",

So, eksctl seems to be stripping the path from the node role, which eventually leads to health issues on the node with the error "access entry not found in cluster".
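To make the mismatch concrete, here is a small shell sketch of how a pathed role ARN decomposes and what happens when an ARN is rebuilt from the role name alone. This is an illustration only, not eksctl's actual code; the function names are hypothetical, and the ARN is the one from the reproduction above.

```shell
#!/usr/bin/env bash
# Hypothetical helpers for illustration; not taken from eksctl.

# Print the role name (the last "/"-separated segment of the ARN).
role_name() {
  local arn="$1"
  printf '%s\n' "${arn##*/}"
}

# Print the role path, or "/" when the role has no path.
role_path() {
  local resource="${1#*:role/}"   # e.g. "eks/test-node-role"
  case "$resource" in
    */*) printf '/%s/\n' "${resource%/*}" ;;
    *)   printf '/\n' ;;
  esac
}

# Rebuild an ARN from the account prefix and role name only, dropping any path.
arn_without_path() {
  local arn="$1"
  printf '%s:role/%s\n' "${arn%%:role/*}" "$(role_name "$arn")"
}

pathed="arn:aws:iam::55555555555:role/eks/test-node-role"
echo "path:    $(role_path "$pathed")"
echo "name:    $(role_name "$pathed")"
echo "rebuilt: $(arn_without_path "$pathed")"
```

The rebuilt ARN matches the value seen in the CFN and CloudTrail output, while the access entry keeps the original pathed ARN, so the two no longer refer to the same principal string.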

matschaffer-roblox avatar Jun 27 '24 19:06 matschaffer-roblox

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.

github-actions[bot] avatar Jul 28 '24 01:07 github-actions[bot]

Bump for stalebot

matschaffer-roblox avatar Jul 28 '24 20:07 matschaffer-roblox

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.

github-actions[bot] avatar Aug 28 '24 01:08 github-actions[bot]

Bump for stalebot

matschaffer-roblox avatar Aug 28 '24 19:08 matschaffer-roblox