[Bug] Degraded managedNodeGroups when using a pathed instanceRoleARN
What were you trying to accomplish?
We launch EKS clusters using instanceRoleARN to attach managed policies (AmazonEKSWorkerNodePolicy, AmazonEKS_CNI_Policy, AmazonEC2ContainerRegistryReadOnly) to our node group instances.
We provided a path on these roles of "/eks/" for organizational purposes. We'd like to be able to manage these node groups, but the pathing seems to cause a degradation in node group health.
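For context, the pathed role is provisioned roughly like this (a sketch with a placeholder role name and trust policy file, not our exact automation):

```sh
# Sketch of how a pathed node role is created; ROLE_NAME and trust-policy.json are placeholders.
aws iam create-role \
  --role-name ROLE_NAME \
  --path /eks/ \
  --assume-role-policy-document file://trust-policy.json

# Attach the managed policies the node group instances need.
for policy in AmazonEKSWorkerNodePolicy AmazonEKS_CNI_Policy AmazonEC2ContainerRegistryReadOnly; do
  aws iam attach-role-policy \
    --role-name ROLE_NAME \
    --policy-arn "arn:aws:iam::aws:policy/${policy}"
done
```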
What happened?
The cluster creates as expected, but after about an hour or so the node group shows up as degraded.
It's a little tough to tell with the redactions, but the ARN shown in the "Affected resources" column lacks the /eks/ path prefix.
Removing the path parameter from the role seems to avoid the issue.
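The degradation should also be visible from the CLI, e.g. (cluster and node group names are placeholders for ours):

```sh
# Show the health issues reported for the managed node group; names are placeholders.
aws eks describe-nodegroup \
  --cluster-name CLUSTER_NAME \
  --nodegroup-name NODEGROUP_NAME \
  --query 'nodegroup.health.issues'
```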
How to reproduce it?
We use an eksctl config template like this:
managedNodeGroups:
  - name: stable-{{ .CLUSTER_NAME_WITH_HYPHENS }}
    instanceType: r5.8xlarge
    desiredCapacity: 2
    minSize: 2
    maxSize: 2
    privateNetworking: true
    volumeSize: 40
    volumeType: gp3
    volumeEncrypted: true
    labels:
      stable: "true"
    tags:
      <<: *tags
    iam:
      instanceRoleARN: {{ .STABLE_NODES_ROLE_ARN }}
Where the instance role ARN is "arn:aws:iam::ACCOUNT:role/eks/ROLE_NAME"
Logs
Output from eksctl during creation is normal.
Anything else we need to know?
What OS are you using? macOS
Are you using a downloaded binary or did you compile eksctl? Downloaded via asdf
What type of AWS credentials are you using (i.e. default/named profile, MFA)? SSO
Versions
❯ eksctl info
eksctl version: 0.183.0
kubectl version: v1.30.2
OS: darwin
Hello matschaffer-roblox :wave: Thank you for opening an issue in the eksctl project. The team will review the issue and aim to respond within 1-5 business days. Meanwhile, please read about the Contribution and Code of Conduct guidelines here. You can find out more information about eksctl on our website.
Removing the /eks/ path from the role seems to be a viable workaround (arn:aws:iam::ACCOUNT:role/ROLE_NAME)
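Concretely, the workaround amounts to using a role created at the default "/" path, e.g. (placeholder names, same trust policy as any node role):

```sh
# Without --path, IAM uses the default "/" path, so the ARN becomes
# arn:aws:iam::ACCOUNT:role/ROLE_NAME (placeholders).
aws iam create-role \
  --role-name ROLE_NAME \
  --assume-role-policy-document file://trust-policy.json
```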
AWS support provided some steps for their reproduction of the issue:
Step 1 => I created a trust policy with the below mentioned content:
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Action": ["sts:AssumeRole"],
"Principal": {
"Service": ["ec2.amazonaws.com"]
}
}
]
}
Step 2 => I created a role with path using the below mentioned command:
aws iam create-role --role-name test-node-role --assume-role-policy-document file://assume-role-doc.json --path /eks/
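The resulting ARN should include the path and can be confirmed with a get-role call (the expected output shown in the comment is an assumption, not captured from the support session):

```sh
# Confirm the path is part of the role ARN.
# Expected: arn:aws:iam::55555555555:role/eks/test-node-role
aws iam get-role --role-name test-node-role --query Role.Arn --output text
```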
Step 3 => I created an EKS cluster and nodegroup with the below mentioned config file "eksctl create cluster -f test.yaml" :
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: my-cluster2
  region: ap-south-1
  version: "1.29"
accessConfig:
  bootstrapClusterCreatorAdminPermissions: true
  authenticationMode: API
managedNodeGroups:
  - name: ng-2
    instanceType: t3.large
    desiredCapacity: 2
    volumeSize: 20
    iam:
      instanceRoleARN: "arn:aws:iam::55555555555:role/eks/test-node-role"
Step 4 => The nodegroup that was created shows the IAM role as "arn:aws:iam::55555555555:role/test-node-role" on the EKS console. The access entry that is created automatically includes the complete "/eks/" path, but the path is stripped from the node group. The CreateNodegroup API call and the CloudFormation stack show the below mentioned configuration for the node role passed:
CFN:
"NodeRole": "arn:aws:iam::55555555555:role/test-node-role",
"NodegroupName": "ng-2",
CloudTrail:
"nodeRole": "arn:aws:iam::55555555555:role/test-node-role",
"name": "my-cluster2",
So, eksctl seems to be stripping the path from the node role, which eventually leads to health issues on the node group with the error "access entry not found in cluster".
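One way to confirm the mismatch (a suggested check, using the names from the repro above) is to compare the role the node group was created with against the cluster's access entries:

```sh
# Role ARN the managed node group actually uses (path stripped, per the CFN/CloudTrail output above):
aws eks describe-nodegroup \
  --cluster-name my-cluster2 \
  --nodegroup-name ng-2 \
  --region ap-south-1 \
  --query 'nodegroup.nodeRole' --output text

# Access entries on the cluster; the auto-created entry for the node role keeps the /eks/ path:
aws eks list-access-entries --cluster-name my-cluster2 --region ap-south-1
```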
This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.
Bump for stalebot