[EKS] : Reduction in EKS cluster creation time
EKS cluster control plane provisioning time currently averages 15 minutes. We’ll use this issue to track the ongoing improvements we are making to reduce the creation time.
Which service(s) is this request for? EKS
Reducing the upgrade time would also be nice, especially since upgrading today involves several manual steps; see https://github.com/aws/containers-roadmap/issues/600 . When I recently upgraded to 1.15 and 1.16, each control plane upgrade took something like 40–45 minutes before EKS reported that the upgrade had fully finished.
If upgrades could be faster, with the same reliability, this would be great. And it might be even more important than the creation time, assuming that clusters are upgraded several times in their lifetime.
Upgrading from 1.18 to 1.19 today took 47 minutes for the control plane according to the Terraform logs, then 34 minutes to upgrade a single very small node group (7 nodes), and then you have to manually update CoreDNS, kube-proxy and the AWS VPC CNI.
So overall you're looking at an absolute minimum of 1.5 hours, even if you're watching the thing like a hawk and not wasting any time between steps. This does seem a little crazy and unsustainable :\
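For what it's worth, the "1.5 hours minimum" is just the sum of the observed steps; the add-on figure below is my own rough assumption, since that part isn't timed by Terraform:

```python
# Back-of-envelope total for the 1.18 -> 1.19 upgrade timings above.
control_plane_min = 47   # control plane upgrade, per Terraform logs
nodegroup_min = 34       # one small managed node group (7 nodes)
addon_updates_min = 10   # assumed: manual CoreDNS/kube-proxy/CNI updates

total_min = control_plane_min + nodegroup_min + addon_updates_min
print(f"minimum upgrade time: ~{total_min / 60:.1f} hours")  # ~1.5 hours
```

And that assumes every step is kicked off the moment the previous one finishes.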
I do hope it doesn't keep getting worse with future versions too...
@billinghamj thanks for the feedback. You are being heard. On a tangent (and not as a means to respond to your specific question/need), I am wondering if you have considered using Fargate in your EKS deployments. Among other advantages, one thing that would be interesting in the context of this thread is that there are no nodes to upgrade, and that AWS embeds, as part of Fargate, service components you don't need to care about (kubelet, kube-proxy, CNI, log routers, etc.). In your specific example you would "just" have had to upgrade the control plane and CoreDNS. Just curious whether you took Fargate into account and, if you did, what made you stick with "regular" EC2 worker nodes.
@mreferre in my opinion, running Fargate in EKS is not cost-effective:
- EKS adds overhead cost for the control plane,
- Fargate makes compute power more expensive, compared to EC2.
I don't think there's much of a use case for running the main EKS workloads in Fargate. Maybe it can be used for small tools (e.g. cron jobs), but not for apps that need a lot of CPU.
Side note: Price reductions of any kind are always welcome :-)
Thanks for the feedback @heidemn. The raw compute costs of Fargate are, on average, only roughly 20% higher than standard EC2 prices (after a substantial price reduction we announced a while back).
Ironically, I think that for very tiny/small workloads Fargate isn't very cost-effective, given that the smallest pod size is 0.25 vCPU / 512 MB of memory (in most cases it is more convenient to consolidate many tiny workloads on EC2 instances). For larger workloads, however, assuming your pod utilization is high, Fargate may become cost-effective pretty quickly because there is no worker-node waste (most K8s clusters are utilized at only a fraction of their full capacity, all of which you are paying for). If you have a real-life example where you concluded that Fargate was more expensive, could you please share it? (Even offline; I can be reached at mreferre @ amazon dot com.)
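To make the utilization argument concrete, here is a toy comparison. All numbers are hypothetical, not actual AWS rates; the only grounded figure is the ~20% premium mentioned above:

```python
# Hypothetical prices for illustration only -- not real AWS rates.
ec2_per_vcpu_hour = 0.040      # assumed raw EC2 price per vCPU-hour
fargate_per_vcpu_hour = 0.048  # ~20% premium, per the comment above

utilization = 0.5  # many K8s clusters use only a fraction of capacity

# On EC2 you pay for the whole node whether pods use it or not, so the
# effective cost per *used* vCPU-hour scales with 1/utilization.
effective_ec2 = ec2_per_vcpu_hour / utilization

print(effective_ec2)                          # 0.08
print(effective_ec2 > fargate_per_vcpu_hour)  # True: Fargate cheaper here
```

The crossover obviously depends on your actual utilization; at very high EC2 utilization the comparison flips back.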
Also, I did not mean this to become a distraction from the original question in this thread.
Thanks!
PS: here are a few more considerations around EKS/Fargate economics. I'd like to understand where these assumptions aren't correct (we want to learn more about practical cases where they don't apply).
Thanks, that blog post is definitely helpful. I will give Fargate on EKS a try soon.
To close this (off-)topic: What I could imagine is that starting a Pod might be slower on Fargate than on EC2 (if the instance is already running). But the benefits of better isolation and not having to maintain servers are definitely not bad.
Our services tend to use around 30MB RAM, are IO bound, and we run hundreds/thousands of them on our non-prod cluster.
Even when using exclusively ARM instance types on spot, the pods-per-node limits mean EKS is doing pretty poorly for us cost-wise right now (with Rancher, we ran the entire thing on a single instance with no performance issues). Obviously Fargate would exacerbate this massively.
Aside from that, we are generally happier having a bit more control, and want something as close as possible to a "plain vanilla" K8s setup. History has taught us not to trust AWS too much when it comes to behind-the-scenes magic. Before managed spot instances were available, we were quite happy with self-managed nodes too.
On principle, the lack of ARM support is also a complete non-starter for us. We want to push hard for that future, so we're voting with our feet.
@billinghamj this makes a lot of sense. Thanks for the feedback.
We have now introduced a change for 1.19 clusters that reduces control plane creation time by 40%, enabling you to create an EKS control plane in 9 minutes on average. The improvement is also coming to the other supported Kubernetes versions in a few weeks.
@kirtichandak Does this just apply to cluster creation or are some of these improvements going to be seen in upgrading a cluster version?
The change that reduces control plane creation time by 40% is now available for all EKS-supported versions. This enables you to create an EKS control plane in 9 minutes on average.
We are currently working on reducing this time further and we'll keep using this issue to track upcoming improvements.
@kirtichandak Is this improvement specific to certain regions?
Is there any news (or another issue) on improving the control plane upgrade time? Taking in excess of 45 minutes to upgrade a managed system that can be created from scratch in less than 10 minutes isn't great, and in practice is almost unworkable.
Has there been a regression? It seems that all my cluster creations at 1.21 take approximately 14 minutes :(
This is my observation as well.
I just created a 1.21 cluster and it took 14 minutes and 10 seconds to complete.
I think the 14-minute figure covers both the control plane and the worker nodes.
What would also help greatly with EKS cluster rollouts is allowing concurrent operations on the cluster, e.g. creating two Fargate profiles, enabling control plane logging, and adding an OIDC provider at the same time. Currently we have to use waiters in our CF stack code to create all of these things sequentially.
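Since EKS allows only one in-flight mutation per cluster, orchestration code ends up serializing every operation behind a waiter. A minimal sketch of that pattern, with a stubbed status lookup (the `describe` callable stands in for e.g. boto3's `describe_cluster` or `describe_fargate_profile`; all names here are illustrative):

```python
import time

def wait_until_active(describe, poll_seconds=10, max_polls=120):
    """Poll `describe()` until it returns "ACTIVE".

    Each EKS mutation must finish before the next one can start,
    so callers run these waits back to back rather than in parallel.
    """
    for _ in range(max_polls):
        if describe() == "ACTIVE":
            return True
        time.sleep(poll_seconds)
    raise TimeoutError("resource never became ACTIVE")

# Stubbed demo: the status flips to ACTIVE on the third poll.
states = iter(["CREATING", "CREATING", "ACTIVE"])
print(wait_until_active(lambda: next(states), poll_seconds=0))  # True
```

With concurrent operations allowed, these waits could run in parallel instead of adding up linearly.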
Our focus recently has been reducing the time for cluster updates. We are in the process of rolling out changes that will reduce cluster version upgrade time down to ~12 minutes. After that completes, we'll roll out the same improved update workflow for OIDC provider associations and KMS encryption updates.
I just created a 1-node cluster (1.21) in Oregon; it took 21 minutes end-to-end.
Seeing the same thing on our end. We're halfway to deciding to move to Rancher entirely. :-\
Unfortunately for us, we're so invested in EKS and all the work we put into it that sunk cost keeps me from getting the OK from our CTO to make the jump.
Seeing creation times over 20 minutes all the time
@mikestef9 do you have any progress to report on the encryption upgrade times? We're seeing this consistently take over an hour for EKS v1.22 clusters.
16mins to create cluster and then another 10mins to get any nodes in.
I have been following this thread for almost 2 years, and it does not appear EKS creation times have gotten any better. I consistently see 15–20 minutes. Being able to spin up and tear down EKS clusters quickly would go a long way toward getting people to use it more. The overhead of running a cluster for extended periods is enough of a headache that many of our clients decide it's better to use competing tech.
I suspect AWS is not motivated to really reduce the time - if they did then people would start creating EKSes, do their work and tear them down.
@przemolb - I used to work at AWS. I can vouch for the fact that I never heard a single product team, or a single engineer, ever say "let's do X in order to lock the customer in / make it hard for them to achieve Y."
I think a primary reason for the slowdown in cluster provisioning is probably the Auto Scaling API. The Auto Scaling API and Auto Scaling groups are notoriously old and slow.
Two other components in the EKS stack that slow things down are: 1/ OIDC Provider creation/association, 2/ KMS Encryption & key association
If you want to be able to create/destroy K8s clusters very quickly, while still benefiting from all the nice things EKS provides, you can create an EKS cluster, configure it to use Karpenter instead of cluster-autoscaler, and then combine that architecture with VCluster.
This is the architecture I've developed for our Dev & QA environments.
- Create a "Dev" EKS cluster
- Install/configure Karpenter (it's straightforward and not difficult if you follow the docs and know what you're doing)
- Use VCluster to schedule/deploy "nested clusters" inside of it
When idle (or when no workloads/environments are scheduled), the Dev/QA EKS clusters run just 1 single server (cheap).
When we need to "spin up" a new "cluster" for a specific workload (or because we need another "environment" for testing), I use VCluster to schedule that into the EKS cluster.
Because Karpenter is blazing-fast, those nodes and containers usually come up within 58 seconds (yes, we timed them).
When done, we tear them down in seconds, and the EKS cluster goes back to just idling on a single small node.
This way, you can "create a cluster" - via VCluster - in less than 2 minutes, then destroy it in just a few minutes as well. The EKS cluster that hosts the "nested" VCluster(s) can have a static node group with 1 small node in it, keeping it running all the time while staying very cheap in monthly costs.
Thanks @armenr for the hint about VCluster - it seems like a really good workaround for the time required to spin up a new EKS cluster.
In case this helps or is useful to anyone:
Launching EKS clusters in us-west-2 is averaging ~12 minutes this week.
Launching EKS v1.23 clusters in eu-west-1 is also averaging ~12 minutes this week.