
[EKS] : Reduction in EKS cluster creation time

Open kirtichandak opened this issue 5 years ago • 41 comments

EKS cluster control plane provisioning time currently averages 15 minutes. We’ll use this issue to track the ongoing improvements we are making to reduce the creation time.

Which service(s) is this request for? EKS

kirtichandak avatar Jan 15 '21 23:01 kirtichandak

Reducing the upgrade time would also be nice, especially since the upgrade today involves several manual steps, see https://github.com/aws/containers-roadmap/issues/600 . When I recently upgraded to 1.15 and 1.16, I remember something like 40–45 minutes per control plane upgrade before EKS reported that the upgrade had fully finished.

If upgrades could be faster, with the same reliability, this would be great. And it might be even more important than the creation time, assuming that clusters are upgraded several times in their lifetime.

heidemn avatar Jan 16 '21 22:01 heidemn

Upgrading from 1.18 to 1.19 today took 47 mins according to Terraform logs for the control plane, then 34 mins to upgrade a single very small nodegroup (7 nodes), and then you have to manually update core-dns, kube-proxy and aws-cni.

So overall you're looking at an absolute minimum of 1.5 hours if you're watching the thing like a hawk and not wasting any time between steps. This does seem a little crazy and unsustainable :\

I do hope it doesn't keep getting worse with future versions too...
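For concreteness, the three-step upgrade path described above can be sketched with boto3's EKS operations. This is an illustrative sketch, not an official workflow: the client is passed in so the flow can be exercised with a stub, and the cluster/nodegroup names are placeholders.

```python
import time

def wait_update(eks, cluster: str, update_id: str, poll: float = 30.0, **extra) -> str:
    """Poll DescribeUpdate until the update leaves InProgress; return final status."""
    while True:
        update = eks.describe_update(name=cluster, updateId=update_id, **extra)["update"]
        if update["status"] != "InProgress":
            return update["status"]
        time.sleep(poll)

def upgrade(eks, cluster: str, nodegroup: str, version: str, poll: float = 30.0) -> None:
    # 1. Control plane (the ~47 min step above).
    u = eks.update_cluster_version(name=cluster, version=version)
    assert wait_update(eks, cluster, u["update"]["id"], poll) == "Successful"
    # 2. Managed nodegroup (the ~34 min step above).
    u = eks.update_nodegroup_version(clusterName=cluster, nodegroupName=nodegroup)
    assert wait_update(eks, cluster, u["update"]["id"], poll,
                       nodegroupName=nodegroup) == "Successful"
    # 3. core-dns, kube-proxy and aws-cni still have to be updated by hand
    #    (kubectl apply of version-matched manifests) -- the manual part
    #    this thread is complaining about.
```

With `eks = boto3.client("eks")` this would drive a real upgrade; with a stub client it simply documents the forced serialization of the steps.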

billinghamj avatar Feb 20 '21 12:02 billinghamj

@billinghamj thanks for the feedback. You are being heard. On a tangent (and not as a means to respond to your specific question/need), I am wondering if you have considered using Fargate in your EKS deployments. Among other advantages, one thing that would be interesting in the context of this thread is that there are no nodes to upgrade, and that AWS manages, as part of Fargate, service components you don't need to care about (kubelet/proxy, CNI, log routers, etc.). In your specific example you would "just" have had to upgrade the control plane and core-dns. Just curious if you took Fargate into account and, if you did, what made you stick with "regular" EC2 worker nodes.

mreferre avatar Feb 21 '21 10:02 mreferre

@mreferre in my opinion, running Fargate in EKS is not cost-effective:

  • EKS adds overhead cost for the control plane,
  • Fargate makes compute power more expensive, compared to EC2.

I don't think there's much of a use case to run the main EKS workloads in Fargate. Maybe it can be used for small tools (e.g. cron jobs), but not for apps that need a lot of CPU.

Side note: Price reductions of any kind are always welcome :-)

heidemn avatar Feb 21 '21 18:02 heidemn

Thanks for the feedback @heidemn. The raw compute costs of Fargate are (on average) only roughly 20% higher than standard EC2 prices (after a considerable price reduction we announced a while back).

Ironically, I think that for very tiny/small workloads, Fargate isn't very cost effective, given that the smallest pod size is 0.25 vCPU/512MB of memory (and in most cases it would be more convenient to consolidate many tiny workloads on EC2 instances). However, for larger workloads, assuming your pod utilization is high, Fargate may become cost effective pretty quickly, given there is no worker-node waste (most K8s clusters are utilized at only a fraction of their full capacity, all of which you are paying for). If you have a real-life example where you concluded that Fargate was more expensive, could you please share it (even offline; I can be reached at mreferre @ amazon dot com)?

Also, I did not mean this to become a distraction from the original question in this thread.

Thanks!

PS: here are a few more considerations around EKS/Fargate economics. I'd like to understand where these assumptions aren't correct (we want to learn more about practical cases where they don't apply).
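The minimum-pod-size point can be made concrete with a little arithmetic. The per-vCPU and per-GB rates below are illustrative assumptions for the sketch, not authoritative AWS pricing:

```python
# Fargate bills per pod, with a floor of 0.25 vCPU / 0.5 GB per pod.
# The hourly rates here are assumptions for illustration only.
FARGATE_VCPU_HOUR = 0.04048
FARGATE_GB_HOUR = 0.004445

def fargate_pod_hourly(vcpu: float, mem_gb: float) -> float:
    vcpu = max(vcpu, 0.25)     # billed minimum: 0.25 vCPU
    mem_gb = max(mem_gb, 0.5)  # billed minimum: 0.5 GB
    return vcpu * FARGATE_VCPU_HOUR + mem_gb * FARGATE_GB_HOUR

# A 30 MB, IO-bound microservice pays for the full minimum slice,
# i.e. exactly the same as a 0.25 vCPU / 0.5 GB pod:
tiny = fargate_pod_hourly(vcpu=0.05, mem_gb=0.03)
# A well-utilized 2 vCPU / 4 GB pod wastes nothing on the floor:
big = fargate_pod_hourly(vcpu=2.0, mem_gb=4.0)
```

This is the trade-off discussed above: many tiny pods are cheaper consolidated on EC2, while large, well-utilized pods can be competitive on Fargate because there is no idle node capacity to pay for.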

mreferre avatar Feb 21 '21 18:02 mreferre

Thanks, that blog post is definitely helpful. I will give Fargate on EKS a try soon.

To close this (off-)topic: What I could imagine is that starting a Pod might be slower on Fargate than on EC2 (if the instance is already running). But the benefits of better isolation and not having to maintain servers are definitely not bad.

heidemn avatar Feb 21 '21 21:02 heidemn

Our services tend to use around 30MB RAM, are IO bound, and we run hundreds/thousands of them on our non-prod cluster.

Even when using exclusively ARM instance types on spot, due to the pods-per-node limits, EKS is doing pretty poorly for us right now cost-wise (with Rancher, we ran the entire thing on a single instance with no performance issues). Obviously Fargate would exacerbate this massively

Aside from that, we generally are happier having a bit more control, and want as close as possible to a "plain vanilla" K8s setup. History has told us not to trust AWS too much when it comes to behind-the-scenes magic. Before managed spot instances were available, we were quite happy with self-managed nodes too

On principle, lack of ARM support is also a complete non-starter for us. We want to push for that future hard, so we're voting with our feet

billinghamj avatar Feb 21 '21 23:02 billinghamj

@billinghamj this makes a lot of sense. Thanks for the feedback.

mreferre avatar Feb 22 '21 08:02 mreferre

We have now introduced a change for 1.19 clusters that reduces control plane creation time by 40%, enabling you to create an EKS control plane in 9 minutes on average. The improvement is also coming to other supported Kubernetes versions in a few weeks.

kirtichandak avatar Feb 25 '21 20:02 kirtichandak

@kirtichandak Does this just apply to cluster creation or are some of these improvements going to be seen in upgrading a cluster version?

mveitas avatar Feb 25 '21 21:02 mveitas

The change to reduce control plane creation time by 40% is now available for all EKS supported versions. This enables you to create an EKS control plane in 9 minutes on average.

We are currently working on reducing this time further and we'll keep using this issue to track upcoming improvements.

kirtichandak avatar Mar 19 '21 18:03 kirtichandak

@kirtichandak Is this improvement specific to certain regions?

kkapoor1987 avatar Apr 22 '21 12:04 kkapoor1987

Is there any news (or another issue) on improving the control plane upgrade time? Taking in excess of 45 mins to upgrade a managed system that can be created from scratch in less than 10 mins isn't great, and in practice is almost unworkable.

stevehipwell avatar Jun 21 '21 16:06 stevehipwell

> We have now introduced a change for 1.19 clusters that reduces control plane creation time by 40%, enabling you to create an EKS control plane in 9 minutes on average. The improvement is also coming to other supported Kubernetes versions in a few weeks.

Has there been a regression? It seems that all my cluster creations at 1.21 take approximately 14 minutes :(

mhulscher avatar Oct 14 '21 11:10 mhulscher

> > We have now introduced a change for 1.19 clusters that reduces control plane creation time by 40%, enabling you to create an EKS control plane in 9 minutes on average. The improvement is also coming to other supported Kubernetes versions in a few weeks.
>
> Has there been a regression? It seems that all my cluster creations at 1.21 take approximately 14 minutes :(

This is my observation as well.

przemolb avatar Nov 04 '21 16:11 przemolb

I'm in the process of creating a 1.21 cluster and it took 14 minutes and 10 seconds to complete.

samsen1 avatar Dec 07 '21 04:12 samsen1

I think the 14-minute figure covers both the control plane and the worker nodes.

przemolb avatar Dec 07 '21 17:12 przemolb

What would also help greatly with EKS cluster rollouts would be allowing concurrent operations on the cluster, e.g. creating two Fargate profiles, enabling control plane logging, and adding an OIDC provider at the same time. Currently we have to use waiters in our CF stack code to create all of these sequentially.
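The sequential-waiter pattern being described looks roughly like this. It is a sketch with the status fetcher injected, so it is not tied to a particular client; in CloudFormation the waiters are implicit, but the serialization is the same.

```python
import time
from typing import Callable

def wait_until_active(get_status: Callable[[], str],
                      poll: float = 30.0,
                      timeout: float = 3600.0) -> int:
    """Block until get_status() returns 'ACTIVE'; return the number of polls.

    get_status is injected; against AWS it could be, for example,
    lambda: eks.describe_cluster(name=cluster)["cluster"]["status"].
    Every post-creation step (Fargate profiles, logging config, OIDC
    provider) currently has to run behind a wait like this, one at a time.
    """
    deadline = time.monotonic() + timeout
    polls = 0
    while time.monotonic() < deadline:
        polls += 1
        if get_status() == "ACTIVE":
            return polls
        time.sleep(poll)
    raise TimeoutError("resource did not become ACTIVE in time")
```

Allowing concurrent mutations server-side would let these steps overlap instead of each paying the full wait.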

gbvanrenswoude avatar Jan 15 '22 18:01 gbvanrenswoude

Our focus recently has been reducing the time for cluster updates. We are in the process of rolling out changes that will reduce cluster version upgrade time down to ~12 minutes. After that completes, we'll roll out the same improved update workflow for OIDC provider associations and KMS encryption updates.

mikestef9 avatar Feb 22 '22 16:02 mikestef9

I just created a 1-node cluster (1.21) in Oregon; it took 21 min end-to-end.

cdharma avatar Mar 22 '22 07:03 cdharma

> > We have now introduced a change for 1.19 clusters that reduces control plane creation time by 40%, enabling you to create an EKS control plane in 9 minutes on average. The improvement is also coming to other supported Kubernetes versions in a few weeks.
>
> Has there been a regression? It seems that all my cluster creations at 1.21 take approximately 14 minutes :(

Seeing the same thing on our end. We're halfway into the idea of moving to Rancher entirely. :-\

Unfortunately for us, we're so invested in EKS and all the work we put into it that sunk cost keeps me from getting the OK from our CTO to make the jump.

armenr avatar May 04 '22 14:05 armenr

Seeing creation times over 20 minutes all the time

matti avatar May 04 '22 16:05 matti

@mikestef9 do you have any progress to report on the encryption upgrade times? We're seeing this consistently take over an hour for EKS v1.22 clusters.

stevehipwell avatar Aug 11 '22 10:08 stevehipwell

16 minutes to create the cluster and then another 10 minutes to get any nodes in.

matti avatar Aug 18 '22 12:08 matti

I have been following this thread for almost 2 years and it does not appear EKS has gotten any better in creation times. I consistently see 15–20 minutes. Being able to spin up and tear down EKS clusters quickly would go a long way toward getting people to use it more. The overhead of running a cluster for extended periods makes it so much of a headache that many of our clients decide it's better to use competing tech.

hangtime79 avatar Sep 29 '22 01:09 hangtime79

I suspect AWS is not really motivated to reduce the time - if they did, people would start creating EKS clusters, doing their work, and tearing them down.

przemolb avatar Sep 29 '22 09:09 przemolb

@przemolb - I used to work at AWS. I can vouch for the fact that I never heard a single product team, or a single engineer ever say "let's do << X >> in order to lock the customer in/make it hard for them to achieve <<Y>>."

I think a primary reason for the slowdown in cluster provisioning is probably the AutoScaler API. The AutoScaler API and AutoScale Groups are notoriously OLD and SLOW.

Two other components in the EKS stack that slow things down are: 1/ OIDC Provider creation/association, 2/ KMS Encryption & key association

IF you want to be able to create/destroy K8s clusters VERY quickly while still benefiting from all the nice things EKS provides, you can create an EKS cluster, configure it to use Karpenter instead of cluster-autoscaler, and then COMBINE that architecture with VCluster.

This is the architecture I've developed for our Dev & QA environments.

  1. Create a "Dev" EKS cluster
  2. Install/configure Karpenter (it's straightforward if you follow the docs and know what you're doing)
  3. Use VCluster to schedule/deploy "nested clusters" inside of it

When idle (or when no workloads/environments are scheduled), the Dev/QA EKS clusters run just a single small node (cheap).

When we need to "spin up" a new "cluster" for a specific workload (or because we need another "environment" for testing), I use VCluster to schedule that into the EKS cluster.

Because Karpenter is blazing-fast, those nodes and containers usually come up within 58 seconds (yes, we timed them).

When done, we tear them down in seconds, and the EKS cluster goes back to just idling on a single small node.

This way, you can "create a cluster" - via VCluster - in less than 2 minutes, then destroy it in just a few minutes as well. The EKS cluster that hosts the other "nested" VCluster(s) can have a static nodegroup with 1 small node in it, to keep it running all the time, but keep it very cheap (monthly costs) as well.
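The spin-up/tear-down loop in steps 1–3 can be scripted. This is a sketch that shells out to the vcluster CLI (command names per its docs; treat the exact flags as assumptions), with the runner injected so the sequence can be dry-run without a cluster:

```python
import subprocess
from typing import Callable

Runner = Callable[..., object]

def vcluster_up(name: str, namespace: str, run: Runner = subprocess.run) -> None:
    """Create a nested 'cluster' inside the host EKS cluster; Karpenter then
    brings up capacity for its workloads in under a minute (per the timings
    quoted above)."""
    run(["vcluster", "create", name, "--namespace", namespace], check=True)

def vcluster_down(name: str, namespace: str, run: Runner = subprocess.run) -> None:
    """Tear the nested cluster down; the host cluster scales back to its
    single small static node."""
    run(["vcluster", "delete", name, "--namespace", namespace], check=True)
```

With the default runner this drives the real CLI; passing a fake runner lets you verify the sequence (or wire it into CI) without touching AWS.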

armenr avatar Sep 30 '22 11:09 armenr

Thanks @armenr for the hint about VCluster - it seems like a really good workaround for the time required to spin up a new EKS cluster.

przemolb avatar Sep 30 '22 15:09 przemolb

In case this helps or is useful to anyone:

Launching EKS clusters in us-west-2 is averaging ~12 minutes this week.

armenr avatar Oct 13 '22 11:10 armenr

Launching EKS v1.23 clusters in eu-west-1 is also averaging ~12 minutes this week.

stevehipwell avatar Oct 13 '22 15:10 stevehipwell