Fix AKS Scaling (Cluster Autoscaler)

Open mithunshanbhag opened this issue 2 years ago • 0 comments

We're investigating the following options for AKS scaling:

1. ACI VIRTUAL NODES

Status

Currently BLOCKED.

Where

Change Description

  • Redeployed the AKS cluster via the bicep template from the mithun/hpa2 branch, which uses the Azure CNI network plugin (instead of the default kubenet plugin).
  • Had to manually modify the AKS cluster's vnet to create a new subnet, aci-subnet, with address space 10.255.0.0/16.
  • Attached the subnet to the existing AKS cluster using az aks enable-addons (full instructions here).
  • Applied the Deployment.yaml manifest from the mithun/hpa2 branch, which adds nodeSelector and tolerations settings so that the pods run only on virtual nodes.
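
The steps above can be sketched roughly as the script below. This is a minimal sketch, not the exact commands that were run: the resource group, cluster, and vnet names are hypothetical placeholders, and the nodeSelector / tolerations fragment follows the pattern from Microsoft's virtual-nodes documentation, so the actual values in the mithun/hpa2 branch may differ.

```shell
#!/bin/sh
set -eu

# Hypothetical names: replace with the real resource group, cluster, and vnet.
RESOURCE_GROUP="${RESOURCE_GROUP:-my-resource-group}"
CLUSTER_NAME="${CLUSTER_NAME:-my-aks-cluster}"
VNET_NAME="${VNET_NAME:-my-aks-vnet}"
ACI_SUBNET="aci-subnet"      # subnet name from the steps above
ACI_PREFIX="10.255.0.0/16"   # address space from the steps above

# Print each command by default; set RUN=1 to actually execute it.
run() {
  if [ "${RUN:-0}" = "1" ]; then "$@"; else printf '%s\n' "$*"; fi
}

# 1. Add the dedicated ACI subnet to the cluster's vnet.
run az network vnet subnet create \
  --resource-group "$RESOURCE_GROUP" \
  --vnet-name "$VNET_NAME" \
  --name "$ACI_SUBNET" \
  --address-prefixes "$ACI_PREFIX"

# 2. Enable the virtual-node add-on and point it at that subnet.
run az aks enable-addons \
  --resource-group "$RESOURCE_GROUP" \
  --name "$CLUSTER_NAME" \
  --addons virtual-node \
  --subnet-name "$ACI_SUBNET"

# 3. Print the scheduling fragment that would sit under spec.template.spec
#    in Deployment.yaml (assumed values, taken from the documented pattern).
cat <<'EOF'
nodeSelector:
  kubernetes.io/role: agent
  kubernetes.io/os: linux
  type: virtual-kubelet
tolerations:
  - key: virtual-kubelet.io/provider
    operator: Exists
  - key: azure.com/aci
    effect: NoSchedule
EOF
```

Run as-is, the script only prints the two az commands and the YAML fragment; running it with RUN=1 executes the az commands against the current subscription.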

Issue Details

The pods (configured to run on ACI virtual nodes) are stuck in the Waiting state.

image

The logs show only that no active endpoint is being detected for the services / ingress.

image

Hypothesis

  • Could be related to the switch to the Azure CNI network plugin from the default kubenet plugin.
  • Could be related to the nodeSelector and tolerations changes made in the Deployment.yaml file to make the pods run only on virtual nodes.
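
One way to start narrowing down the two hypotheses above is sketched below. It is only a suggestion, not something that has been run for this issue: the resource group, cluster, and pod names are hypothetical, and the type=virtual-kubelet label is assumed to be the one the nodeSelector targets.

```shell
#!/bin/sh
set -eu

# Hypothetical names: replace with the real resource group, cluster, and pod.
RESOURCE_GROUP="${RESOURCE_GROUP:-my-resource-group}"
CLUSTER_NAME="${CLUSTER_NAME:-my-aks-cluster}"
POD_NAME="${POD_NAME:-my-stuck-pod}"

# Print each command by default; set RUN=1 to actually execute it.
run() {
  if [ "${RUN:-0}" = "1" ]; then "$@"; else printf '%s\n' "$*"; fi
}

# Hypothesis 1: confirm which network plugin the cluster reports (azure or kubenet).
run az aks show \
  --resource-group "$RESOURCE_GROUP" \
  --name "$CLUSTER_NAME" \
  --query networkProfile.networkPlugin \
  --output tsv

# Hypothesis 2: confirm a virtual node is registered and Ready.
# Assumption: the virtual node carries the label type=virtual-kubelet.
run kubectl get nodes --selector type=virtual-kubelet --output wide

# The Events section usually says why a pod is waiting
# (unschedulable, image pull failure, and so on).
run kubectl describe pod "$POD_NAME"
```

If no node matches the selector, the nodeSelector in Deployment.yaml cannot be satisfied, which would point at the add-on setup rather than at the manifest.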

2. CLUSTER AUTOSCALER

Status

Currently INVESTIGATING.

Where

  • The changes are in my fork in [mithun/enable-autoscal](mithunshanbhag:mithun/cluster-autoscaler) branch (See PR microsoft/ContosoTraders#81)
  • Deployed in Jithin's MSDN subscription.

Change Description

  • Enabled autoscaling with minCount: 1 and maxCount: 10.
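
For reference, an Azure CLI equivalent of this change might look like the sketch below. The actual change lives in the branch linked above and was presumably made in the deployment template, so this is an illustration only; the resource group and cluster names are hypothetical. As far as I know, az aks update applies these flags only when the cluster has a single node pool; otherwise az aks nodepool update takes the same flags per pool.

```shell
#!/bin/sh
set -eu

# Hypothetical names: replace with the real resource group and cluster.
RESOURCE_GROUP="${RESOURCE_GROUP:-my-resource-group}"
CLUSTER_NAME="${CLUSTER_NAME:-my-aks-cluster}"
MIN_COUNT=1    # minCount from the change description
MAX_COUNT=10   # maxCount from the change description

# Print each command by default; set RUN=1 to actually execute it.
run() {
  if [ "${RUN:-0}" = "1" ]; then "$@"; else printf '%s\n' "$*"; fi
}

# Turn on the cluster autoscaler with the node-count bounds above.
run az aks update \
  --resource-group "$RESOURCE_GROUP" \
  --name "$CLUSTER_NAME" \
  --enable-cluster-autoscaler \
  --min-count "$MIN_COUNT" \
  --max-count "$MAX_COUNT"

# Read the settings back from each node pool to confirm they took effect.
run az aks show \
  --resource-group "$RESOURCE_GROUP" \
  --name "$CLUSTER_NAME" \
  --query "agentPoolProfiles[].{name:name, autoscale:enableAutoScaling, min:minCount, max:maxCount}" \
  --output table
```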

Issue Details

  • Load test has a high failure rate. This issue is being tracked separately in microsoft/Contoso-Traders-Cloud-Testing#3

    image

  • The pods are also not scaling out (this could be related to the issue above).

    image
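
One thing worth separating while investigating: the cluster autoscaler only adds nodes when pods cannot be scheduled, while the number of pod replicas is driven by a HorizontalPodAutoscaler (HPA) or by manual scaling. Whether an HPA is deployed here is an assumption (the mithun/hpa2 branch name suggests it, but this issue does not confirm it). A minimal set of read-only checks could look like this:

```shell
#!/bin/sh
set -eu

# Print each command by default; set RUN=1 to actually execute it.
run() {
  if [ "${RUN:-0}" = "1" ]; then "$@"; else printf '%s\n' "$*"; fi
}

# Is there an HPA at all, and what are its current and target metrics?
run kubectl get hpa --all-namespaces

# Pending pods are what trigger the cluster autoscaler to add nodes.
run kubectl get pods --all-namespaces --field-selector status.phase=Pending

# Has the node count actually changed under load?
run kubectl get nodes
```

If the HPA shows unknown current metrics, the replica count will not change regardless of how the cluster autoscaler is configured.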

Hypothesis

Currently none, still investigating.

Misc Notes

The ingress controller was stuck in the PENDING state for a few minutes after provisioning, then moved to the OK state on its own.

image

mithunshanbhag, Mar 07 '23 15:03