[BUG] Intermittent `OSProvisioningTimedOut` errors when creating VMs for AKS cluster

Open Tsonov opened this issue 1 year ago • 4 comments

Describe the bug

Our scenario:

  • We create VMs for customers and join them to their AKS clusters to scale up capacity as needed.
  • We use the AKS community images for the nodes (example gallery for westeurope: /CommunityGalleries/aksubuntu-38d80f77-467a-481f-a8d4-09b6d4220bd2/Images/2204gen2containerd). We always pick the latest image version published there.
  • The issue happens in multiple regions (eastus2, westeurope, centralus), so it does not appear to correlate with a particular region.

The problem is that we sometimes get the error below and are not sure how to detect it or react to it properly. Because the error takes ~20 minutes to surface, scaling can become very slow and lead to downtime for customers. We see this error around 3-10 times per day while adding nodes roughly every 1-2 seconds, so the failure rate is very low. We would still like to understand how to fix it or work around it.

https://management.azure.com/subscriptions/xxxx/providers/Microsoft.Compute/locations/centralus/operations/xxxx
--------------------------------------------------------------------------------
RESPONSE 200: 200 OK
ERROR CODE: OSProvisioningTimedOut
--------------------------------------------------------------------------------
{
  "startTime": "2024-09-18T23:49:48.4348637+00:00",
  "endTime": "2024-09-19T00:10:01.2801821+00:00",
  "status": "Failed",
  "error": {
    "code": "OSProvisioningTimedOut",
    "message": "OS Provisioning for VM 'XXXXXX' did not finish in the allotted time. The VM may still finish provisioning successfully. Please check provisioning state later. Also, make sure the image has been properly prepared (generalized).\r\n * Instructions for Windows: https://azure.microsoft.com/documentation/articles/virtual-machines-windows-upload-image/ \r\n * Instructions for Linux: https://azure.microsoft.com/documentation/articles/virtual-machines-linux-capture-image/ \r\n * If you are deploying more than 20 Virtual Machines concurrently, consider moving your custom image to shared image gallery. Please refer to https://aka.ms/movetosig for the same.",
    "target": "0"
  },
  "name": "xxxx"
}
--------------------------------------------------------------------------------
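For anyone wondering how a response like the one above could be detected programmatically, here is a minimal Python sketch. Only the field names `status` and `error.code` and the value `OSProvisioningTimedOut` come from the payload above. The function name `classify_operation`, the `RETRYABLE_CODES` set, the return labels, and the assumption that a finished operation reports `Succeeded` are illustrative and not part of any Azure SDK.

```python
import json

# Error codes we choose to treat as transient. This set is an assumption
# for illustration; only OSProvisioningTimedOut appears in this issue.
RETRYABLE_CODES = {"OSProvisioningTimedOut"}


def classify_operation(body: str) -> str:
    """Map an operation-status JSON body to 'ok', 'pending', 'retry' or 'fail'."""
    op = json.loads(body)
    status = op.get("status")
    if status == "Succeeded":
        return "ok"
    if status != "Failed":
        # Any non-terminal status is treated as still in progress.
        return "pending"
    # "error" may be missing or null, so fall back to an empty dict.
    code = (op.get("error") or {}).get("code", "")
    return "retry" if code in RETRYABLE_CODES else "fail"
```

Keeping the retryable codes in one set makes it easy to add other transient errors later without touching the control flow.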

Our questions are:

  • Can we improve something on our side to reduce the likelihood of this error?
  • Can we specify a timeout to avoid waiting 20 minutes before getting this error?
  • Is this a known issue on the Azure side in general?

To Reproduce

Unfortunately, we cannot reproduce the error reliably.

Expected behavior

No error when provisioning, or clear instructions on how to avoid the long timeout wait.

Screenshots N/A

Environment (please complete the following information): N/A

  • CLI Version - N/A
  • Kubernetes version - multiple
  • CLI Extension version - N/A
  • Browser - N/A

Additional context N/A

Tsonov avatar Sep 19 '24 10:09 Tsonov

Are you not using cluster auto scaler or node auto provision to add the nodes?

PixelRobots avatar Sep 19 '24 17:09 PixelRobots

No, we create the nodes and join them to the cluster ourselves. The approach is similar to Karpenter for Azure (and it uses the same VM images), but not identical. VMs are created as single-node VMSS.

Tsonov avatar Sep 19 '24 18:09 Tsonov

It sounds like what you are doing is not a supported way to add nodes to an AKS cluster. Have you opened a support ticket? It might be a VM compute issue rather than an AKS issue.

PixelRobots avatar Sep 20 '24 12:09 PixelRobots

Good call, we opened a support ticket.

It's possible this is not the right repo for this issue. We are not sure whether the problem comes from the VHD image build (perhaps from https://github.com/Azure/AgentBaker), from the Azure Compute layer that provisions the VM, or from the node trying to join an AKS cluster.

Tsonov avatar Sep 20 '24 13:09 Tsonov

Action required from @aritraghosh, @julia-yin, @AllenWen-at-Azure

Issue needing attention of @Azure/aks-leads

Closing this. We added proper timeouts to catch failed OS provisioning, along with retries. According to the Azure support ticket, this looks like an intermittent issue that can sometimes happen when provisioning a VM.

Tsonov avatar Dec 04 '24 12:12 Tsonov
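The thread does not show how the timeouts and retries were implemented. A minimal Python sketch of the general pattern is below, assuming the caller supplies three hypothetical callables: `start` (begins creating a VM and returns an identifier), `poll` (returns a provisioning state string) and `cleanup` (deletes a VM that did not come up in time). The state strings `Succeeded` and `Failed`, the default deadline and the attempt count are also assumptions, not values from the support ticket.

```python
import time
from typing import Callable


class ProvisioningFailed(Exception):
    """Raised when every attempt timed out or failed."""


def provision_with_deadline(
    start: Callable[[], str],
    poll: Callable[[str], str],
    cleanup: Callable[[str], None],
    deadline_s: float = 300.0,
    max_attempts: int = 3,
    poll_interval_s: float = 5.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Create a VM, giving up on each attempt after our own deadline."""
    for _ in range(max_attempts):
        vm_id = start()
        give_up_at = clock() + deadline_s
        while clock() < give_up_at:
            state = poll(vm_id)
            if state == "Succeeded":
                return vm_id
            if state == "Failed":
                break  # no point waiting for the deadline
            sleep(poll_interval_s)
        # The attempt timed out or failed: remove the VM before retrying
        # so stuck instances do not accumulate.
        cleanup(vm_id)
    raise ProvisioningFailed(f"gave up after {max_attempts} attempts")
```

The `clock` and `sleep` parameters exist only so the loop can be tested without real waiting. In this sketch a client-side deadline well below the ~20 minutes reported above is what keeps a stuck VM from blocking scale-up.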