az aks command invoke: does not work if user nodes have taints

Open jetnet opened this issue 3 years ago • 11 comments

Describe the bug

Command Name az aks command invoke -n $AKS_NAME -c "kubectl cluster-info"

Errors:

(KubernetesOperationError) Failed to run command due to cluster perf issue, container command-0be71db980254f398cdecce07419fbed in aks-command namespace did not start within 30s on your cluster, retry may helps. If issue persist, you may need to tune your cluster with better performance (larger node/paid tier).
Code: KubernetesOperationError
Message: Failed to run command due to cluster perf issue, container command-0be71db980254f398cdecce07419fbed in aks-command namespace did not start within 30s on your cluster, retry may helps. If issue persist, you may need to tune your cluster with better performance (larger node/paid tier).

Event Message:

0/3 nodes are available: 1 node(s) had untolerated taint {agentpool: user}, 2 node(s) had untolerated taint {CriticalAddonsOnly: true}. preemption: 0/3 nodes are available: 3 Preemption is not helpful for scheduling.
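To make the event message easier to read, here is a rough Python sketch that pulls the untolerated taints and their node counts out of the text. The regular expression and the helper name `blocked_by_taint` are assumptions written for this particular message; the scheduler's wording can differ between Kubernetes versions, so treat it as illustrative only.

```python
import re

# Scheduler event text copied from the report above.
EVENT = (
    "0/3 nodes are available: "
    "1 node(s) had untolerated taint {agentpool: user}, "
    "2 node(s) had untolerated taint {CriticalAddonsOnly: true}. "
    "preemption: 0/3 nodes are available: "
    "3 Preemption is not helpful for scheduling."
)

# Assumed pattern: "<count> node(s) had untolerated taint {<key>: <value>}".
TAINT_RE = re.compile(
    r"(\d+) node\(s\) had untolerated taint \{([^:}]+): ([^}]*)\}"
)


def blocked_by_taint(event: str) -> dict:
    """Map each untolerated taint ("key=value") to the number of nodes it blocks."""
    return {
        f"{key.strip()}={value.strip()}": int(count)
        for count, key, value in TAINT_RE.findall(event)
    }


summary = blocked_by_taint(EVENT)
for taint, count in summary.items():
    print(f"{count} node(s) rejected the pod because of taint {taint}")
print(f"nodes blocked by taints: {sum(summary.values())}")
```

Read this way, the message says that every one of the three nodes rejected the pod because of a taint, which points at scheduling rather than at cluster performance as the cause of the 30s timeout.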

To Reproduce:

Steps to reproduce the behavior. Note that argument values have been redacted, as they may contain sensitive information.

  • create a user nodepool with a taint "agentpool=user:NoSchedule"
  • try to execute the command: az aks command invoke -n NAME -c "kubectl cluster-info"

Expected Behavior

aks command invoke should be able to start on system nodes with the default taint: CriticalAddonsOnly=true

Environment Summary

Linux-5.15.79.1-microsoft-standard-WSL2-x86_64-with, Alpine Linux v3.17
Python 3.10.9
Installer: PIP

azure-cli 2.44.1

Extensions:
account 0.2.5

Dependencies:
msal 1.20.0
azure-mgmt-resource 21.1.0b1

Additional Context

jetnet avatar Feb 02 '23 14:02 jetnet

route to CXP team

yonzhan avatar Feb 02 '23 15:02 yonzhan

@jetnet The underlying REST API for this command schedules a pod without any tolerations by default. Ideally, non-critical workloads should not be deployed on system nodes, because they could starve critical workloads of resources.

That being said, it would be best to create a feature request for toleration support to unblock situations like this one.

Since the Azure CLI itself doesn't control this behavior, nothing can be done on the CLI side; the command should gain this capability once the underlying REST API supports it.
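To illustrate why a pod without tolerations stays pending on the cluster from the report, here is a simplified Python model of Kubernetes taint and toleration matching. It is a sketch, not the scheduler's code: the node names are hypothetical, and the `NoSchedule` effect on the `CriticalAddonsOnly=true` taint is an assumption, since the report does not state the effect.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Taint:
    key: str
    value: str
    effect: str  # e.g. "NoSchedule"


@dataclass(frozen=True)
class Toleration:
    key: str = ""
    operator: str = "Equal"  # "Equal" or "Exists"
    value: str = ""
    effect: str = ""  # an empty effect matches every effect


def tolerates(tol: Toleration, taint: Taint) -> bool:
    """Simplified version of the Kubernetes toleration matching rules."""
    if tol.effect and tol.effect != taint.effect:
        return False
    if tol.operator == "Exists":
        # An empty key with "Exists" matches any taint key.
        return tol.key == "" or tol.key == taint.key
    return tol.key == taint.key and tol.value == taint.value


def schedulable(node_taints, tolerations) -> bool:
    """A node fits when every hard taint on it is tolerated by the pod."""
    hard = [t for t in node_taints if t.effect in ("NoSchedule", "NoExecute")]
    return all(any(tolerates(tol, t) for tol in tolerations) for t in hard)


# Hypothetical node names; taints follow the event message in the report.
NODES = {
    "user-0": [Taint("agentpool", "user", "NoSchedule")],
    "system-0": [Taint("CriticalAddonsOnly", "true", "NoSchedule")],  # effect assumed
    "system-1": [Taint("CriticalAddonsOnly", "true", "NoSchedule")],  # effect assumed
}


def fits(tolerations):
    """Return the names of the nodes a pod with these tolerations could land on."""
    return [name for name, taints in NODES.items() if schedulable(taints, tolerations)]


print(fits([]))  # pod without tolerations, as the command pod is described above
print(fits([Toleration(key="CriticalAddonsOnly", operator="Exists")]))
```

In this model the pod without tolerations fits no node, while a pod tolerating `CriticalAddonsOnly` fits the two system nodes, which is the behavior the reporter asks for.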

PramodValavala-MSFT avatar Feb 02 '23 19:02 PramodValavala-MSFT

Hi @jetnet. Thank you for opening this issue and giving us the opportunity to assist. We believe that this has been addressed. If you feel that further discussion is needed, please add a comment with the text “/unresolve” to remove the “issue-addressed” label and continue the conversation.

ghost avatar Feb 02 '23 19:02 ghost

@PramodValavala-MSFT, really appreciate your clarification. Should I create a feature request or are you going to do that? Thanks!

jetnet avatar Feb 03 '23 08:02 jetnet

Hi @jetnet, since you haven’t asked that we “/unresolve” the issue, we’ll close this out. If you believe further discussion is needed, please add a comment “/unresolve” to reopen the issue.

ghost avatar Feb 10 '23 10:02 ghost

/unresolve

I think it's an issue with the current implementation and NOT a feature request. Look, you cannot run az aks command invoke if your AKS user nodes have a taint. It's not OK. Please re-open. Thanks!

jetnet avatar Feb 13 '23 06:02 jetnet

Thanks for the feedback! We are routing this to the appropriate team for follow-up. cc @Azure/aks-pm.

Issue Details

Author: jetnet
Assignees: -
Labels: Service Attention, question, AKS, customer-reported, Service, needs-team-attention, Auto-Assign

Milestone: -

ghost avatar Mar 23 '23 16:03 ghost

@jetnet Apologies for the delay on this one! Since supporting this requires a service-side change, I will reassign this issue to the responsible team and share the feedback with them internally.

PramodValavala-MSFT avatar Mar 23 '23 16:03 PramodValavala-MSFT

Is there a workaround for this issue?

mjnovice avatar Sep 20 '23 15:09 mjnovice

Any updates?

mjnovice avatar Feb 01 '24 18:02 mjnovice

Any updates on this?

mjnovice avatar May 02 '24 00:05 mjnovice

@PramodValavala-MSFT any updates on this?

mjnovice avatar May 28 '24 20:05 mjnovice