assisted-installer

OCPBUGS-38466: Allow controller to continue when assisted-service is unavailable in agent-based installs.

Open rwsu opened this issue 1 year ago • 9 comments

assisted-service runs on the bootstrap node in agent-based installs. The bootstrap node reboots after the control plane is available.

If the assisted-installer-controller restarts after the bootstrap node reboots, or for some reason the controller is never able to contact assisted-service, the controller loops waiting for assisted-service to become available, times out, and exits.

With compact clusters, because the controller has exited and cannot approve CSRs, the third control plane node is unable to join the cluster, causing the cluster installation to fail.

If the invoker is agent-installer, instead of exiting, this patch allows the controller to continue to run when assisted-service is offline.

Because assisted-service may be unavailable, HasValidvSphereCredentials has been updated to also look at the install-config to determine if credentials were set. Because username and password are redacted, only the server name is used to determine if valid credentials were provided.
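
A minimal sketch of the server-name check, assuming a simplified view of the vSphere section of the install-config. The vsphereInfo type and hasServerName function are hypothetical and are not the actual HasValidvSphereCredentials implementation.

```go
package main

import (
	"fmt"
	"strings"
)

// vsphereInfo is a hypothetical, trimmed-down stand-in for the vSphere
// platform data read from the install-config.
type vsphereInfo struct {
	Server   string
	Username string // redacted in the install-config, so not inspected
	Password string // redacted in the install-config, so not inspected
}

// hasServerName (hypothetical) treats credentials as provided when a
// non-blank server name is present, because the other fields are redacted.
func hasServerName(v *vsphereInfo) bool {
	return v != nil && strings.TrimSpace(v.Server) != ""
}

func main() {
	withServer := &vsphereInfo{Server: "vcenter.example.com"}
	withoutServer := &vsphereInfo{}

	fmt.Println("with server:", hasServerName(withServer))
	fmt.Println("without server:", hasServerName(withoutServer))
	fmt.Println("no vSphere section:", hasServerName(nil))
}
```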

rwsu avatar Sep 03 '24 21:09 rwsu

@rwsu: This pull request references Jira Issue OCPBUGS-38466, which is invalid:

  • expected the bug to target the "4.18.0" version, but no target version was set

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

The bug has been updated to refer to the pull request using the external bug tracker.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

openshift-ci-robot avatar Sep 03 '24 21:09 openshift-ci-robot

/jira refresh

rwsu avatar Sep 03 '24 21:09 rwsu

@rwsu: This pull request references Jira Issue OCPBUGS-38466, which is valid. The bug has been moved to the POST state.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.18.0) matches configured target version for branch (4.18.0)
  • bug is in the state ASSIGNED, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact: /cc @mhanss

openshift-ci-robot avatar Sep 03 '24 22:09 openshift-ci-robot

Codecov Report

Attention: Patch coverage is 42.69663% with 51 lines in your changes missing coverage. Please review.

Project coverage is 57.56%. Comparing base (4635ddf) to head (65c541b). Report is 4 commits behind head on master.

Files with missing lines                               Patch %   Lines
src/common/common.go                                   54.38%    22 Missing and 4 partials :warning:
...ed-installer-controller/assisted_installer_main.go  0.00%     25 Missing :warning:

@@            Coverage Diff             @@
##           master     #896      +/-   ##
==========================================
+ Coverage   55.70%   57.56%   +1.85%     
==========================================
  Files          15       15              
  Lines        3208     3464     +256     
==========================================
+ Hits         1787     1994     +207     
- Misses       1249     1290      +41     
- Partials      172      180       +8     

Files with missing lines                               Coverage Δ
...taller_controller/assisted_installer_controller.go  75.91% <100.00%> (ø)
src/installer/installer.go                             75.64% <100.00%> (+6.50%) :arrow_up:
...ed-installer-controller/assisted_installer_main.go  27.10% <0.00%> (-2.50%) :arrow_down:
src/common/common.go                                   43.61% <54.38%> (-1.55%) :arrow_down:

codecov[bot] avatar Sep 03 '24 22:09 codecov[bot]

I know that this use case covers the scenario where the user is running an agent-based install and assisted-installer-controller restarts after the bootstrap node reboots.

Surely this scenario can affect other installation modes as well. Are there any mitigations, or is this out of scope here?

paul-maidment avatar Sep 08 '24 10:09 paul-maidment

I know that this use case covers the scenario where the user is running an agent-based install and assisted-installer-controller restarts after the bootstrap node reboots.

Surely this scenario can affect other installation modes as well. Are there any mitigations, or is this out of scope here?

@paul-maidment, I believe it is out of scope for non-ABI installs. The patch addresses a problem that is unique to ABI installs, because assisted-service is hosted on the bootstrap node and disappears after the bootstrap node reboots. The problem appears to occur rarely and so far has been reported by only a single customer. If assisted-installer-controller restarts after the bootstrap node has rebooted, it can never connect to the assisted-service API, and the controller fails hard without approving CSRs.

For SaaS, if assisted-installer-controller restarts, the controller should be able to reconnect to assisted-service. I presume assisted-service is accessible within the connection retry window if it somehow becomes unreachable for a short span of time. So SaaS is unaffected when the bootstrap node reboots.

rwsu avatar Sep 12 '24 20:09 rwsu

/retest-required

rwsu avatar Sep 20 '24 18:09 rwsu

@rwsu: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name                                 Commit                                    Details  Required  Rerun command
ci/prow/edge-e2e-metal-assisted-cnv-4-16  65c541b97c30a624878d538ab094485a4bcd24cf  link     false     /test edge-e2e-metal-assisted-cnv-4-16

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

openshift-ci[bot] avatar Sep 20 '24 20:09 openshift-ci[bot]

/lgtm

tsorya avatar Sep 25 '24 08:09 tsorya

/approve

tsorya avatar Oct 02 '24 14:10 tsorya

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: rwsu, tsorya

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment.
Approvers can cancel approval by writing /approve cancel in a comment.

openshift-ci[bot] avatar Oct 02 '24 14:10 openshift-ci[bot]

/cherry-pick release-4.16

rwsu avatar Oct 02 '24 14:10 rwsu

@rwsu: once the present PR merges, I will cherry-pick it on top of release-4.16 in a new PR and assign it to you.

/retest-required

Remaining retests: 0 against base HEAD 51dc0145b7b27654e0ed5249d27727d82fd8b6c2 and 2 for PR HEAD 65c541b97c30a624878d538ab094485a4bcd24cf in total

openshift-ci-robot avatar Oct 02 '24 16:10 openshift-ci-robot

@rwsu: Jira Issue OCPBUGS-38466: All pull requests linked via external trackers have merged:

Jira Issue OCPBUGS-38466 has been moved to the MODIFIED state.

openshift-ci-robot avatar Oct 02 '24 19:10 openshift-ci-robot

@rwsu: new pull request created: #914

[ART PR BUILD NOTIFIER]

Distgit: ose-agent-installer-orchestrator
This PR has been included in build ose-agent-installer-orchestrator-container-v4.18.0-202410022312.p0.gb1317ba.assembly.stream.el9. All builds following this will include this PR.

openshift-bot avatar Oct 02 '24 23:10 openshift-bot

[ART PR BUILD NOTIFIER]

Distgit: ose-agent-installer-csr-approver
This PR has been included in build ose-agent-installer-csr-approver-container-v4.18.0-202410022312.p0.gb1317ba.assembly.stream.el9. All builds following this will include this PR.

openshift-bot avatar Oct 02 '24 23:10 openshift-bot

/cherry-pick release-4.17

rwsu avatar Oct 07 '24 21:10 rwsu

@rwsu: new pull request created: #918
