OCPBUGS-38466: Allow controller to continue when assisted-service is unavailable in agent-based installs.
assisted-service runs on the bootstrap node in agent-based installs. The bootstrap node reboots after the control plane is available.
If the assisted-installer-controller restarts after the bootstrap node reboots, or if the controller is never able to contact assisted-service for some other reason, the controller loops waiting for assisted-service to become available, times out, and exits.
With compact clusters, because the controller has exited and cannot approve CSRs, the third control plane node is unable to join the cluster, causing the cluster installation to fail.
If the invoker is agent-installer, instead of exiting, this patch allows the controller to continue to run when assisted-service is offline.
Because assisted-service may be unavailable, HasValidvSphereCredentials has been updated to also look at the install-config to determine if credentials were set. Because username and password are redacted, only the server name is used to determine if valid credentials were provided.
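A rough sketch of the install-config fallback is shown below. It is not the actual `HasValidvSphereCredentials` implementation: the `vsphereConfig` type, the `hasValidServer` helper, and the placeholder value are assumptions made for illustration. The only behavior taken from the description is that the username and password are redacted, so the check relies on the server name alone.

```go
package main

import (
	"fmt"
	"strings"
)

// vsphereConfig is a hypothetical, simplified view of the vSphere
// section of the install-config. Username and Password are redacted
// in the copy the controller can read, so they cannot be validated.
type vsphereConfig struct {
	Server   string
	Username string
	Password string
}

// assumedPlaceholderServer is an assumed sentinel for "no real vCenter
// was configured"; the real value, if any, may differ.
const assumedPlaceholderServer = "vcenter-placeholder"

// hasValidServer (hypothetical) treats credentials as provided when the
// server name is non-empty and is not the assumed placeholder.
func hasValidServer(cfg *vsphereConfig) bool {
	if cfg == nil {
		return false
	}
	server := strings.TrimSpace(cfg.Server)
	return server != "" && server != assumedPlaceholderServer
}

func main() {
	real := &vsphereConfig{Server: "vcenter.example.com", Username: "", Password: ""}
	empty := &vsphereConfig{}

	fmt.Println("real server valid:", hasValidServer(real))
	fmt.Println("empty server valid:", hasValidServer(empty))
}
```

Checking only the server name is a weaker signal than validating the full credentials, but it is the only field that survives redaction.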
@rwsu: This pull request references Jira Issue OCPBUGS-38466, which is invalid:
- expected the bug to target the "4.18.0" version, but no target version was set
Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.
The bug has been updated to refer to the pull request using the external bug tracker.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.
/jira refresh
@rwsu: This pull request references Jira Issue OCPBUGS-38466, which is valid. The bug has been moved to the POST state.
3 validation(s) were run on this bug
- bug is open, matching expected state (open)
- bug target version (4.18.0) matches configured target version for branch (4.18.0)
- bug is in the state ASSIGNED, which is one of the valid states (NEW, ASSIGNED, POST)
Requesting review from QA contact: /cc @mhanss
Codecov Report
Attention: Patch coverage is 42.69663% with 51 lines in your changes missing coverage. Please review.
Project coverage is 57.56%. Comparing base (4635ddf) to head (65c541b). Report is 4 commits behind head on master.
| Files with missing lines | Patch % | Lines |
|---|---|---|
| src/common/common.go | 54.38% | 22 Missing and 4 partials :warning: |
| ...ed-installer-controller/assisted_installer_main.go | 0.00% | 25 Missing :warning: |
Additional details and impacted files
| Coverage Diff | master | #896 | +/- |
|---|---|---|---|
| Coverage | 55.70% | 57.56% | +1.85% |
| Files | 15 | 15 | |
| Lines | 3208 | 3464 | +256 |
| Hits | 1787 | 1994 | +207 |
| Misses | 1249 | 1290 | +41 |
| Partials | 172 | 180 | +8 |
| Files with missing lines | Coverage Δ | |
|---|---|---|
| ...taller_controller/assisted_installer_controller.go | 75.91% <100.00%> (ø) | |
| src/installer/installer.go | 75.64% <100.00%> (+6.50%) | :arrow_up: |
| ...ed-installer-controller/assisted_installer_main.go | 27.10% <0.00%> (-2.50%) | :arrow_down: |
| src/common/common.go | 43.61% <54.38%> (-1.55%) | :arrow_down: |
I understand that this change covers the scenario where the user is running an agent-based install and assisted-installer-controller restarts after the bootstrap node reboots.
Surely this scenario can affect other installation modes as well. Are there any mitigations, or is this out of scope here?
@paul-maidment, I believe it is out of scope for non-ABI installs. The patch addresses a problem that is unique to ABI installs, because assisted-service is hosted on the bootstrap node and disappears after the bootstrap node reboots. The problem appears to occur rarely and so far has been reported by only a single customer. If assisted-installer-controller restarts after the bootstrap node has rebooted, it can never connect to the assisted-service API, and the controller fails hard without approving CSRs.
For SaaS, if assisted-installer-controller restarts, the controller should be able to reconnect to assisted-service. I presume assisted-service is accessible within the connection retry window if it somehow becomes unreachable for a short span of time, so SaaS is unaffected when the bootstrap node reboots.
/retest-required
@rwsu: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:
| Test name | Commit | Details | Required | Rerun command |
|---|---|---|---|---|
| ci/prow/edge-e2e-metal-assisted-cnv-4-16 | 65c541b97c30a624878d538ab094485a4bcd24cf | link | false | /test edge-e2e-metal-assisted-cnv-4-16 |
Full PR test history. Your PR dashboard.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.
/lgtm
/approve
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: rwsu, tsorya
The full list of commands accepted by this bot can be found here.
The pull request process is described here
- ~~OWNERS~~ [tsorya]
Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment
/cherry-pick release-4.16
@rwsu: once the present PR merges, I will cherry-pick it on top of release-4.16 in a new PR and assign it to you.
/retest-required
Remaining retests: 0 against base HEAD 51dc0145b7b27654e0ed5249d27727d82fd8b6c2 and 2 for PR HEAD 65c541b97c30a624878d538ab094485a4bcd24cf in total
@rwsu: Jira Issue OCPBUGS-38466: All pull requests linked via external trackers have merged:
Jira Issue OCPBUGS-38466 has been moved to the MODIFIED state.
@rwsu: new pull request created: #914
[ART PR BUILD NOTIFIER]
Distgit: ose-agent-installer-orchestrator This PR has been included in build ose-agent-installer-orchestrator-container-v4.18.0-202410022312.p0.gb1317ba.assembly.stream.el9. All builds following this will include this PR.
[ART PR BUILD NOTIFIER]
Distgit: ose-agent-installer-csr-approver This PR has been included in build ose-agent-installer-csr-approver-container-v4.18.0-202410022312.p0.gb1317ba.assembly.stream.el9. All builds following this will include this PR.
/cherry-pick release-4.17
@rwsu: new pull request created: #918