infrared icon indicating copy to clipboard operation
infrared copied to clipboard

Suppress needless introspection re-attempts

Open juliakreger opened this issue 3 years ago • 2 comments

Introspection has undergone a number of improvements since the early days, including better alignment with Ironic itself, and ultimately fixes in underlying virtual machine configuration for when this tooling is used with VMs to help prevent some issues which do crop up when using VMs to pretend to be baremetal, specifically the enforcement of Spanning Tree.

When this legacy retry code gets invoked, a legitimate error has occured and needs to be investigated. We don't need this code thrasing the environment attempting to force the node through the process, as that just complicates troubleshooting.

This change adds inline notes, and explicitly excludes the retry if the version of OSP is >=16. If nodes have failed, the playbook now also fails the playbook run to enable the environment to be investigated as it failed as opposed to an artificially changed state.

juliakreger avatar Jun 01 '22 22:06 juliakreger

Right, we are aware of problematic introspection playbook which is being brought up from historical OSP7 times, where it compensated for buggy ironic state at that time and without the retries one basically could not pass introspection stage.

Otoh as introspection stage is critical stage common to all OSP deployments, past or current, afaik we can not affort such change introducing a risk to all CI deployments simultaneously. We plan to wait with such changes (also this one specifically, meaning introspection playbook refactoring) until IR gets osp-branched (which is WIP currently). Then we can finally draw a line between "legacy IR" and current IR state and start fixing all the techdebts in the code (without putting historical parts of CI at risk).

fhubik avatar Jun 16 '22 09:06 fhubik

IMHO, it is less of a critical stage component than most think. Kind of now only required in un-shipped OSP code for unrelated to deployment activities and for highly specific cases/data/tests (which have paths to navigate if data is not offered/discovered) in already shipped versions. Plus if we operate with the philosophy of fail fast, we will be able to better understand what is occurring leading to less wasted time when the jobs are reconfigured.

I think it is fine to wait, as long as we're aware and pay down this technical debt before we iterate again because it will save us time with each further iteration.

juliakreger avatar Jun 16 '22 10:06 juliakreger