
[epic] v1.0.0 Performance and Scale

Open joelanford opened this issue 1 year ago • 11 comments

Epic Goal

  • Measure and implement any necessary improvements to ensure OLM v1.0.0 meets or exceeds OCP guidelines around performance and scalability.

Why is this important?

  • OLM v1.0.0 will be a payload component that always runs in OCP clusters. In order to reduce SD and customer costs, we need to minimize this overhead.
  • OLM v1.0.0 is intended to be used on a wide variety of clusters, ranging from single node clusters with just a few namespaces to clusters 2-3 orders of magnitude larger. We need to make sure that it runs just as well on a small cluster as it does on a large cluster.
  • In order to reduce user frustration, we need to provide a responsive user experience. Reconciliation needs to be fast and non-blocking to ensure users receive the experience they have come to expect from OCP. To the extent possible, long-running tasks (e.g. catalog fetching/caching and image pulling) should be performed asynchronously.

Scenarios

  1. Collect pprof profiles for CPU and memory when running standard user flows around installing, upgrading, and removing operators from public catalogs (e.g. operatorhub).
  2. Find the most resource-intensive code paths. Provide documentation and recommendations related to making improvements in those areas.
  3. Coordinate with OLM maintainers to make improvements in the areas deemed to provide the most significant performance and scale gains.
  4. Implement automated performance and scale regression tests in the existing upstream CI test suite.
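
As a rough illustration of scenario 1, the sketch below captures a CPU profile and a heap profile around a single flow using Go's standard `runtime/pprof` package. The `profileFlow` helper and the stand-in workload are hypothetical names invented for this example; they are not part of operator-controller. A long-running controller would more typically be profiled through `net/http/pprof` endpoints, but whether and how those are exposed is not covered in this issue.

```go
package main

import (
	"bytes"
	"fmt"
	"runtime"
	"runtime/pprof"
)

// profileFlow (hypothetical helper) runs flow while a CPU profile is being
// recorded, then captures a heap profile once the flow has finished.
// Both profiles are returned in pprof's binary format.
func profileFlow(flow func()) ([]byte, []byte, error) {
	var cpuBuf, heapBuf bytes.Buffer

	if err := pprof.StartCPUProfile(&cpuBuf); err != nil {
		return nil, nil, fmt.Errorf("start CPU profile: %w", err)
	}
	flow()
	pprof.StopCPUProfile() // flushes the CPU profile into cpuBuf

	runtime.GC() // refresh heap statistics before writing the heap profile
	if err := pprof.WriteHeapProfile(&heapBuf); err != nil {
		return nil, nil, fmt.Errorf("write heap profile: %w", err)
	}
	return cpuBuf.Bytes(), heapBuf.Bytes(), nil
}

func main() {
	cpu, heap, err := profileFlow(func() {
		// Stand-in workload; a real run would drive an install,
		// upgrade, or removal flow here instead.
		sum := 0
		for i := 0; i < 50_000_000; i++ {
			sum += i
		}
		_ = sum
	})
	if err != nil {
		fmt.Println("profiling failed:", err)
		return
	}
	fmt.Printf("cpu profile: %d bytes, heap profile: %d bytes\n", len(cpu), len(heap))
}
```

The returned bytes can be written to files and inspected with `go tool pprof` to find the hot code paths mentioned in scenario 2.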

Examples of known areas for improvement include:

  • When reconciling a ClusterExtension to resolve a bundle from the criteria provided by a user, the reconciler should return a desired bundle within 100ms and allocate no more memory than the size of the catalog metadata for the named spec.packageName.
  • When the ClusterExtension reconciler does not have the contents of a resolved image bundle available, it does not block waiting for the image to be pulled and processed. Rather, it starts an asynchronous job, reports the pending image pull via the ClusterExtension status, and returns from reconcile.

joelanford avatar Jun 11 '24 14:06 joelanford

/assign

OchiengEd avatar Jun 18 '24 20:06 OchiengEd

I think I've found one unexpected slowdown: the bundle handler that converts a registry+v1 bundle to plain and then to helm. It takes 5s on my machine in the "Force upgrade" e2e test.

joelanford avatar Jul 14 '24 03:07 joelanford

@OchiengEd just wanted to check to make sure you didn't find any critical (as in "must fix for 1.0.0") issues in your performance and scale research?

If not, can we move the remaining scope of this epic to v1.x?

joelanford avatar Aug 14 '24 17:08 joelanford

Item to include in performance and scale (although maybe more of a release blocker based on other discussions):

  • [ ] https://github.com/operator-framework/operator-controller/issues/1025 (optimization)

EDIT: After discussion in the community meeting, this issue is not in scope for this epic.

everettraven avatar Aug 20 '24 14:08 everettraven

No critical issues were identified. This epic was slated to be moved to 1.x.

OchiengEd avatar Oct 24 '24 15:10 OchiengEd

Let's reassess this and identify the acceptance criteria for this epic.

LalatenduMohanty avatar Oct 29 '24 15:10 LalatenduMohanty

We should remodel this issue to design the assessment/reporting infrastructure, with an MVP implementation. However, it should include the scope of the new catalogd query web API discussed in #1607. We can create subsequent epics to give us measurable progress and continue to refine the implementation.

grokspawn avatar Jan 27 '25 03:01 grokspawn

Next step is to call a meeting and agree on the things we want to achieve with this epic. cc @dtfranz

LalatenduMohanty avatar Jan 28 '25 16:01 LalatenduMohanty

From the committee meeting, this is analogous to work that we're doing w.r.t. feature-gates, where we have

  1. a general framework which provides for measurement/assessment/detection capabilities in a standardized approach; and
  2. features which hook into the framework to inform their measurement/assessment/detection functionality, starting with #1607

grokspawn avatar Jan 28 '25 16:01 grokspawn

Latest Update:

To begin with, we'll be adding some instrumentation to the repository to enable us to begin collecting metrics and detecting alerts. Please see RFC here. ~~Once consensus is reached I'll create sub-issues so that we can begin work.~~

dtfranz avatar Mar 21 '25 00:03 dtfranz

I've moved this issue into "Needs Docs", as it has now been implemented. To view a performance snapshot, check the action summary for e2e or e2e-experimental. Additionally, e2e tests may be marked as failed by Prometheus alerts. If this happens, the alerts are printed at the top of the test summary.

dtfranz avatar Sep 02 '25 01:09 dtfranz