DPTP-2938: Allow pod scaler to reduce resources based on past 3 weeks of usage data

Open deepsm007 opened this issue 1 month ago • 16 comments

Add flags to enable/disable authoritative mode for CPU and memory resource requests separately. The pod-scaler can now decrease resource requests when authoritative mode is enabled (default: true for both), allowing gradual resource reduction based on recent usage data with safety limits (max 25% per cycle).
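As a rough illustration of the two separate toggles described above, here is a minimal sketch using Go's standard `flag` package. The flag names `authoritative-cpu` and `authoritative-memory`, the `options` struct, and the helper functions are assumptions made for this example, not the names used in the PR; the defaults follow the description above (true for both).

```go
package main

import (
	"flag"
	"fmt"
)

// options holds the two toggles. The struct itself is hypothetical;
// only the idea of separate CPU and memory switches comes from the PR.
type options struct {
	authoritativeCPU    bool
	authoritativeMemory bool
}

// bindOptions registers both flags on fs. Flag names are assumptions.
func bindOptions(fs *flag.FlagSet) *options {
	o := &options{}
	fs.BoolVar(&o.authoritativeCPU, "authoritative-cpu", true,
		"allow the pod-scaler to lower CPU requests")
	fs.BoolVar(&o.authoritativeMemory, "authoritative-memory", true,
		"allow the pod-scaler to lower memory requests")
	return o
}

// parseOptions parses args into a fresh options value.
func parseOptions(args []string) (*options, error) {
	fs := flag.NewFlagSet("pod-scaler", flag.ContinueOnError)
	o := bindOptions(fs)
	if err := fs.Parse(args); err != nil {
		return nil, err
	}
	return o, nil
}

func main() {
	// Disable CPU reductions while leaving memory reductions enabled.
	o, err := parseOptions([]string{"--authoritative-cpu=false"})
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("cpu=%t memory=%t\n", o.authoritativeCPU, o.authoritativeMemory)
}
```

Keeping the two switches independent means one resource can stay in increase-only mode while reductions are trialled on the other.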

/cc @openshift/test-platform

deepsm007 avatar Jan 06 '26 13:01 deepsm007

Pipeline controller notification This repo is configured to use the pipeline controller. Second-stage tests will be triggered either automatically or after lgtm label is added, depending on the repository configuration. The pipeline controller will automatically detect which contexts are required and will utilize /test Prow commands to trigger the second stage.

For optional jobs, comment /test ? to see a list of all defined jobs. To trigger manually all jobs from second stage use /pipeline required command.

This repository is configured in: automatic mode

openshift-ci-robot avatar Jan 06 '26 13:01 openshift-ci-robot

Walkthrough

Adds time-windowed (3-week) recommendation filtering with minimum-sample guards and minimum CPU/memory floors; replaces recommendation logic with authoritative-aware, gradual reductions (ignore <5% changes, cap 25% per cycle); extends admission/mutation APIs and flags; updates tests, fixtures, e2e env and logging.
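The reduction rule summarized above (ignore changes under 5%, cap reductions at 25% per cycle) can be sketched roughly as follows. The function name `nextRequest` and its signature are hypothetical, and the sketch assumes the thresholds apply only to reductions while increases are taken as recommended; the PR's actual logic may differ.

```go
package main

import "fmt"

// nextRequest returns the request to apply given the current value and a
// recommendation, both in the same unit (for example millicores or bytes).
// Assumption: increases are always applied; reductions need authoritative mode.
func nextRequest(current, recommended int64, authoritative bool) int64 {
	if current <= 0 || recommended >= current {
		return recommended
	}
	if !authoritative {
		return current // reductions are not allowed
	}
	diff := current - recommended
	if diff*100 < current*5 {
		return current // reduction under 5%: not worth the churn
	}
	if diff*100 > current*25 {
		return current - current*25/100 // cap the step at 25% per cycle
	}
	return recommended
}

func main() {
	fmt.Println(nextRequest(1000, 400, true))   // 750: capped at 25%
	fmt.Println(nextRequest(1000, 970, true))   // 1000: 3% change ignored
	fmt.Println(nextRequest(1000, 900, true))   // 900: applied as recommended
	fmt.Println(nextRequest(1000, 400, false))  // 1000: not authoritative
	fmt.Println(nextRequest(1000, 1200, false)) // 1200: increase applied
}
```

With a 25% cap, a request that is far too high converges over several cycles rather than dropping in one step, which limits the damage if the recent usage data turns out to be unrepresentative.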

Changes

Cohort / File(s) Summary
Admission & Mutation
cmd/pod-scaler/admission.go, cmd/pod-scaler/admission_test.go, cmd/pod-scaler/main.go, cmd/pod-scaler/testdata/...
Added authoritativeCPU and authoritativeMemory flags/fields and propagated them through admit, podMutator, and mutatePodResources; renamed useOursIfLarger to applyRecommendationsBasedOnRecentData; implemented authoritative-aware reduction rules, adjusted logging, and updated tests/fixtures.
Resource Aggregation & Rules
cmd/pod-scaler/resources.go, cmd/pod-scaler/resources_test.go
Added resourceRecommendationWindow (3 weeks), minimums (minCPURequestMilli, minMemoryRequestBytes) and env-driven minSamplesForRecommendation; digestData now filters/weights recent fingerprints, skips when insufficient recent data, applies minima, and logs decisions.
Tests & Fixtures
cmd/pod-scaler/resources_test.go, cmd/pod-scaler/admission_test.go, cmd/pod-scaler/testdata/...
New tests for recency filtering and skipping when no recent data; updated admission tests for new API and behavior; fixtures updated to typed quantities and revised expected resource values.
E2E & Env
test/e2e/pod-scaler/run/consumer.go, test/e2e/pod-scaler/run/producer.go
Injected POD_SCALER_MIN_SAMPLES=1 into pod-scaler process environment for e2e runs.
Misc
.gitignore
Minor gitignore update to include /pod-scaler.
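The recency filtering and minimum-sample guard listed for resources.go can be sketched like this. The names `resourceRecommendationWindow`, `minCPURequestMilli`, and `POD_SCALER_MIN_SAMPLES` appear in the summary above; the `sample` type, the helper functions, the floor value, and the default sample count are assumptions. The sketch also takes the peak of recent samples for simplicity and does not attempt the weighting that the summary mentions.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
	"time"
)

const (
	resourceRecommendationWindow       = 21 * 24 * time.Hour // 3 weeks
	minCPURequestMilli           int64 = 10                  // assumed floor value
	defaultMinSamples                  = 3                   // assumed default
)

// sample is a hypothetical stand-in for one recorded usage data point.
type sample struct {
	observed time.Time
	cpuMilli int64
}

// exampleNow is a fixed reference time so the example is deterministic.
var exampleNow = time.Date(2026, time.January, 6, 0, 0, 0, 0, time.UTC)

// minSamplesFromEnv reads POD_SCALER_MIN_SAMPLES, falling back to the default.
func minSamplesFromEnv() int {
	if raw := os.Getenv("POD_SCALER_MIN_SAMPLES"); raw != "" {
		if n, err := strconv.Atoi(raw); err == nil && n > 0 {
			return n
		}
	}
	return defaultMinSamples
}

// recommendCPU considers only samples inside the window. It reports false
// when too few recent samples exist, so the caller can skip the workload.
func recommendCPU(samples []sample, now time.Time, minSamples int) (int64, bool) {
	cutoff := now.Add(-resourceRecommendationWindow)
	recent := 0
	var peak int64
	for _, s := range samples {
		if s.observed.Before(cutoff) {
			continue // older than the window: ignored
		}
		recent++
		if s.cpuMilli > peak {
			peak = s.cpuMilli
		}
	}
	if recent < minSamples {
		return 0, false
	}
	if peak < minCPURequestMilli {
		peak = minCPURequestMilli // never recommend below the floor
	}
	return peak, true
}

func main() {
	samples := []sample{
		{exampleNow.AddDate(0, 0, -40), 900}, // outside the window
		{exampleNow.AddDate(0, 0, -10), 300},
		{exampleNow.AddDate(0, 0, -2), 5},
	}
	fmt.Println(recommendCPU(samples, exampleNow, 2))
	fmt.Println("min samples from env:", minSamplesFromEnv())
}
```

Setting `POD_SCALER_MIN_SAMPLES=1`, as the e2e runs do, lets a single recent data point produce a recommendation, which suits short-lived test fixtures.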

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes


coderabbitai[bot] avatar Jan 06 '26 13:01 coderabbitai[bot]

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: deepsm007

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment Approvers can cancel approval by writing /approve cancel in a comment

openshift-ci[bot] avatar Jan 06 '26 13:01 openshift-ci[bot]

@deepsm007: This pull request references DPTP-2938 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.22.0" version, but no target version was set.

In response to this:

  • Time-based filtering to use only the past 3 weeks of data
  • Logic to allow resource reductions based on recent usage

/cc @openshift/test-platform

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

openshift-ci-robot avatar Jan 06 '26 13:01 openshift-ci-robot

Scheduling required tests: /test e2e

Scheduling tests matching the pipeline_run_if_changed or not excluded by pipeline_skip_if_only_changed parameters: /test integration-optional-test

openshift-ci-robot avatar Jan 06 '26 15:01 openshift-ci-robot

@deepsm007: This pull request references DPTP-2938 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.22.0" version, but no target version was set.

In response to this:

  • Time-based filtering to use only the past 3 weeks of data
  • Logic to allow resource reductions based on recent usage
  • Adds pod classification system that labels pods as "normal" or "measured" based on whether they need fresh resource measurement data (measured if last measurement >10 days ago or never measured)
  • Implements podAntiAffinity rules to ensure measured pods run on isolated nodes with no other CI workloads, allowing accurate CPU/memory utilization measurement without node contention
  • Integrates BigQuery client to query and cache max CPU/memory utilization from measured pod runs, refreshing daily to keep data current
  • Applies measured resource recommendations only to the longest-running container in each pod, using actual utilization data instead of Prometheus metrics that may be skewed by node contention
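The "normal" versus "measured" classification in the list above could look roughly like the following. The labels and the 10-day threshold come from the description; the function name `classifyPod`, the constant name, and the use of a zero `time.Time` to mean "never measured" are assumptions for this sketch.

```go
package main

import (
	"fmt"
	"time"
)

// measurementMaxAge is the staleness threshold from the description (10 days).
const measurementMaxAge = 10 * 24 * time.Hour

// exampleNow is a fixed reference time so the example is deterministic.
var exampleNow = time.Date(2026, time.January, 6, 0, 0, 0, 0, time.UTC)

// classifyPod returns "measured" when the pod needs fresh measurement data:
// either it was never measured (zero time, an assumption of this sketch)
// or its last measurement is more than 10 days old. Otherwise "normal".
func classifyPod(lastMeasured, now time.Time) string {
	if lastMeasured.IsZero() || now.Sub(lastMeasured) > measurementMaxAge {
		return "measured"
	}
	return "normal"
}

func main() {
	fmt.Println(classifyPod(time.Time{}, exampleNow))                 // measured
	fmt.Println(classifyPod(exampleNow.AddDate(0, 0, -11), exampleNow)) // measured
	fmt.Println(classifyPod(exampleNow.AddDate(0, 0, -3), exampleNow))  // normal
}
```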

/cc @openshift/test-platform

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

openshift-ci-robot avatar Jan 06 '26 15:01 openshift-ci-robot

@deepsm007: This pull request references DPTP-2938 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.22.0" version, but no target version was set.

In response to this:

  • Time-based filtering to use only the past 3 weeks of data
  • Logic to allow resource reductions based on recent usage

/cc @openshift/test-platform

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

openshift-ci-robot avatar Jan 06 '26 16:01 openshift-ci-robot

/pipeline required

deepsm007 avatar Jan 08 '26 14:01 deepsm007

Scheduling required tests: /test e2e

Scheduling tests matching the pipeline_run_if_changed or not excluded by pipeline_skip_if_only_changed parameters: /test integration-optional-test

openshift-ci-robot avatar Jan 08 '26 14:01 openshift-ci-robot

/test images

deepsm007 avatar Jan 08 '26 14:01 deepsm007

Scheduling required tests: /test e2e

Scheduling tests matching the pipeline_run_if_changed or not excluded by pipeline_skip_if_only_changed parameters: /test integration-optional-test

openshift-ci-robot avatar Jan 09 '26 14:01 openshift-ci-robot

Scheduling required tests: /test e2e

Scheduling tests matching the pipeline_run_if_changed or not excluded by pipeline_skip_if_only_changed parameters: /test integration-optional-test

openshift-ci-robot avatar Jan 09 '26 14:01 openshift-ci-robot

@deepsm007: This pull request references DPTP-2938 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.22.0" version, but no target version was set.

In response to this:

Add flags to enable/disable authoritative mode for CPU and memory resource requests separately. The pod-scaler can now decrease resource requests when authoritative mode is enabled (default: true for both), allowing gradual resource reduction based on recent usage data with safety limits (max 25% per cycle).

/cc @openshift/test-platform

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

openshift-ci-robot avatar Jan 12 '26 18:01 openshift-ci-robot

/hold This defaults to "on" right now for authoritativeCPU. We know there is a significant issue here until measurements do not include pods running on CPU-pressured worker nodes.

jupierce avatar Jan 12 '26 19:01 jupierce

/hold This defaults to "on" right now for authoritativeCPU. We know there is a significant issue here until measurements do not include pods running on CPU-pressured worker nodes.

Updated the default to false ("off").

deepsm007 avatar Jan 12 '26 19:01 deepsm007

@deepsm007: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| ci/prow/breaking-changes | 76076807b0342d90d4c4d450006b33fba0dfcb9d | link | false | /test breaking-changes |
| ci/prow/images | 76076807b0342d90d4c4d450006b33fba0dfcb9d | link | true | /test images |

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

openshift-ci[bot] avatar Jan 12 '26 20:01 openshift-ci[bot]