Autoscaler tests should be runnable under chaos
/area autoscale
Describe the feature
The autoscaler tests should run with chaosduck enabled.
We disabled this for now because it causes flakes (see https://github.com/knative/serving/pull/10928). In real systems, however, pods do crash and get upgraded, so we should (a) have tests for this and (b) actually be resilient to it, maintaining correct behaviour as far as possible.
Ideally, this will be achieved by making the autoscaler HA (which should largely be done via #2930), so that guarantees hold even when the autoscaler restarts. In some cases we may need to relax test guarantees to account for permitted/expected behaviour under chaos. In the worst case we may need to move some tests to a suite that does not run with chaos enabled (ideally not, though).
Here is a potential plan (I will update it or spin out separate issues based on the discussion here, if folks have better ideas):
- [x] Stop the bleeding: turn off chaos in main suite so we know any new flakes are not due to this. https://github.com/knative/serving/pull/10928
- [ ] Investigate existing flakes in test/ha suite (eg https://prow.knative.dev/view/gcs/knative-prow/logs/ci-knative-serving-nightly-release/1368491671720824832 from nightlies)
- [ ] Check the most flaky tests - TestAutoscaleSustaining, TestRPSBasedAutoscaleUpCountPods - to see why they flake. Do we expect flakes even with HA enabled given our current contract? Fix tests if possible, or fix HA if we think we can fix it with better HA.
- [ ] Run variants of the above tests in the test/ha suite where we explicitly kill things in the test rather than relying on chaos to do it. This should establish whether the tests will work under chaos.
- [ ] Enable Chaos again!
- [ ] If the above step is not possible (e.g. if there are guarantees we want to test even though we do not expect them to hold under chaos), create a separate suite for these tests only. Ideally we do not do this, because we have either fixed everything to be actually HA or relaxed the tests to reflect the real guarantees.
- [ ] Enable Chaos again!
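
To make the "explicitly kill things in the test" step above more concrete, here is a rough sketch of the shape such a test helper could take. It is not taken from the existing test/ha suite: `killFunc`, `runWithKills`, and `demo` are hypothetical names, and the kill function here is a stand-in that only records the call. A real test would presumably delete an autoscaler pod through a Kubernetes client and use the existing autoscale assertions as the check.

```go
package main

import (
	"context"
	"sync"
	"time"
)

// killFunc is a hypothetical hook that disrupts one instance of a component.
// A real implementation might delete a pod; this sketch does not assume how.
type killFunc func(ctx context.Context, component string) error

// runWithKills runs check while calling kill on component every interval.
// It returns the number of successful kills and the first error seen.
func runWithKills(ctx context.Context, kill killFunc, component string,
	interval time.Duration, check func(context.Context) error) (int, error) {
	ctx, cancel := context.WithCancel(ctx)
	defer cancel()

	var (
		wg      sync.WaitGroup
		kills   int
		killErr error
	)
	wg.Add(1)
	go func() {
		defer wg.Done()
		ticker := time.NewTicker(interval)
		defer ticker.Stop()
		for {
			select {
			case <-ctx.Done():
				return
			case <-ticker.C:
				if err := kill(ctx, component); err != nil {
					// Ignore failures caused by the check finishing first.
					if ctx.Err() == nil {
						killErr = err
					}
					return
				}
				kills++
			}
		}
	}()

	checkErr := check(ctx)
	cancel()  // stop the killer once the assertion is done
	wg.Wait() // after this, kills and killErr are safe to read
	if checkErr != nil {
		return kills, checkErr
	}
	return kills, killErr
}

// demo wires runWithKills to stand-ins: a kill function that only signals
// that it was called, and a check that passes once three kills were observed.
func demo() (int, error) {
	killed := make(chan string)
	kill := func(ctx context.Context, component string) error {
		select {
		case killed <- component:
			return nil
		case <-ctx.Done():
			return ctx.Err()
		}
	}
	check := func(ctx context.Context) error {
		for i := 0; i < 3; i++ {
			select {
			case <-killed:
			case <-ctx.Done():
				return ctx.Err()
			}
		}
		return nil
	}
	return runWithKills(context.Background(), kill, "autoscaler", 10*time.Millisecond, check)
}
```

Unlike relying on chaos, the test controls when and how often the disruption happens, so a failure can be tied to a specific kill instead of to whatever chaos happened to do during the run.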
/assign @markusthoemmes @vagababov @yanweiguo for thoughts
/triage accepted
This issue is stale because it has been open for 90 days with no
activity. It will automatically close after 30 more days of
inactivity. Reopen the issue with /reopen. Mark the issue as
fresh by adding the comment /remove-lifecycle stale.
/remove-lifecycle stale