
tests: parametrize benchmark tests

Open cm-iwata opened this issue 1 year ago • 2 comments

In the previous implementation, it was necessary to adjust the timeout value every time a benchmark test was added. By parametrizing the benchmark tests, the time required for each test becomes predictable, eliminating the need to adjust the timeout value.

Changes

Parametrize the tests by the list of criterion benchmarks.
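
A minimal sketch of what such parametrization could look like (the benchmark list and test body below are placeholders, not the actual Firecracker test code):

```python
# Sketch only: parametrize the A/B regression test over criterion benchmark names.
import pytest

# Assumed, hard-coded list; the real change derives the parameters from the
# criterion benchmarks defined in the repository.
CRITERION_BENCHMARKS = ["page_fault", "queue_add_used_16", "serialize_cpu_template"]

@pytest.mark.parametrize("benchname", CRITERION_BENCHMARKS)
def test_no_regression_relative_to_target_branch(benchname):
    # One benchmark per test case, so each case has a predictable runtime
    # and a single timeout value fits all of them.
    ...
```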

By parametrizing the tests, git clone is executed once for each parameter here:

https://github.com/firecracker-microvm/firecracker/blob/70ac1544d3bb460c47d93132f59a9e35cd5ad3da/tests/framework/ab_test.py#L85-L98

Running all parametrized tests with a single git clone would require major revisions to git_ab_test, so this PR does not address that issue.
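
For illustration only, the per-parameter overhead looks roughly like this; the function name, arguments, and clone invocation are assumptions, not the real git_ab_test API:

```python
# Hypothetical illustration: each parametrized test invokes the A/B helper,
# which sets up its own shallow clone of the target branch.
import subprocess
import tempfile

def ab_test_one_benchmark(benchname: str, target_branch: str = "main") -> None:
    with tempfile.TemporaryDirectory() as workdir:
        # One clone per parameter -- the overhead the description above notes
        # but intentionally does not remove.
        subprocess.run(
            ["git", "clone", "--depth", "1", "--branch", target_branch,
             "https://github.com/firecracker-microvm/firecracker", workdir],
            check=True,
        )
        # ... build and run the single criterion benchmark in both the cloned
        # tree (baseline) and the current tree, then compare the results ...
```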

Reason

close #4832

License Acceptance

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license. For more information on following Developer Certificate of Origin and signing off your commits, please check CONTRIBUTING.md.

PR Checklist

  • [x] I have read and understand CONTRIBUTING.md.
  • [x] I have run tools/devtool checkstyle to verify that the PR passes the automated style checks.
  • [x] I have described what is done in these changes, why they are needed, and how they are solving the problem in a clear and encompassing way.
  • [ ] I have updated any relevant documentation (both in code and in the docs) in the PR.
  • [ ] I have mentioned all user-facing changes in CHANGELOG.md.
  • [x] If a specific issue led to this PR, this PR closes the issue.
  • [ ] When making API changes, I have followed the Runbook for Firecracker API changes.
  • [x] I have tested all new and changed functionalities in unit tests and/or integration tests.
  • [ ] I have linked an issue to every new TODO.

  • [ ] This functionality cannot be added in rust-vmm.

cm-iwata avatar Dec 30 '24 06:12 cm-iwata

Codecov Report

All modified and coverable lines are covered by tests :white_check_mark:

Project coverage is 83.06%. Comparing base (ae078ee) to head (d1226ab). Report is 136 commits behind head on main.

:exclamation: Current head d1226ab differs from pull request most recent head 420a113

Please upload reports for the commit 420a113 to get more accurate results.

Additional details and impacted files
@@           Coverage Diff           @@
##             main    #4974   +/-   ##
=======================================
  Coverage   83.06%   83.06%           
=======================================
  Files         250      250           
  Lines       26897    26897           
=======================================
  Hits        22342    22342           
  Misses       4555     4555           
Flag Coverage Δ
5.10-c5n.metal 83.56% <ø> (ø)
5.10-m5n.metal 83.56% <ø> (-0.01%) :arrow_down:
5.10-m6a.metal 82.79% <ø> (-0.01%) :arrow_down:
5.10-m6g.metal 79.34% <ø> (ø)
5.10-m6i.metal 83.55% <ø> (-0.01%) :arrow_down:
5.10-m7a.metal-48xl 82.77% <ø> (-0.01%) :arrow_down:
5.10-m7g.metal 79.34% <ø> (ø)
5.10-m7i.metal-24xl 83.51% <ø> (-0.01%) :arrow_down:
5.10-m7i.metal-48xl 83.52% <ø> (ø)
5.10-m8g.metal-24xl 79.34% <ø> (ø)
5.10-m8g.metal-48xl 79.34% <ø> (ø)
6.1-c5n.metal 83.61% <ø> (ø)
6.1-m5n.metal 83.61% <ø> (ø)
6.1-m6a.metal 82.84% <ø> (+<0.01%) :arrow_up:
6.1-m6g.metal 79.34% <ø> (ø)
6.1-m6i.metal 83.60% <ø> (+<0.01%) :arrow_up:
6.1-m7a.metal-48xl 82.82% <ø> (ø)
6.1-m7g.metal 79.34% <ø> (ø)
6.1-m7i.metal-24xl 83.61% <ø> (-0.01%) :arrow_down:
6.1-m7i.metal-48xl 83.62% <ø> (ø)
6.1-m8g.metal-24xl 79.34% <ø> (ø)
6.1-m8g.metal-48xl 79.34% <ø> (ø)

Flags with carried forward coverage won't be shown.

codecov[bot] avatar Jan 08 '25 09:01 codecov[bot]

Mh, it would seem that it's still compiling everything multiple times judging by the test runtimes, and that's why it's timing out? :thinking:

roypat avatar Apr 28 '25 14:04 roypat

Mh, it would seem that it's still compiling everything multiple times judging by the test runtimes, and that's why it's timing out? 🤔

@roypat Sorry for the late reply. After investigation, I found that the timeout threshold was too short. Let's take the following build failure log as an example.

https://buildkite.com/firecracker/firecracker-pr-optional/builds/11752#01967b0d-070f-4a47-ab7c-8aa74ec201ba

That build's log contains the following timing breakdown:

419.83s call integration_tests/performance/test_benchmarks.py::TestBenchMarks::test_no_regression_relative_to_target_branch[page_fault]
178.50s call integration_tests/performance/test_benchmarks.py::TestBenchMarks::test_no_regression_relative_to_target_branch[queue_add_used_16]
33.90s call integration_tests/performance/test_benchmarks.py::TestBenchMarks::test_no_regression_relative_to_target_branch[queue_add_used_256]
31.28s call integration_tests/performance/test_benchmarks.py::TestBenchMarks::test_no_regression_relative_to_target_branch[queue_pop_16]
31.24s call integration_tests/performance/test_benchmarks.py::TestBenchMarks::test_no_regression_relative_to_target_branch[request_parse]
30.96s call integration_tests/performance/test_benchmarks.py::TestBenchMarks::test_no_regression_relative_to_target_branch[next_descriptor_16]
24.40s call integration_tests/performance/test_benchmarks.py::TestBenchMarks::test_no_regression_relative_to_target_branch[deserialize_cpu_template]
20.43s call integration_tests/performance/test_benchmarks.py::TestBenchMarks::test_no_regression_relative_to_target_branch[serialize_cpu_template]
0.17s setup integration_tests/performance/test_benchmarks.py::TestBenchMarks::test_no_regression_relative_to_target_branch[page_fault]
0.09s call integration_tests/functional/test_cpu_template_helper.py::test_guest_cpu_config_change
  • First, the page_fault parametrized test runs. Since it is the first test to run, it starts the build of the benchmark binaries.
  • The build does not complete within the 420-second timeout, so the test fails and the build is interrupted.
  • Next, queue_add_used_16 runs.
  • Since the binaries have not yet been fully built, cargo builds them again, but the build cache left over from the page_fault run makes this build relatively quick, so queue_add_used_16 completes in 178.5s.
  • Next, queue_add_used_256 runs. Because the binaries are now pre-built, it completes in about 30s, and the subsequent tests behave similarly (see the sketch after this list).
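
As a rough illustration of this behaviour (the helper name and cargo invocation are assumptions, not the actual framework code), each parametrized case effectively invokes one criterion benchmark; only the first invocation pays for compilation, later ones reuse the artifacts in target/:

```python
# Sketch only: run a single criterion benchmark by name.
import subprocess

def run_criterion_benchmark(benchname: str, repo_dir: str) -> None:
    # Passing the name after `--` lets criterion filter to one benchmark; the
    # exact invocation used by the test framework may differ. The first call
    # triggers a full compile, later calls hit the build cache in target/.
    subprocess.run(["cargo", "bench", "--", benchname], cwd=repo_dir, check=True)
```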

I was able to reproduce the same issue locally, so I increased the timeout threshold. Could you run CI again?
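
For reference, a hedged sketch of what raising a per-test timeout can look like, assuming the pytest-timeout plugin; the actual mechanism and value used in the Firecracker test suite may differ:

```python
# Sketch only: the timeout must also cover the cargo build that the first
# parametrized case pays for.
import pytest

@pytest.mark.timeout(900)  # assumed value, not the one used in the PR
@pytest.mark.parametrize("benchname", ["page_fault"])  # placeholder parameter
def test_no_regression_relative_to_target_branch(benchname):
    ...
```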

cm-iwata avatar Jun 22 '25 13:06 cm-iwata

We decided to remove the automated criterion benchmark runs (#5273), as their flakiness has been causing more pain than they've been useful. So this PR is no longer relevant.

Manciukic avatar Jun 26 '25 09:06 Manciukic