
When is criterion stable between runs, how can it be made more so?

Open · rrnewton opened this issue 9 years ago · 1 comment

Criterion gives highly precise measurements. Given two measurements A1 and A2 of a simple microbenchmark A, if:

  • both were taken starting from a similar machine state,
  • both report very high R² values, and
  • both were run with a long time limit (-L 20 or higher),

then A1 and A2 should be close estimates, right? Unfortunately, no.

When measuring 1279 benchmarks on Stackage, we have found that it's very common to see greater than 10% variation between consecutive runs of the same small, deterministic benchmark.

Anecdotally, we seem to get more stable numbers from individual high --iters runs than from linear regression. I don't have a good explanation yet. Perhaps the non-determinism in the selection of data points (on the X axis, i.e. which iteration counts get sampled) has more of an effect than we expected? Certainly, when there is a bad R², we've seen that exactly where the run starts has a big effect.
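As background for why a single long --iters run can be accurate at all, here is a toy measurement model (a sketch with made-up numbers, not criterion's actual implementation or real measured costs): if each timed batch carries a fixed overhead, the naive total-time-divided-by-iterations estimate converges to the true per-iteration cost as the iteration count grows, which is the regime a high --iters run operates in.

```python
# Toy model of one timed batch (hypothetical numbers, not real criterion data):
# a batch of n iterations costs a fixed overhead plus n times the true cost.
OVERHEAD = 2.0e-6   # assumed fixed per-batch measurement overhead, in seconds
PER_ITER = 1.0e-7   # assumed true cost of one iteration, in seconds

def batch_time(n):
    """Total measured time for a batch of n iterations."""
    return OVERHEAD + n * PER_ITER

def mean_estimate(n):
    """Per-iteration estimate from a single run: total time / iterations."""
    return batch_time(n) / n

def relative_error(n):
    """Relative error of the single-run estimate vs the true cost."""
    return (mean_estimate(n) - PER_ITER) / PER_ITER

# The fixed overhead dominates short runs and vanishes for long ones:
# relative_error(100)   ≈ 0.2    (20% error)
# relative_error(10**6) ≈ 2e-05  (0.002% error)
```

Linear regression exists precisely to cancel that fixed overhead from many short batches; the puzzle in this issue is that, in practice, it still seems to vary more between runs than the long-run mean does.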

@RyanGlScott and @vollmerm have been working on this.

(On a related note, it would be great to have some assistance when using criterion with the kinds of things Krun controls, like waiting for the machine to cool down to a baseline temperature before starting a run.)

rrnewton avatar Aug 16 '16 14:08 rrnewton

> we seem to get more stable numbers from individual high --iters runs than from linear regression.

This matches my experience. I've done a simple experiment as an example: 60 benchmarks of `bash -c "a=0; for i in {1..500000}; do (( a += RANDOM )); done"` with the bench tool (which uses criterion under the hood). The code is in a gist. Results are from a computer without CPU frequency scaling, nonessential daemons, or a desktop environment running:

[boxplot of the results]

| Statistic | Interquartile range / Median | Range / Median |
| --- | --- | --- |
| Least-squares slope | 0.3% | 1.7% |
| Theil-Sen slope | 0.3% | 1.2% |
| Mean | 0.2% | 0.9% |
| Median of means | 0.2% | 0.9% |
| Minimum of means | 0.1% | 0.5% |
| Quartile 1 of means | 0.1% | 0.5% |
| Quartile 3 of means | 0.3% | 1.0% |

(Note that the minimum, median and quartiles are not of individual runs, but of the mean loop iteration times. Getting the true quartiles is currently impossible — https://github.com/bos/criterion/issues/165 tracks this.)

I remember the relative reliability of these statistics being similar in other, less contrived benchmarks. So while R² can be useful for checking whether there are anomalies, the slope from linear regression seems useless, since the mean provides the same information with much less run-to-run variation. Am I missing something?
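For reference, the statistics compared in the table can be computed from per-batch (iterations, total time) pairs along these lines. This is a minimal sketch with hypothetical data, not the bench tool's or criterion's actual implementation:

```python
from itertools import combinations
from statistics import median

# Hypothetical (iterations, total_time_in_seconds) measurement pairs.
points = [(1, 2.1e-7), (2, 4.0e-7), (3, 6.2e-7), (4, 7.9e-7), (5, 10.1e-7)]

def least_squares_slope(pts):
    """Ordinary least-squares slope of total time against iteration count."""
    n = len(pts)
    mx = sum(x for x, _ in pts) / n
    my = sum(y for _, y in pts) / n
    cov = sum((x - mx) * (y - my) for x, y in pts)
    var = sum((x - mx) ** 2 for x, _ in pts)
    return cov / var

def theil_sen_slope(pts):
    """Median of the slopes over all pairs of points (robust to outliers)."""
    return median((y2 - y1) / (x2 - x1)
                  for (x1, y1), (x2, y2) in combinations(pts, 2)
                  if x2 != x1)

def per_batch_means(pts):
    """Mean iteration time within each batch: total time / iterations."""
    return [y / x for x, y in pts]

# "Median of means", "minimum of means", "quartile N of means" in the table
# are then just order statistics of per_batch_means(points), e.g.
# median(per_batch_means(points)) or min(per_batch_means(points)).
```

On clean data all three slope-like estimates agree; the table above is about how much each one wobbles across 60 repetitions of the same benchmark.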

pkkm avatar Apr 16 '18 22:04 pkkm