
Clarify and Update Meaning of Collection Table

Open rylev opened this issue 4 years ago • 5 comments

This issue is an extension of https://github.com/rust-lang/rustc-perf/issues/937 which aims to add more clarity to the database schema.

This issue uses the new glossary for terms. Please make sure to first refer to that glossary.

Status Quo

Currently, the collection table is used to identify a single test.

For example, given an artifact, the benchmark "hello world", the profile "debug", and the "empty" incremental cache state scenario, the collector collects many different statistics. All of those statistics are identified by one collection. If the same test is run again, a different collection record is created.
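The relationship described above can be sketched roughly as follows. This is a hypothetical, simplified model for illustration only; the struct and field names are not the actual rustc-perf schema.

```rust
// Hypothetical sketch of how a collection relates to a test case and its
// statistics; names are illustrative, not the real rustc-perf types.
#[derive(Clone, Debug, PartialEq, Eq, Hash)]
struct TestCase {
    benchmark: String, // e.g. "hello world"
    profile: String,   // e.g. "debug"
    scenario: String,  // e.g. the "empty" incremental cache state
}

#[derive(Clone, Debug, PartialEq)]
struct Statistic {
    collection_id: u32, // one id per execution of the test
    test_case: TestCase,
    artifact_id: u32,
    metric: String, // e.g. "instructions", "max-rss"
    value: f64,
}

// Two rows describe the same test if everything but the collection id
// (and the measured value) matches.
fn same_test(a: &Statistic, b: &Statistic) -> bool {
    a.test_case == b.test_case && a.artifact_id == b.artifact_id && a.metric == b.metric
}

fn main() {
    let case = TestCase {
        benchmark: "hello world".into(),
        profile: "debug".into(),
        scenario: "empty".into(),
    };
    let first = Statistic {
        collection_id: 1,
        test_case: case.clone(),
        artifact_id: 42,
        metric: "instructions".into(),
        value: 1000.0,
    };
    // Re-running the same test produces a row differing only in
    // collection_id (and, typically, the measured value).
    let rerun = Statistic { collection_id: 2, value: 990.0, ..first.clone() };
    assert!(same_test(&first, &rerun));
    assert_ne!(first.collection_id, rerun.collection_id);
}
```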

However, we currently ignore this table entirely: it is never referenced in any read query, only in inserts.

If we run multiple test iterations, a different collection is created for each iteration.

Questions

Is this useful? Do we care about distinguishing test results generated by different runs of the same test case and artifact combination?

rylev avatar Jul 26 '21 17:07 rylev

As far as I can tell, multiple collections are used in at least one minor way. When stat values are obtained from the pstat and rustc_compilation tables (in Connection::get_pstats and Connection::get_bootstrap), the minimum values are taken across all collections for the test case and artifact combination. (Edit: I just noticed that getting self-profile stats from the self_profile_query table is handled similarly.) Though it looks like only one collection is done for the "rustc" benchmark currently.

I would have expected the average to be used, rather than the minimum, personally. But either way, I also wonder if the benefits of multiple collections outweigh the costs.
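The minimum-versus-average trade-off discussed above can be made concrete with a small sketch. This is illustrative code under assumed names, not the actual rustc-perf aggregation logic:

```rust
// Illustrative sketch (not rustc-perf code): given samples of the same
// metric for one (test case, artifact) pair -- one sample per
// collection -- compare the two aggregation strategies.
fn min_across_collections(samples: &[(u32, f64)]) -> Option<f64> {
    // (collection_id, value) pairs; the id is ignored by the aggregation.
    samples.iter().map(|&(_, v)| v).reduce(f64::min)
}

fn avg_across_collections(samples: &[(u32, f64)]) -> Option<f64> {
    if samples.is_empty() {
        return None;
    }
    Some(samples.iter().map(|&(_, v)| v).sum::<f64>() / samples.len() as f64)
}

fn main() {
    // Three collections of the same test; collection 3 is a noisy outlier.
    let samples = [(1, 1000.0), (2, 900.0), (3, 1100.0)];
    // The minimum picks the best observed run, discarding the noise...
    assert_eq!(min_across_collections(&samples), Some(900.0));
    // ...while the average is pulled toward the outlier.
    assert_eq!(avg_across_collections(&samples), Some(1000.0));
}
```

This is the sense in which taking the minimum "reduces noise": measurement noise on these metrics is mostly additive, so the smallest observed value is closest to the true cost.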

tgnottingham avatar Jul 26 '21 20:07 tgnottingham

The minimum is "standard practice" (in the sense that a few posts online claimed so when I was doing research, and a few people indicated similarly) since it reduces noise by finding the best possible variant.

I don't think we have practical data on whether multiple collections are beneficial in terms of actually reducing noise, though. Maybe they're not.


The collection table was intended so that we could know if a comparison is made across different versions of the underlying collector, and the idea was that we could even automatically rerun known interesting comparisons (e.g., the base commit for any try builds would be scheduled for running on the new collector version if the data we currently have is only for the old one). This is, obviously, not currently implemented.

The secondary goal is that currently, we show the minimum per metric for a given run. That means that you can have an instruction count from one build and the max-rss from a completely different build, for example, with no indication in the UI. This may be undesirable; the collection table would allow at least detecting this, and possibly fixing it entirely.
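The hazard described in this paragraph can be sketched as follows. This is illustrative code under hypothetical names, not the actual rustc-perf query logic:

```rust
use std::collections::HashMap;

// Illustrative sketch: each collection records several metrics, and
// taking the minimum per metric *independently* can stitch together
// values from different collections.
fn per_metric_min(rows: &[(u32, &str, f64)]) -> HashMap<String, (u32, f64)> {
    let mut best: HashMap<String, (u32, f64)> = HashMap::new();
    for &(collection_id, metric, value) in rows {
        best.entry(metric.to_string())
            .and_modify(|e| {
                if value < e.1 {
                    *e = (collection_id, value);
                }
            })
            .or_insert((collection_id, value));
    }
    best
}

fn main() {
    // Two collections of the same test case and artifact.
    let rows = [
        (1, "instructions", 1000.0),
        (1, "max-rss", 50.0),
        (2, "instructions", 900.0),
        (2, "max-rss", 60.0),
    ];
    let best = per_metric_min(&rows);
    // The reported "minimum" result mixes collection 2 (instructions)
    // with collection 1 (max-rss) -- no single build produced it.
    assert_eq!(best["instructions"], (2, 900.0));
    assert_eq!(best["max-rss"], (1, 50.0));
}
```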


Ultimately, it's quite possible we could apply min/max/avg at collection time across N runs and then store a single final value in the table. Maybe that's better. One of my pipe dreams was to do standard-deviation comparisons within a particular artifact, similar to hyperfine or similar tooling -- not like the recently implemented over-time comparisons. But that's likely never going to be feasible; I'd guess it would take too long in practice.

Mark-Simulacrum avatar Jul 26 '21 21:07 Mark-Simulacrum

I would have expected the average to be used, rather than the minimum, personally. But either way, I also wonder if the benefits of multiple collections outweigh the costs.

First, let's not use this issue to discuss whether multiple runs are appropriate and whether the minimum or average should be used. If we feel the status quo is inappropriate we can open another issue. This issue is purely about the existence of the collection table.

the minimum values are taken across all collections for the test case and artifact combination.

Indeed, but I don't think this requires a separate table. The query used in get_pstats does not reference the collection table at all. Test case (currently modeled roughly as "series") and aid (artifact id) combinations are not forced to be unique, and it would be fine for there to be more than one of the same combination.

So your statement above is true, and I believe how we want things to work, but it doesn't require the existence of a separate collection table.

@Mark-Simulacrum seems to indicate that neither of the use cases for the collection table is currently implemented. This is totally fine, but it's good to know that the collection table, as it currently stands, is not actually used for anything.

@wesleywiser and I were discussing extending this table in order to indicate things like the current environment that a collector runs in (e.g., linux vs windows). In order to do this, it's important to know what the table is currently used for.

Suggestion

We might consider renaming this table to tests (again, refer to the linked glossary). This table could then be used to indicate an actual instance of a test being run, which could include information like the version of the collector, the operating system it runs on (though this would likely be a join with a different table), etc.

rylev avatar Jul 27 '21 09:07 rylev

I'd be fine with renaming.

Test case (currently modeled roughly as "series") and aid (artifact id) combinations are not forced to be unique, and it would be fine for there to be more than one of the same combination.

To be clear, for the majority of benchmarks, (test case, aid) is not a unique identifier today. You definitely need to add the collection id to make it unique.

As far as I know, without the collection table there's no way to get the statistics for all the metrics we collect from a specific run. This means that if you're trying to do so, you do need that external data. I'm not sure, but it sounds like this might not be something well understood just yet?

Mark-Simulacrum avatar Jul 27 '21 10:07 Mark-Simulacrum

without the collection table there's no way to get the statistics for all the metrics we collect from a specific run

This is correct, though there is no code that takes advantage of this fact. We'll likely want to keep the table around in some form for exactly this purpose, but it's unclear in exactly what form. It really comes down to the smallest transactional unit of testing we want. If we're OK with that being at the benchmark and profile level, then the table as it stands is fine.

rylev avatar Jul 28 '21 12:07 rylev