scout Improved outputs to analyze integration test results

Fixes https://github.com/trynthink/scout/issues/415

Introduces a class to compare integration test results on a branch with the results stored on master. A previous PR (https://github.com/trynthink/scout/pull/440) added all integration test results to master. This PR provides a way of evaluating the differences between the working branch and master. The changes include:

- Store plot PDFs as CI artifacts so that they can be visually compared against master
- New script compare_results.py to:
  - Output differences in keys between the branches' agg_results and ecm_results (output to agg_results_key_diffs.csv and ecm_results_key_diffs.csv)
  - Output percent differences in values between the branches' agg_results and ecm_results, as long as values meet an absolute threshold and the differences meet a percent threshold (output to agg_results_value_diffs.csv and ecm_results_value_diffs.csv)
  - Output percent differences in values between branches' Summary_Data-MAP.xlsx and Summary_Data-TP.xlsx (output to Summary_Data-MAP_percent_diffs.csv and Summary_Data-TP_percent_diffs.csv)
- Update the GitHub Actions workflow so that when there are differences between the branch and master agg_results.json or ecm_results.json, then:
  - Commit new results and plots (same as before)
  - Pull down agg_results.json, ecm_results.json, Summary_Data-TP.xlsx, Summary_Data-MAP.xlsx from master, store in tests/integration_tests/results_base
  - Run tests/integration_tests/compare_results.py
  - Store the output CSVs described above as CI artifacts
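
To illustrate the key comparison described above, here is a minimal sketch of how key differences between two nested results dictionaries could be found. It is not the code in compare_results.py; the function names `flatten_keys` and `key_diffs` and the sample data are hypothetical.

```python
def flatten_keys(obj, prefix=()):
    """Return the set of key paths (as tuples) for every node in a nested dict."""
    paths = set()
    if isinstance(obj, dict):
        for key, value in obj.items():
            path = prefix + (key,)
            paths.add(path)
            paths |= flatten_keys(value, path)
    return paths


def key_diffs(base, new):
    """Return (paths only in base, paths only in new), each sorted."""
    base_paths = flatten_keys(base)
    new_paths = flatten_keys(new)
    return sorted(base_paths - new_paths), sorted(new_paths - base_paths)


# Hypothetical stand-ins for the master and branch ecm_results.json contents.
base = {"ECM A": {"energy": 1}, "ECM B": {"energy": 2}}
new = {"ECM A": {"energy": 1, "cost": 3}}

only_base, only_new = key_diffs(base, new)
for path in only_base:
    print("only in base:", " / ".join(path))
for path in only_new:
    print("only in new:", " / ".join(path))
```

Reporting full key paths rather than just top-level names makes it possible to see whether a whole ECM was dropped or only one of its nested fields changed.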

Example Outputs

Example CI artifacts are found at https://github.com/trynthink/scout/actions/runs/11943859504

Example *_results_key_diffs.csv:

Example *_results_value_diffs.csv:

Example Summary_Data-*_percent_diffs.xlsx: Same format as original xlsx files, but values are the percent differences

Nov 16 '24 01:11 aspeake

Remaining tasks

Revisit threshold for reporting diffs (is 10% and 1,000 correct?)
Add two columns to *_results_value_diffs.csv that show the original, absolute values to provide context for the percentage diffs.

Jan 24 '25 21:01 aspeake

@jtlangevin this is ready for your review (pending CI). Per your comment, I updated the absolute threshold for reporting to depend on the units being compared: it is 1,000 for cost or energy, and 10 for emissions. This means that at least one of the two values being compared must exceed the threshold (not that the difference between them must exceed it). The percent threshold remains 10% for all units.
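
As a rough sketch of the rule described above (not the actual implementation), the check could look like the following. The names `ABS_THRESHOLDS`, `PCT_THRESHOLD`, and `reportable_percent_diff` are hypothetical, as are the unit category labels, the choice of master as the denominator, and the handling of a zero base value.

```python
import math

# Absolute thresholds by unit category, per the comment above (labels assumed).
ABS_THRESHOLDS = {"cost": 1000.0, "energy": 1000.0, "emissions": 10.0}
PCT_THRESHOLD = 10.0  # percent, same for all units


def reportable_percent_diff(base, new, units):
    """Return the percent difference if it should be reported, else None."""
    # At least one of the two values must exceed the absolute threshold;
    # the size of the difference itself is not compared to this threshold.
    if max(abs(base), abs(new)) <= ABS_THRESHOLDS[units]:
        return None
    if base == 0:
        # Assumption: a change from zero is always reported, as +/- infinity.
        return math.copysign(math.inf, new)
    pct = 100.0 * (new - base) / abs(base)
    return pct if abs(pct) >= PCT_THRESHOLD else None


print(reportable_percent_diff(2000, 2300, "energy"))  # large values, 15% change
print(reportable_percent_diff(500, 900, "cost"))      # both values under 1,000
print(reportable_percent_diff(20, 21, "emissions"))   # only a 5% change
```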

A test case with results that change can be found here: https://github.com/trynthink/scout/actions/runs/13660717679. In the artifacts you will see *_diffs.csv files that summarize key and value differences, as well as Summary_Data* files for differences in the summary files. Because results changed in that branch, the CI automatically uploads the new results and now also the plots: https://github.com/trynthink/scout/pull/469/commits/5d8680e33c113472e6d550da57d71f096077dc6e

Mar 04 '25 19:03 aspeake

Thanks, I see the plots in the commit which is very helpful.

In the artifacts it looks like only the aggregate results are being differenced, and not results for individual ECMs; e.g., agg_results_value_diffs.csv exists but there is no similar file for individual ECMs. There is only the file ecm_results_key_diffs.csv, and it's unclear how that file is supposed to be read. I think we'll want to isolate which individual ECMs are causing the differences in values for cases where we change calculations that should only apply to a certain ECM or certain ECMs.

Mar 10 '25 15:03 jtlangevin

The artifacts in that dummy PR depend on how the results changed. I just trimmed down the list of ECMs, meaning that there are a lot of differences in the JSON keys for ecm_results (found in ecm_results_key_diffs.csv), but the actual values of the common ECMs did not change, so ecm_results_value_diffs.csv was not produced.
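
A small sketch of that behavior, under the assumption that values are compared only on key paths present in both results and that no file is written when nothing differs. The helper names `leaf_values`, `value_diff_rows`, and `value_diffs_csv` and the sample data are hypothetical, and the thresholds are left out for brevity.

```python
import csv
import io


def leaf_values(obj, prefix=()):
    """Map each key path (tuple) to its numeric leaf value in a nested dict."""
    leaves = {}
    for key, value in obj.items():
        path = prefix + (key,)
        if isinstance(value, dict):
            leaves.update(leaf_values(value, path))
        elif isinstance(value, (int, float)):
            leaves[path] = value
    return leaves


def value_diff_rows(base, new):
    """Rows for leaves present in both inputs whose values differ."""
    base_leaves, new_leaves = leaf_values(base), leaf_values(new)
    common = sorted(base_leaves.keys() & new_leaves.keys())
    return [(" / ".join(p), base_leaves[p], new_leaves[p])
            for p in common if base_leaves[p] != new_leaves[p]]


def value_diffs_csv(base, new):
    """Return CSV text for the value differences, or None if there are none."""
    rows = value_diff_rows(base, new)
    if not rows:
        return None  # nothing to report, so no file would be produced
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(["key_path", "base", "new"])
    writer.writerows(rows)
    return buf.getvalue()


base = {"ECM A": {"energy": 5000.0}, "ECM B": {"energy": 7000.0}}
trimmed = {"ECM A": {"energy": 5000.0}}   # ECM B removed, ECM A unchanged
changed = {"ECM A": {"energy": 6000.0}}   # ECM B removed, ECM A changed

print(value_diffs_csv(base, trimmed))  # no common value changed
print(value_diffs_csv(base, changed))
```

With the trimmed input, the removed ECM shows up only as a key difference, which matches the case where ecm_results_key_diffs.csv exists but no value diffs file does.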

In the PR description, the second screenshot under "Example *_results_value_diffs.csv:" shows pretty closely what ecm_results_value_diffs.csv would look like.

Mar 10 '25 16:03 aspeake