Redefine printed-out examples
Currently, ExplainaBoard prints out some error cases (for each bucket) in its report, but the way it does so is ad hoc. Here is a suggestion for how to improve this:
- we will define error scores with respect to each example (overall or in an individual bucket)
- either in `report.json` or the web interface, we write out a maximum of X examples with the highest error scores according to each definition of error score
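As a rough sketch of what this could look like (the function and field names below are made up for illustration and are not part of the current ExplainaBoard API):

```python
from typing import Callable

def top_error_examples(
    examples: list[dict],
    error_score: Callable[[dict], float],
    max_examples: int = 10,  # the "X" in the proposal above
) -> list[dict]:
    """Return at most `max_examples` examples with the highest error scores."""
    return sorted(examples, key=error_score, reverse=True)[:max_examples]

# Example: treat (1 - per-example metric score) as the error score for one bucket.
bucket = [
    {"id": 0, "metric_score": 0.9},
    {"id": 1, "metric_score": 0.2},
    {"id": 2, "metric_score": 0.5},
]
worst = top_error_examples(bucket, lambda ex: 1.0 - ex["metric_score"], max_examples=2)
# -> examples 1 and 2, which could then be written into report.json or shown in the UI
```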
Single System Analysis
The ranking function for examples can be:
- `random()`: This will output random examples
- `lowest_score(system, metric)`: This will output examples with the lowest score for each `metric` on system `system`
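A minimal sketch of what these two functions could look like, assuming each example is a dict that carries its per-metric scores under a hypothetical `metric_scores` field (again, not the current API):

```python
import random

def random_examples(examples: list[dict], k: int = 10) -> list[dict]:
    """Corresponds to random() above: output k examples chosen uniformly at random."""
    return random.sample(examples, min(k, len(examples)))

def lowest_score(system: list[dict], metric: str, k: int = 10) -> list[dict]:
    """Output the k examples of `system` with the lowest per-example scores for `metric`."""
    return sorted(system, key=lambda ex: ex["metric_scores"][metric])[:k]
```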
Multi-system Analysis
The functions for single-system analysis can be applied to each system individually, or we can use additional functions that take a holistic view:
- `lowest_average_score(systems, metric)`: Lowest `metric` score on average across all systems.
- `highest_score_variance(systems, metric)`: Highest `metric` score variance across all systems.
- `lowest_relative_score(systems, sys_id, metric)`: Lowest value of the `metric` for `systems[sys_id]` minus the average of all `metric` values for the other systems.
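A rough sketch of these three functions, assuming `systems` is a list of systems whose examples are aligned by index and carry per-example scores under the same hypothetical `metric_scores` field:

```python
from statistics import mean, pvariance

def lowest_average_score(systems: list[list[dict]], metric: str, k: int = 10) -> list[int]:
    """Indices of the k examples with the lowest mean `metric` score across all systems."""
    n = len(systems[0])
    avg = [mean(s[i]["metric_scores"][metric] for s in systems) for i in range(n)]
    return sorted(range(n), key=lambda i: avg[i])[:k]

def highest_score_variance(systems: list[list[dict]], metric: str, k: int = 10) -> list[int]:
    """Indices of the k examples whose `metric` score varies most across systems."""
    n = len(systems[0])
    var = [pvariance([s[i]["metric_scores"][metric] for s in systems]) for i in range(n)]
    return sorted(range(n), key=lambda i: var[i], reverse=True)[:k]

def lowest_relative_score(systems: list[list[dict]], sys_id: int, metric: str, k: int = 10) -> list[int]:
    """Indices of the k examples where systems[sys_id] most underperforms
    the average of the other systems on `metric`."""
    n = len(systems[0])
    others = [s for j, s in enumerate(systems) if j != sys_id]
    rel = [
        systems[sys_id][i]["metric_scores"][metric]
        - mean(s[i]["metric_scores"][metric] for s in others)
        for i in range(n)
    ]
    return sorted(range(n), key=lambda i: rel[i])[:k]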
For Single System Analysis, I think we can add another "metric score" column in the table (in this case, per bucket), along with a button to sort the scores ascending or descending. This would let users see the best-performing and worst-performing samples.
@pfliu-nlp After a discussion with @OscarWang114, we would like to bring up a few points of confusion/discussion:
- Is there a reason why we display only 50 examples (in this line)? Why not display all the examples in the bucket? Couldn't we load 10 examples each time the user clicks the "previous" or "next" page?
- Following up on the previous question, suppose we only want to display a selected number of examples. Then I think we need to maintain two lists for each metric (with n metrics, that is n * 2 `subsampled_id` lists in total): one for the highest metric scores and one for the lowest (a rough sketch follows this list). Currently we use the same list of `subsampled_ids` for every metric (i.e., only 2 lists), but this may not be the most informative approach, since an example might have a very high score on metric A but an average score on metric B (neither the highest nor the lowest).
- What is the best way to calculate a metric score given a single sample ID? The metric calculation seems to be designed to process a batch of sample IDs (which is useful for bucket analysis).
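To make the second point above concrete, here is a rough sketch of the n * 2 bookkeeping we have in mind, assuming each example already carries a per-example score for every metric (the field and function names are hypothetical, not the current ExplainaBoard data structures):

```python
def build_subsampled_ids(
    examples: list[dict],  # each example has an "id" and per-metric scores in "metric_scores"
    metrics: list[str],
    k: int = 50,
) -> dict[str, dict[str, list[int]]]:
    """For each metric, keep two ID lists: the k lowest-scoring and k highest-scoring examples."""
    subsampled: dict[str, dict[str, list[int]]] = {}
    for metric in metrics:
        ranked = sorted(examples, key=lambda ex: ex["metric_scores"][metric])
        subsampled[metric] = {
            "lowest": [ex["id"] for ex in ranked[:k]],
            "highest": [ex["id"] for ex in ranked[-k:][::-1]],  # highest score first
        }
    return subsampled
```

With n metrics this yields n * 2 `subsampled_id` lists, so the UI could show the most and least problematic examples per metric rather than reusing one shared list.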