Redefine printed-out examples
Currently, ExplainaBoard prints out some error cases (for each bucket) in its report, but the way it does so is ad hoc. Here is a suggestion for how to improve this:
- we will define error scores with respect to each example (overall or in an individual bucket)
- either in `report.json` or the web interface, we write out a maximum of X examples with the highest error scores according to each definition of error score
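As a rough sketch of what this could look like (the function and field names below are made up for illustration and are not part of the current ExplainaBoard API):

```python
from typing import Callable

def top_error_examples(
    examples: list[dict],
    error_score: Callable[[dict], float],
    max_examples: int = 10,  # the "X" in the proposal above
) -> list[dict]:
    """Return at most `max_examples` examples with the highest error scores."""
    return sorted(examples, key=error_score, reverse=True)[:max_examples]

# Example: treat (1 - per-example metric score) as the error score for one bucket.
bucket = [
    {"id": 0, "metric_score": 0.9},
    {"id": 1, "metric_score": 0.2},
    {"id": 2, "metric_score": 0.5},
]
worst = top_error_examples(bucket, lambda ex: 1.0 - ex["metric_score"], max_examples=2)
# -> examples 1 and 2, which could then be written into report.json or shown in the UI
```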
Single System Analysis
The ranking function for examples can be:
- `random()`: This will output random examples
- `lowest_score(system, metric)`: This will output examples with the lowest score for each `metric` on system `system`
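A minimal sketch of what these two functions could look like, assuming each example is a dict that carries its per-metric scores under a hypothetical `metric_scores` field (again, not the current API):

```python
import random

def random_examples(examples: list[dict], k: int = 10) -> list[dict]:
    """Corresponds to random() above: output k examples chosen uniformly at random."""
    return random.sample(examples, min(k, len(examples)))

def lowest_score(system: list[dict], metric: str, k: int = 10) -> list[dict]:
    """Output the k examples of `system` with the lowest per-example scores for `metric`."""
    return sorted(system, key=lambda ex: ex["metric_scores"][metric])[:k]
```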
Multi-system Analysis
The functions for single-system analysis can be applied to each system individually, or we can use additional functions that take a holistic view:
- `lowest_average_score(systems, metric)`: Lowest `metric` score on average across all systems.
- `highest_score_variance(systems, metric)`: Highest `metric` score variance across all systems.
- `lowest_relative_score(systems, sys_id, metric)`: Lowest value of the `metric` for `systems[sys_id]` minus the average of all `metric` values for the other systems.
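A rough sketch of these three functions, assuming `systems` is a list of systems whose examples are aligned by index and carry per-example scores under the same hypothetical `metric_scores` field:

```python
from statistics import mean, pvariance

def lowest_average_score(systems: list[list[dict]], metric: str, k: int = 10) -> list[int]:
    """Indices of the k examples with the lowest mean `metric` score across all systems."""
    n = len(systems[0])
    avg = [mean(s[i]["metric_scores"][metric] for s in systems) for i in range(n)]
    return sorted(range(n), key=lambda i: avg[i])[:k]

def highest_score_variance(systems: list[list[dict]], metric: str, k: int = 10) -> list[int]:
    """Indices of the k examples whose `metric` score varies most across systems."""
    n = len(systems[0])
    var = [pvariance([s[i]["metric_scores"][metric] for s in systems]) for i in range(n)]
    return sorted(range(n), key=lambda i: var[i], reverse=True)[:k]

def lowest_relative_score(systems: list[list[dict]], sys_id: int, metric: str, k: int = 10) -> list[int]:
    """Indices of the k examples where systems[sys_id] most underperforms
    the average of the other systems on `metric`."""
    n = len(systems[0])
    others = [s for j, s in enumerate(systems) if j != sys_id]
    rel = [
        systems[sys_id][i]["metric_scores"][metric]
        - mean(s[i]["metric_scores"][metric] for s in others)
        for i in range(n)
    ]
    return sorted(range(n), key=lambda i: rel[i])[:k]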
For Single System Analysis, I think we can add another "metric score" column in the table (in this case, per bucket), along with a button to sort the scores ascending or descending. This would let users see the best-performing and worst-performing samples.
@pfliu-nlp After a discussion with @OscarWang114, we would like to bring up a few points of confusion/discussion:
- Is there a reason why we display only 50 examples (in this line)? Why not display all the examples in the bucket? Couldn't we load 10 examples each time the user clicks the "previous" or "next" page?
- Following up on the previous question, suppose we only want to display a selected number of examples. Then I think we need to maintain two lists for each metric (with n metrics, that is n * 2 `subsampled_id` lists in total): one for the highest metric scores and one for the lowest (a rough sketch follows this list). Currently we use the same list of `subsampled_ids` for every metric (i.e., only 2 lists), but this may not be the most informative approach, since an example might have a very high score on metric A but an average score on metric B (neither the highest nor the lowest).
- What is the best way to calculate a metric score given a single sample ID? The metric calculation seems to be designed to process a batch of sample IDs (which is useful for bucket analysis).
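To make the second point above concrete, here is a rough sketch of the n * 2 bookkeeping we have in mind, assuming each example already carries a per-example score for every metric (the field and function names are hypothetical, not the current ExplainaBoard data structures):

```python
def build_subsampled_ids(
    examples: list[dict],  # each example has an "id" and per-metric scores in "metric_scores"
    metrics: list[str],
    k: int = 50,
) -> dict[str, dict[str, list[int]]]:
    """For each metric, keep two ID lists: the k lowest-scoring and k highest-scoring examples."""
    subsampled: dict[str, dict[str, list[int]]] = {}
    for metric in metrics:
        ranked = sorted(examples, key=lambda ex: ex["metric_scores"][metric])
        subsampled[metric] = {
            "lowest": [ex["id"] for ex in ranked[:k]],
            "highest": [ex["id"] for ex in ranked[-k:][::-1]],  # highest score first
        }
    return subsampled
```

With n metrics this yields n * 2 `subsampled_id` lists, so the UI could show the most and least problematic examples per metric rather than reusing one shared list.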