[BUG] Cuda OOM with gpt-oss-20b
Describe the bug
lighteval accelerate "model_name=openai/gpt-oss-20b,max_length=4096,skip_special_tokens=False,batch_size=1" "leaderboard|mmlu|5" --save-details --output-dir "openai_scores"
goes OOM (lighteval installed from main, kernels 0.10.1, accelerate 1.6.0, transformers 4.56.1) on an RTX 5090.
Can this be fixed with some other settings?
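For what it's worth, a back-of-envelope estimate hints at why this card runs out of memory: the RTX 5090 has 32 GB, and a ~20B-parameter model already exceeds that if the weights end up materialized in 16-bit. This is only a rough sketch, assuming the weights dominate and ignoring KV cache and activation overhead:

```python
# Rough GPU memory estimate for a ~20B-parameter model (weights only).
# Assumption: KV cache and activations are ignored, so real usage is higher.
def weight_memory_gb(num_params: float, bytes_per_param: float) -> float:
    return num_params * bytes_per_param / 1e9

params = 20e9  # gpt-oss-20b: roughly 20B parameters (approximation)

print(f"bf16/fp16 weights: ~{weight_memory_gb(params, 2):.0f} GB")   # ~40 GB, above 32 GB
print(f"int8 weights:      ~{weight_memory_gb(params, 1):.0f} GB")   # ~20 GB
print(f"4-bit weights:     ~{weight_memory_gb(params, 0.5):.0f} GB") # ~10 GB
```

So if the checkpoint gets loaded (or dequantized) to bf16, the weights alone would not fit on a 32 GB card, independent of max_length or batch_size.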
lighteval accelerate "model_name=openai/gpt-oss-20b,batch_size=1" "leaderboard|mmlu|3" --save-details --output-dir "openai_scores"
gives very bad results:
| Task |Version|Metric|Value | |Stderr|
|------------------------------------------------------|-------|------|-----:|---|-----:|
|all | |acc |0.2683|± |0.0329|
|leaderboard:mmlu:_average:3 | |acc |0.2683|± |0.0329|
|leaderboard:mmlu:abstract_algebra:3 | |acc |0.3300|± |0.0473|
|leaderboard:mmlu:anatomy:3 | |acc |0.3481|± |0.0412|
|leaderboard:mmlu:astronomy:3 | |acc |0.3026|± |0.0374|
|leaderboard:mmlu:business_ethics:3 | |acc |0.2400|± |0.0429|
|leaderboard:mmlu:clinical_knowledge:3 | |acc |0.2566|± |0.0269|
|leaderboard:mmlu:college_biology:3 | |acc |0.2917|± |0.0380|
|leaderboard:mmlu:college_chemistry:3 | |acc |0.1600|± |0.0368|
|leaderboard:mmlu:college_computer_science:3 | |acc |0.2400|± |0.0429|
|leaderboard:mmlu:college_mathematics:3 | |acc |0.3100|± |0.0465|
|leaderboard:mmlu:college_medicine:3 | |acc |0.2832|± |0.0344|
|leaderboard:mmlu:college_physics:3 | |acc |0.1961|± |0.0395|
|leaderboard:mmlu:computer_security:3 | |acc |0.3200|± |0.0469|
|leaderboard:mmlu:conceptual_physics:3 | |acc |0.2638|± |0.0288|
|leaderboard:mmlu:econometrics:3 | |acc |0.2719|± |0.0419|
|leaderboard:mmlu:electrical_engineering:3 | |acc |0.2759|± |0.0372|
|leaderboard:mmlu:elementary_mathematics:3 | |acc |0.2831|± |0.0232|
|leaderboard:mmlu:formal_logic:3 | |acc |0.1587|± |0.0327|
|leaderboard:mmlu:global_facts:3 | |acc |0.3000|± |0.0461|
|leaderboard:mmlu:high_school_biology:3 | |acc |0.2774|± |0.0255|
|leaderboard:mmlu:high_school_chemistry:3 | |acc |0.3005|± |0.0323|
|leaderboard:mmlu:high_school_computer_science:3 | |acc |0.2800|± |0.0451|
|leaderboard:mmlu:high_school_european_history:3 | |acc |0.2848|± |0.0352|
|leaderboard:mmlu:high_school_geography:3 | |acc |0.2980|± |0.0326|
|leaderboard:mmlu:high_school_government_and_politics:3| |acc |0.3212|± |0.0337|
|leaderboard:mmlu:high_school_macroeconomics:3 | |acc |0.2128|± |0.0208|
|leaderboard:mmlu:high_school_mathematics:3 | |acc |0.2444|± |0.0262|
|leaderboard:mmlu:high_school_microeconomics:3 | |acc |0.2143|± |0.0267|
|leaderboard:mmlu:high_school_physics:3 | |acc |0.2517|± |0.0354|
|leaderboard:mmlu:high_school_psychology:3 | |acc |0.3138|± |0.0199|
|leaderboard:mmlu:high_school_statistics:3 | |acc |0.2083|± |0.0277|
|leaderboard:mmlu:high_school_us_history:3 | |acc |0.2598|± |0.0308|
|leaderboard:mmlu:high_school_world_history:3 | |acc |0.2700|± |0.0289|
|leaderboard:mmlu:human_aging:3 | |acc |0.3184|± |0.0313|
|leaderboard:mmlu:human_sexuality:3 | |acc |0.2824|± |0.0395|
|leaderboard:mmlu:international_law:3 | |acc |0.3058|± |0.0421|
|leaderboard:mmlu:jurisprudence:3 | |acc |0.2130|± |0.0396|
|leaderboard:mmlu:logical_fallacies:3 | |acc |0.3006|± |0.0360|
|leaderboard:mmlu:machine_learning:3 | |acc |0.2679|± |0.0420|
|leaderboard:mmlu:management:3 | |acc |0.1845|± |0.0384|
|leaderboard:mmlu:marketing:3 | |acc |0.3077|± |0.0302|
|leaderboard:mmlu:medical_genetics:3 | |acc |0.2300|± |0.0423|
|leaderboard:mmlu:miscellaneous:3 | |acc |0.3001|± |0.0164|
|leaderboard:mmlu:moral_disputes:3 | |acc |0.3092|± |0.0249|
|leaderboard:mmlu:moral_scenarios:3 | |acc |0.2559|± |0.0146|
|leaderboard:mmlu:nutrition:3 | |acc |0.2549|± |0.0250|
|leaderboard:mmlu:philosophy:3 | |acc |0.3119|± |0.0263|
|leaderboard:mmlu:prehistory:3 | |acc |0.2901|± |0.0253|
|leaderboard:mmlu:professional_accounting:3 | |acc |0.3227|± |0.0279|
|leaderboard:mmlu:professional_law:3 | |acc |0.2699|± |0.0113|
|leaderboard:mmlu:professional_medicine:3 | |acc |0.1654|± |0.0226|
|leaderboard:mmlu:professional_psychology:3 | |acc |0.2941|± |0.0184|
|leaderboard:mmlu:public_relations:3 | |acc |0.2091|± |0.0390|
|leaderboard:mmlu:security_studies:3 | |acc |0.2367|± |0.0272|
|leaderboard:mmlu:sociology:3 | |acc |0.2736|± |0.0315|
|leaderboard:mmlu:us_foreign_policy:3 | |acc |0.2800|± |0.0451|
|leaderboard:mmlu:virology:3 | |acc |0.2530|± |0.0338|
|leaderboard:mmlu:world_religions:3 | |acc |0.2865|± |0.0347|
Hey! You should look at the details to figure out why the scores are low, here is how: https://huggingface.co/docs/lighteval/main/en/saving-and-reading-results
Not sure what I should look at, but this simple command line is supposed to spit out a score close to the 5-shot score of > 80, I guess. Am I missing something?
You need to look at the detailed sample-by-sample results to see if there is an issue with the evals. Low scores can be attributed to many things (wrong parameters is the most common). All is explained in the doc.
Side note: please copy-paste the relevant information into your message, as I will not download files and read them on my machine, thanks!
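If it helps, something like the following lets you eyeball a few samples. This is only a sketch: the directory layout and column names are assumptions based on the docs linked above, so adjust the glob to whatever `--save-details` actually wrote:

```python
# Minimal sketch: load a lighteval details file and inspect a few samples.
# Paths and column names are guesses; check your output directory for the real ones.
import glob
import pandas as pd

# --save-details writes parquet files somewhere under <output-dir>/details/
files = glob.glob("openai_scores/details/**/*.parquet", recursive=True)
df = pd.read_parquet(files[0])

print(df.columns.tolist())  # see which fields are available
print(df.iloc[0])           # full prompt, prediction, and gold for one sample
```

Looking at a handful of rows usually makes it obvious whether the prompt is off or the model is answering in a format the metric does not expect.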
Well, if you guys ran MMLU on gpt-oss-20b with lighteval, I would be delighted to see the scores.
That is what I am getting as well with lm_eval for MMLU. I wonder which MMLU setup OpenAI tested with? Was it with tools?
85.3 https://openai.com/fr-FR/index/introducing-gpt-oss/
You might want to use the HELM version of MMLU; you are using loglikelihood evaluation for an instruct model. I'm pretty sure they used generative evaluation with multiple sampling.
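Roughly, the difference between the two suites is the scoring mechanism. This is a conceptual sketch only (not lighteval internals); `logprob_of` and `generate` are hypothetical stand-ins for the model calls:

```python
# Two common ways to score a multiple-choice question; they can disagree a lot
# for instruct/chat models that prefer to produce an explained answer.

def score_loglikelihood(logprob_of, question, choices, gold_idx):
    # "leaderboard|mmlu"-style: pick the choice the model assigns the highest
    # log-probability to as a continuation of the prompt.
    scores = [logprob_of(question, choice) for choice in choices]
    return int(max(range(len(choices)), key=scores.__getitem__) == gold_idx)

def score_generative(generate, question, choices, gold_letter):
    # "helm|mmlu"-style: let the model generate an answer and exact-match the
    # extracted letter against the gold answer.
    answer = generate(question).strip()
    return int(answer[:1].upper() == gold_letter)
```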
lighteval accelerate "model_name=openai/gpt-oss-20b,batch_size=1" "helm|mmlu|3" --save-details --output-dir "openai_scores"
| Task |Version| Metric |Value | |Stderr|
|-----------------------------------------------|-------|------------------------------------------------------|-----:|---|-----:|
|all | |em |0.0000|± |0.0000|
| | |em_with_normalize_gold&normalize_pred |0.0000|± |0.0000|
| | |em_with_type_exact_match |0.0000|± |0.0000|
| | |em_with_normalize_gold&normalize_pred&type_exact_match|0.2312|± |0.0315|
|helm:mmlu:_average:3 | |em |0.0000|± |0.0000|
| | |em_with_normalize_gold&normalize_pred |0.0000|± |0.0000|
| | |em_with_type_exact_match |0.0000|± |0.0000|
| | |em_with_normalize_gold&normalize_pred&type_exact_match|0.2312|± |0.0315|
|helm:mmlu:abstract_algebra:3 | |em |0.0000|± |0.0000|
| | |em_with_normalize_gold&normalize_pred |0.0000|± |0.0000|
| | |em_with_type_exact_match |0.0000|± |0.0000|
| | |em_with_normalize_gold&normalize_pred&type_exact_match|0.2200|± |0.0416|
|helm:mmlu:anatomy:3 | |em |0.0000|± |0.0000|
| | |em_with_normalize_gold&normalize_pred |0.0000|± |0.0000|
| | |em_with_type_exact_match |0.0000|± |0.0000|
| | |em_with_normalize_gold&normalize_pred&type_exact_match|0.1852|± |0.0336|
|helm:mmlu:astronomy:3 | |em |0.0000|± |0.0000|
| | |em_with_normalize_gold&normalize_pred |0.0000|± |0.0000|
| | |em_with_type_exact_match |0.0000|± |0.0000|
| | |em_with_normalize_gold&normalize_pred&type_exact_match|0.1776|± |0.0311|
|helm:mmlu:business_ethics:3 | |em |0.0000|± |0.0000|
| | |em_with_normalize_gold&normalize_pred |0.0000|± |0.0000|
| | |em_with_type_exact_match |0.0000|± |0.0000|
| | |em_with_normalize_gold&normalize_pred&type_exact_match|0.3000|± |0.0461|
|helm:mmlu:clinical_knowledge:3 | |em |0.0000|± |0.0000|
| | |em_with_normalize_gold&normalize_pred |0.0000|± |0.0000|
| | |em_with_type_exact_match |0.0000|± |0.0000|
| | |em_with_normalize_gold&normalize_pred&type_exact_match|0.2151|± |0.0253|
|helm:mmlu:college_biology:3 | |em |0.0000|± |0.0000|
| | |em_with_normalize_gold&normalize_pred |0.0000|± |0.0000|
| | |em_with_type_exact_match |0.0000|± |0.0000|
| | |em_with_normalize_gold&normalize_pred&type_exact_match|0.2569|± |0.0365|
|helm:mmlu:college_chemistry:3 | |em |0.0000|± |0.0000|
| | |em_with_normalize_gold&normalize_pred |0.0000|± |0.0000|
| | |em_with_type_exact_match |0.0000|± |0.0000|
| | |em_with_normalize_gold&normalize_pred&type_exact_match|0.2000|± |0.0402|
|helm:mmlu:college_computer_science:3 | |em |0.0000|± |0.0000|
| | |em_with_normalize_gold&normalize_pred |0.0000|± |0.0000|
| | |em_with_type_exact_match |0.0000|± |0.0000|
| | |em_with_normalize_gold&normalize_pred&type_exact_match|0.2600|± |0.0441|
|helm:mmlu:college_mathematics:3 | |em |0.0000|± |0.0000|
| | |em_with_normalize_gold&normalize_pred |0.0000|± |0.0000|
| | |em_with_type_exact_match |0.0000|± |0.0000|
| | |em_with_normalize_gold&normalize_pred&type_exact_match|0.2100|± |0.0409|
|helm:mmlu:college_medicine:3 | |em |0.0000|± |0.0000|
| | |em_with_normalize_gold&normalize_pred |0.0000|± |0.0000|
| | |em_with_type_exact_match |0.0000|± |0.0000|
| | |em_with_normalize_gold&normalize_pred&type_exact_match|0.2081|± |0.0310|
|helm:mmlu:college_physics:3 | |em |0.0000|± |0.0000|
| | |em_with_normalize_gold&normalize_pred |0.0000|± |0.0000|
| | |em_with_type_exact_match |0.0000|± |0.0000|
| | |em_with_normalize_gold&normalize_pred&type_exact_match|0.2157|± |0.0409|
|helm:mmlu:computer_security:3 | |em |0.0000|± |0.0000|
| | |em_with_normalize_gold&normalize_pred |0.0000|± |0.0000|
| | |em_with_type_exact_match |0.0000|± |0.0000|
| | |em_with_normalize_gold&normalize_pred&type_exact_match|0.2800|± |0.0451|
|helm:mmlu:conceptual_physics:3 | |em |0.0000|± |0.0000|
| | |em_with_normalize_gold&normalize_pred |0.0000|± |0.0000|
| | |em_with_type_exact_match |0.0000|± |0.0000|
| | |em_with_normalize_gold&normalize_pred&type_exact_match|0.2638|± |0.0288|
|helm:mmlu:econometrics:3 | |em |0.0000|± |0.0000|
| | |em_with_normalize_gold&normalize_pred |0.0000|± |0.0000|
| | |em_with_type_exact_match |0.0000|± |0.0000|
| | |em_with_normalize_gold&normalize_pred&type_exact_match|0.2368|± |0.0400|
|helm:mmlu:electrical_engineering:3 | |em |0.0000|± |0.0000|
| | |em_with_normalize_gold&normalize_pred |0.0000|± |0.0000|
| | |em_with_type_exact_match |0.0000|± |0.0000|
| | |em_with_normalize_gold&normalize_pred&type_exact_match|0.2414|± |0.0357|
|helm:mmlu:elementary_mathematics:3 | |em |0.0000|± |0.0000|
| | |em_with_normalize_gold&normalize_pred |0.0000|± |0.0000|
| | |em_with_type_exact_match |0.0000|± |0.0000|
| | |em_with_normalize_gold&normalize_pred&type_exact_match|0.2090|± |0.0209|
|helm:mmlu:formal_logic:3 | |em |0.0000|± |0.0000|
| | |em_with_normalize_gold&normalize_pred |0.0000|± |0.0000|
| | |em_with_type_exact_match |0.0000|± |0.0000|
| | |em_with_normalize_gold&normalize_pred&type_exact_match|0.2857|± |0.0404|
|helm:mmlu:global_facts:3 | |em |0.0000|± |0.0000|
| | |em_with_normalize_gold&normalize_pred |0.0000|± |0.0000|
| | |em_with_type_exact_match |0.0000|± |0.0000|
| | |em_with_normalize_gold&normalize_pred&type_exact_match|0.1800|± |0.0386|
|helm:mmlu:high_school_biology:3 | |em |0.0000|± |0.0000|
| | |em_with_normalize_gold&normalize_pred |0.0000|± |0.0000|
| | |em_with_type_exact_match |0.0000|± |0.0000|
| | |em_with_normalize_gold&normalize_pred&type_exact_match|0.1774|± |0.0217|
|helm:mmlu:high_school_chemistry:3 | |em |0.0000|± |0.0000|
| | |em_with_normalize_gold&normalize_pred |0.0000|± |0.0000|
| | |em_with_type_exact_match |0.0000|± |0.0000|
| | |em_with_normalize_gold&normalize_pred&type_exact_match|0.1527|± |0.0253|
|helm:mmlu:high_school_computer_science:3 | |em |0.0000|± |0.0000|
| | |em_with_normalize_gold&normalize_pred |0.0000|± |0.0000|
| | |em_with_type_exact_match |0.0000|± |0.0000|
| | |em_with_normalize_gold&normalize_pred&type_exact_match|0.2500|± |0.0435|
|helm:mmlu:high_school_european_history:3 | |em |0.0000|± |0.0000|
| | |em_with_normalize_gold&normalize_pred |0.0000|± |0.0000|
| | |em_with_type_exact_match |0.0000|± |0.0000|
| | |em_with_normalize_gold&normalize_pred&type_exact_match|0.2182|± |0.0323|
|helm:mmlu:high_school_geography:3 | |em |0.0000|± |0.0000|
| | |em_with_normalize_gold&normalize_pred |0.0000|± |0.0000|
| | |em_with_type_exact_match |0.0000|± |0.0000|
| | |em_with_normalize_gold&normalize_pred&type_exact_match|0.1768|± |0.0272|
|helm:mmlu:high_school_government_and_politics:3| |em |0.0000|± |0.0000|
| | |em_with_normalize_gold&normalize_pred |0.0000|± |0.0000|
| | |em_with_type_exact_match |0.0000|± |0.0000|
| | |em_with_normalize_gold&normalize_pred&type_exact_match|0.1969|± |0.0287|
|helm:mmlu:high_school_macroeconomics:3 | |em |0.0000|± |0.0000|
| | |em_with_normalize_gold&normalize_pred |0.0000|± |0.0000|
| | |em_with_type_exact_match |0.0000|± |0.0000|
| | |em_with_normalize_gold&normalize_pred&type_exact_match|0.2026|± |0.0204|
|helm:mmlu:high_school_mathematics:3 | |em |0.0000|± |0.0000|
| | |em_with_normalize_gold&normalize_pred |0.0000|± |0.0000|
| | |em_with_type_exact_match |0.0000|± |0.0000|
| | |em_with_normalize_gold&normalize_pred&type_exact_match|0.2111|± |0.0249|
|helm:mmlu:high_school_microeconomics:3 | |em |0.0000|± |0.0000|
| | |em_with_normalize_gold&normalize_pred |0.0000|± |0.0000|
| | |em_with_type_exact_match |0.0000|± |0.0000|
| | |em_with_normalize_gold&normalize_pred&type_exact_match|0.2101|± |0.0265|
|helm:mmlu:high_school_physics:3 | |em |0.0000|± |0.0000|
| | |em_with_normalize_gold&normalize_pred |0.0000|± |0.0000|
| | |em_with_type_exact_match |0.0000|± |0.0000|
| | |em_with_normalize_gold&normalize_pred&type_exact_match|0.1987|± |0.0326|
|helm:mmlu:high_school_psychology:3 | |em |0.0000|± |0.0000|
| | |em_with_normalize_gold&normalize_pred |0.0000|± |0.0000|
| | |em_with_type_exact_match |0.0000|± |0.0000|
| | |em_with_normalize_gold&normalize_pred&type_exact_match|0.1927|± |0.0169|
|helm:mmlu:high_school_statistics:3 | |em |0.0000|± |0.0000|
| | |em_with_normalize_gold&normalize_pred |0.0000|± |0.0000|
| | |em_with_type_exact_match |0.0000|± |0.0000|
| | |em_with_normalize_gold&normalize_pred&type_exact_match|0.1528|± |0.0245|
|helm:mmlu:high_school_us_history:3 | |em |0.0000|± |0.0000|
| | |em_with_normalize_gold&normalize_pred |0.0000|± |0.0000|
| | |em_with_type_exact_match |0.0000|± |0.0000|
| | |em_with_normalize_gold&normalize_pred&type_exact_match|0.2500|± |0.0304|
|helm:mmlu:high_school_world_history:3 | |em |0.0000|± |0.0000|
| | |em_with_normalize_gold&normalize_pred |0.0000|± |0.0000|
| | |em_with_type_exact_match |0.0000|± |0.0000|
| | |em_with_normalize_gold&normalize_pred&type_exact_match|0.2700|± |0.0289|
|helm:mmlu:human_aging:3 | |em |0.0000|± |0.0000|
| | |em_with_normalize_gold&normalize_pred |0.0000|± |0.0000|
| | |em_with_type_exact_match |0.0000|± |0.0000|
| | |em_with_normalize_gold&normalize_pred&type_exact_match|0.3139|± |0.0311|
|helm:mmlu:human_sexuality:3 | |em |0.0000|± |0.0000|
| | |em_with_normalize_gold&normalize_pred |0.0000|± |0.0000|
| | |em_with_type_exact_match |0.0000|± |0.0000|
| | |em_with_normalize_gold&normalize_pred&type_exact_match|0.2595|± |0.0384|
|helm:mmlu:international_law:3 | |em |0.0000|± |0.0000|
| | |em_with_normalize_gold&normalize_pred |0.0000|± |0.0000|
| | |em_with_type_exact_match |0.0000|± |0.0000|
| | |em_with_normalize_gold&normalize_pred&type_exact_match|0.2397|± |0.0390|
|helm:mmlu:jurisprudence:3 | |em |0.0000|± |0.0000|
| | |em_with_normalize_gold&normalize_pred |0.0000|± |0.0000|
| | |em_with_type_exact_match |0.0000|± |0.0000|
| | |em_with_normalize_gold&normalize_pred&type_exact_match|0.2593|± |0.0424|
|helm:mmlu:logical_fallacies:3 | |em |0.0000|± |0.0000|
| | |em_with_normalize_gold&normalize_pred |0.0000|± |0.0000|
| | |em_with_type_exact_match |0.0000|± |0.0000|
| | |em_with_normalize_gold&normalize_pred&type_exact_match|0.2209|± |0.0326|
|helm:mmlu:machine_learning:3 | |em |0.0000|± |0.0000|
| | |em_with_normalize_gold&normalize_pred |0.0000|± |0.0000|
| | |em_with_type_exact_match |0.0000|± |0.0000|
| | |em_with_normalize_gold&normalize_pred&type_exact_match|0.3125|± |0.0440|
|helm:mmlu:management:3 | |em |0.0000|± |0.0000|
| | |em_with_normalize_gold&normalize_pred |0.0000|± |0.0000|
| | |em_with_type_exact_match |0.0000|± |0.0000|
| | |em_with_normalize_gold&normalize_pred&type_exact_match|0.1748|± |0.0376|
|helm:mmlu:marketing:3 | |em |0.0000|± |0.0000|
| | |em_with_normalize_gold&normalize_pred |0.0000|± |0.0000|
| | |em_with_type_exact_match |0.0000|± |0.0000|
| | |em_with_normalize_gold&normalize_pred&type_exact_match|0.2906|± |0.0297|
|helm:mmlu:medical_genetics:3 | |em |0.0000|± |0.0000|
| | |em_with_normalize_gold&normalize_pred |0.0000|± |0.0000|
| | |em_with_type_exact_match |0.0000|± |0.0000|
| | |em_with_normalize_gold&normalize_pred&type_exact_match|0.3000|± |0.0461|
|helm:mmlu:miscellaneous:3 | |em |0.0000|± |0.0000|
| | |em_with_normalize_gold&normalize_pred |0.0000|± |0.0000|
| | |em_with_type_exact_match |0.0000|± |0.0000|
| | |em_with_normalize_gold&normalize_pred&type_exact_match|0.2375|± |0.0152|
|helm:mmlu:moral_disputes:3 | |em |0.0000|± |0.0000|
| | |em_with_normalize_gold&normalize_pred |0.0000|± |0.0000|
| | |em_with_type_exact_match |0.0000|± |0.0000|
| | |em_with_normalize_gold&normalize_pred&type_exact_match|0.2486|± |0.0233|
|helm:mmlu:moral_scenarios:3 | |em |0.0000|± |0.0000|
| | |em_with_normalize_gold&normalize_pred |0.0000|± |0.0000|
| | |em_with_type_exact_match |0.0000|± |0.0000|
| | |em_with_normalize_gold&normalize_pred&type_exact_match|0.2380|± |0.0142|
|helm:mmlu:nutrition:3 | |em |0.0000|± |0.0000|
| | |em_with_normalize_gold&normalize_pred |0.0000|± |0.0000|
| | |em_with_type_exact_match |0.0000|± |0.0000|
| | |em_with_normalize_gold&normalize_pred&type_exact_match|0.2255|± |0.0239|
|helm:mmlu:philosophy:3 | |em |0.0000|± |0.0000|
| | |em_with_normalize_gold&normalize_pred |0.0000|± |0.0000|
| | |em_with_type_exact_match |0.0000|± |0.0000|
| | |em_with_normalize_gold&normalize_pred&type_exact_match|0.1865|± |0.0221|
|helm:mmlu:prehistory:3 | |em |0.0000|± |0.0000|
| | |em_with_normalize_gold&normalize_pred |0.0000|± |0.0000|
| | |em_with_type_exact_match |0.0000|± |0.0000|
| | |em_with_normalize_gold&normalize_pred&type_exact_match|0.2160|± |0.0229|
|helm:mmlu:professional_accounting:3 | |em |0.0000|± |0.0000|
| | |em_with_normalize_gold&normalize_pred |0.0000|± |0.0000|
| | |em_with_type_exact_match |0.0000|± |0.0000|
| | |em_with_normalize_gold&normalize_pred&type_exact_match|0.2340|± |0.0253|
|helm:mmlu:professional_law:3 | |em |0.0000|± |0.0000|
| | |em_with_normalize_gold&normalize_pred |0.0000|± |0.0000|
| | |em_with_type_exact_match |0.0000|± |0.0000|
| | |em_with_normalize_gold&normalize_pred&type_exact_match|0.2458|± |0.0110|
|helm:mmlu:professional_medicine:3 | |em |0.0000|± |0.0000|
| | |em_with_normalize_gold&normalize_pred |0.0000|± |0.0000|
| | |em_with_type_exact_match |0.0000|± |0.0000|
| | |em_with_normalize_gold&normalize_pred&type_exact_match|0.1838|± |0.0235|
|helm:mmlu:professional_psychology:3 | |em |0.0000|± |0.0000|
| | |em_with_normalize_gold&normalize_pred |0.0000|± |0.0000|
| | |em_with_type_exact_match |0.0000|± |0.0000|
| | |em_with_normalize_gold&normalize_pred&type_exact_match|0.2500|± |0.0175|
|helm:mmlu:public_relations:3 | |em |0.0000|± |0.0000|
| | |em_with_normalize_gold&normalize_pred |0.0000|± |0.0000|
| | |em_with_type_exact_match |0.0000|± |0.0000|
| | |em_with_normalize_gold&normalize_pred&type_exact_match|0.2182|± |0.0396|
|helm:mmlu:security_studies:3 | |em |0.0000|± |0.0000|
| | |em_with_normalize_gold&normalize_pred |0.0000|± |0.0000|
| | |em_with_type_exact_match |0.0000|± |0.0000|
| | |em_with_normalize_gold&normalize_pred&type_exact_match|0.1878|± |0.0250|
|helm:mmlu:sociology:3 | |em |0.0000|± |0.0000|
| | |em_with_normalize_gold&normalize_pred |0.0000|± |0.0000|
| | |em_with_type_exact_match |0.0000|± |0.0000|
| | |em_with_normalize_gold&normalize_pred&type_exact_match|0.2438|± |0.0304|
|helm:mmlu:us_foreign_policy:3 | |em |0.0000|± |0.0000|
| | |em_with_normalize_gold&normalize_pred |0.0000|± |0.0000|
| | |em_with_type_exact_match |0.0000|± |0.0000|
| | |em_with_normalize_gold&normalize_pred&type_exact_match|0.2800|± |0.0451|
|helm:mmlu:virology:3 | |em |0.0000|± |0.0000|
| | |em_with_normalize_gold&normalize_pred |0.0000|± |0.0000|
| | |em_with_type_exact_match |0.0000|± |0.0000|
| | |em_with_normalize_gold&normalize_pred&type_exact_match|0.2831|± |0.0351|
|helm:mmlu:world_religions:3 | |em |0.0000|± |0.0000|
| | |em_with_normalize_gold&normalize_pred |0.0000|± |0.0000|
| | |em_with_type_exact_match |0.0000|± |0.0000|
| | |em_with_normalize_gold&normalize_pred&type_exact_match|0.3216|± |0.0358|
In my previous experiments, whether you take the pretrained or the instruct model of a given family, they give similar results (±5%), and loglikelihood vs HELM does not diverge that much either.