
[BUG] CUDA OOM with gpt-oss-20b

Open vince62s opened this issue 4 months ago • 10 comments

Describe the bug


lighteval accelerate "model_name=openai/gpt-oss-20b,max_length=4096,skip_special_tokens=False,batch_size=1" "leaderboard|mmlu|5" --save-details --output-dir "openai_scores"

goes OOM (lighteval installed from main, kernels 0.10.1, accelerate 1.6.0, transformers 4.56.1)

using an RTX 5090

can this be fixed with some other settings?
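A few knobs sometimes help on a single consumer card. This is a sketch, not a verified fix: it assumes the transformers backend accepts a `dtype` model arg (as documented for lighteval model arguments); `max_length` and `batch_size` are the same args used in the command above.

```shell
# Hedged sketch: load weights in bfloat16 and cap the context length to
# reduce activation/KV-cache memory. `dtype` is assumed to be accepted
# by lighteval's transformers backend.
lighteval accelerate \
  "model_name=openai/gpt-oss-20b,dtype=bfloat16,max_length=2048,batch_size=1" \
  "leaderboard|mmlu|5" --save-details --output-dir "openai_scores"
```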

vince62s avatar Sep 12 '25 15:09 vince62s

lighteval accelerate "model_name=openai/gpt-oss-20b,batch_size=1" "leaderboard|mmlu|3" --save-details --output-dir "openai_scores"

gives very bad results:

|                         Task                         |Version|Metric|Value |   |Stderr|
|------------------------------------------------------|-------|------|-----:|---|-----:|
|all                                                   |       |acc   |0.2683|±  |0.0329|
|leaderboard:mmlu:_average:3                           |       |acc   |0.2683|±  |0.0329|
|leaderboard:mmlu:abstract_algebra:3                   |       |acc   |0.3300|±  |0.0473|
|leaderboard:mmlu:anatomy:3                            |       |acc   |0.3481|±  |0.0412|
|leaderboard:mmlu:astronomy:3                          |       |acc   |0.3026|±  |0.0374|
|leaderboard:mmlu:business_ethics:3                    |       |acc   |0.2400|±  |0.0429|
|leaderboard:mmlu:clinical_knowledge:3                 |       |acc   |0.2566|±  |0.0269|
|leaderboard:mmlu:college_biology:3                    |       |acc   |0.2917|±  |0.0380|
|leaderboard:mmlu:college_chemistry:3                  |       |acc   |0.1600|±  |0.0368|
|leaderboard:mmlu:college_computer_science:3           |       |acc   |0.2400|±  |0.0429|
|leaderboard:mmlu:college_mathematics:3                |       |acc   |0.3100|±  |0.0465|
|leaderboard:mmlu:college_medicine:3                   |       |acc   |0.2832|±  |0.0344|
|leaderboard:mmlu:college_physics:3                    |       |acc   |0.1961|±  |0.0395|
|leaderboard:mmlu:computer_security:3                  |       |acc   |0.3200|±  |0.0469|
|leaderboard:mmlu:conceptual_physics:3                 |       |acc   |0.2638|±  |0.0288|
|leaderboard:mmlu:econometrics:3                       |       |acc   |0.2719|±  |0.0419|
|leaderboard:mmlu:electrical_engineering:3             |       |acc   |0.2759|±  |0.0372|
|leaderboard:mmlu:elementary_mathematics:3             |       |acc   |0.2831|±  |0.0232|
|leaderboard:mmlu:formal_logic:3                       |       |acc   |0.1587|±  |0.0327|
|leaderboard:mmlu:global_facts:3                       |       |acc   |0.3000|±  |0.0461|
|leaderboard:mmlu:high_school_biology:3                |       |acc   |0.2774|±  |0.0255|
|leaderboard:mmlu:high_school_chemistry:3              |       |acc   |0.3005|±  |0.0323|
|leaderboard:mmlu:high_school_computer_science:3       |       |acc   |0.2800|±  |0.0451|
|leaderboard:mmlu:high_school_european_history:3       |       |acc   |0.2848|±  |0.0352|
|leaderboard:mmlu:high_school_geography:3              |       |acc   |0.2980|±  |0.0326|
|leaderboard:mmlu:high_school_government_and_politics:3|       |acc   |0.3212|±  |0.0337|
|leaderboard:mmlu:high_school_macroeconomics:3         |       |acc   |0.2128|±  |0.0208|
|leaderboard:mmlu:high_school_mathematics:3            |       |acc   |0.2444|±  |0.0262|
|leaderboard:mmlu:high_school_microeconomics:3         |       |acc   |0.2143|±  |0.0267|
|leaderboard:mmlu:high_school_physics:3                |       |acc   |0.2517|±  |0.0354|
|leaderboard:mmlu:high_school_psychology:3             |       |acc   |0.3138|±  |0.0199|
|leaderboard:mmlu:high_school_statistics:3             |       |acc   |0.2083|±  |0.0277|
|leaderboard:mmlu:high_school_us_history:3             |       |acc   |0.2598|±  |0.0308|
|leaderboard:mmlu:high_school_world_history:3          |       |acc   |0.2700|±  |0.0289|
|leaderboard:mmlu:human_aging:3                        |       |acc   |0.3184|±  |0.0313|
|leaderboard:mmlu:human_sexuality:3                    |       |acc   |0.2824|±  |0.0395|
|leaderboard:mmlu:international_law:3                  |       |acc   |0.3058|±  |0.0421|
|leaderboard:mmlu:jurisprudence:3                      |       |acc   |0.2130|±  |0.0396|
|leaderboard:mmlu:logical_fallacies:3                  |       |acc   |0.3006|±  |0.0360|
|leaderboard:mmlu:machine_learning:3                   |       |acc   |0.2679|±  |0.0420|
|leaderboard:mmlu:management:3                         |       |acc   |0.1845|±  |0.0384|
|leaderboard:mmlu:marketing:3                          |       |acc   |0.3077|±  |0.0302|
|leaderboard:mmlu:medical_genetics:3                   |       |acc   |0.2300|±  |0.0423|
|leaderboard:mmlu:miscellaneous:3                      |       |acc   |0.3001|±  |0.0164|
|leaderboard:mmlu:moral_disputes:3                     |       |acc   |0.3092|±  |0.0249|
|leaderboard:mmlu:moral_scenarios:3                    |       |acc   |0.2559|±  |0.0146|
|leaderboard:mmlu:nutrition:3                          |       |acc   |0.2549|±  |0.0250|
|leaderboard:mmlu:philosophy:3                         |       |acc   |0.3119|±  |0.0263|
|leaderboard:mmlu:prehistory:3                         |       |acc   |0.2901|±  |0.0253|
|leaderboard:mmlu:professional_accounting:3            |       |acc   |0.3227|±  |0.0279|
|leaderboard:mmlu:professional_law:3                   |       |acc   |0.2699|±  |0.0113|
|leaderboard:mmlu:professional_medicine:3              |       |acc   |0.1654|±  |0.0226|
|leaderboard:mmlu:professional_psychology:3            |       |acc   |0.2941|±  |0.0184|
|leaderboard:mmlu:public_relations:3                   |       |acc   |0.2091|±  |0.0390|
|leaderboard:mmlu:security_studies:3                   |       |acc   |0.2367|±  |0.0272|
|leaderboard:mmlu:sociology:3                          |       |acc   |0.2736|±  |0.0315|
|leaderboard:mmlu:us_foreign_policy:3                  |       |acc   |0.2800|±  |0.0451|
|leaderboard:mmlu:virology:3                           |       |acc   |0.2530|±  |0.0338|
|leaderboard:mmlu:world_religions:3                    |       |acc   |0.2865|±  |0.0347|

vince62s avatar Sep 13 '25 10:09 vince62s

hey! You should look at the details to see why the scores are low. Here is how: https://huggingface.co/docs/lighteval/main/en/saving-and-reading-results

NathanHB avatar Sep 15 '25 08:09 NathanHB

Not sure what I should look at, but this simple command line is supposed to produce a score close to the 5-shot result of > 80, I guess. Am I missing something?

results_2025-09-13T12-43-44.973710.json

vince62s avatar Sep 15 '25 08:09 vince62s

You need to look at the detailed sample-by-sample results to see if there is an issue with the evals. Low scores can be attributed to many things (wrong parameters is the most common). All of this is explained in the docs.

Side note: please copy-paste the relevant information into your message, as I will not download files and read them on my machine. Thanks!

NathanHB avatar Sep 15 '25 09:09 NathanHB

Well, if you ran MMLU on gpt-oss-20b with lighteval, I would be delighted to see the scores.

vince62s avatar Sep 18 '25 10:09 vince62s

That is what I am getting as well with lm_eval for MMLU. I wonder which MMLU setup OpenAI tested with. Was it with tools?

tomtyiu avatar Sep 18 '25 18:09 tomtyiu

85.3 https://openai.com/fr-FR/index/introducing-gpt-oss/

vince62s avatar Sep 18 '25 18:09 vince62s

You might want to use the helm version of mmlu: you are using loglikelihood evaluation on an instruct model. I'm pretty sure they used generative evaluation with multiple sampling.
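The pattern in the helm table below (raw `em` and normalized `em` at 0.0000, with only the most permissive variant nonzero) is typical of exact match on generative output: the model emits scaffolding around the answer rather than the bare gold string. A minimal illustration, with a hypothetical `normalize` that only approximates the normalization behind the `em_with_normalize_*` metrics:

```python
import string

def normalize(text: str) -> str:
    # Lowercase, drop punctuation, collapse whitespace -- a rough stand-in
    # for gold/prediction normalization in exact-match metrics.
    text = text.lower().translate(str.maketrans("", "", string.punctuation))
    return " ".join(text.split())

def exact_match(pred: str, gold: str, normalized: bool = False) -> bool:
    if normalized:
        pred, gold = normalize(pred), normalize(gold)
    return pred == gold

# Raw EM fails because of the "Answer:" scaffolding and trailing period:
exact_match("Answer: B.", "B")                   # False
# Even normalized EM fails until the answer letter itself is extracted:
exact_match("Answer: B.", "B", normalized=True)  # False ("answer b" != "b")
exact_match("B.", "B", normalized=True)          # True
```

This would be consistent with only `em_with_normalize_gold&normalize_pred&type_exact_match` scoring above zero: presumably that variant also strips the surrounding scaffolding before comparing, though the exact behavior should be checked against the lighteval metric implementation.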

NathanHB avatar Sep 19 '25 10:09 NathanHB

lighteval accelerate "model_name=openai/gpt-oss-20b,batch_size=1" "helm|mmlu|3" --save-details --output-dir "openai_scores"

|                     Task                      |Version|                        Metric                        |Value |   |Stderr|
|-----------------------------------------------|-------|------------------------------------------------------|-----:|---|-----:|
|all                                            |       |em                                                    |0.0000|±  |0.0000|
|                                               |       |em_with_normalize_gold&normalize_pred                 |0.0000|±  |0.0000|
|                                               |       |em_with_type_exact_match                              |0.0000|±  |0.0000|
|                                               |       |em_with_normalize_gold&normalize_pred&type_exact_match|0.2312|±  |0.0315|
|helm:mmlu:_average:3                           |       |em                                                    |0.0000|±  |0.0000|
|                                               |       |em_with_normalize_gold&normalize_pred                 |0.0000|±  |0.0000|
|                                               |       |em_with_type_exact_match                              |0.0000|±  |0.0000|
|                                               |       |em_with_normalize_gold&normalize_pred&type_exact_match|0.2312|±  |0.0315|
|helm:mmlu:abstract_algebra:3                   |       |em                                                    |0.0000|±  |0.0000|
|                                               |       |em_with_normalize_gold&normalize_pred                 |0.0000|±  |0.0000|
|                                               |       |em_with_type_exact_match                              |0.0000|±  |0.0000|
|                                               |       |em_with_normalize_gold&normalize_pred&type_exact_match|0.2200|±  |0.0416|
|helm:mmlu:anatomy:3                            |       |em                                                    |0.0000|±  |0.0000|
|                                               |       |em_with_normalize_gold&normalize_pred                 |0.0000|±  |0.0000|
|                                               |       |em_with_type_exact_match                              |0.0000|±  |0.0000|
|                                               |       |em_with_normalize_gold&normalize_pred&type_exact_match|0.1852|±  |0.0336|
|helm:mmlu:astronomy:3                          |       |em                                                    |0.0000|±  |0.0000|
|                                               |       |em_with_normalize_gold&normalize_pred                 |0.0000|±  |0.0000|
|                                               |       |em_with_type_exact_match                              |0.0000|±  |0.0000|
|                                               |       |em_with_normalize_gold&normalize_pred&type_exact_match|0.1776|±  |0.0311|
|helm:mmlu:business_ethics:3                    |       |em                                                    |0.0000|±  |0.0000|
|                                               |       |em_with_normalize_gold&normalize_pred                 |0.0000|±  |0.0000|
|                                               |       |em_with_type_exact_match                              |0.0000|±  |0.0000|
|                                               |       |em_with_normalize_gold&normalize_pred&type_exact_match|0.3000|±  |0.0461|
|helm:mmlu:clinical_knowledge:3                 |       |em                                                    |0.0000|±  |0.0000|
|                                               |       |em_with_normalize_gold&normalize_pred                 |0.0000|±  |0.0000|
|                                               |       |em_with_type_exact_match                              |0.0000|±  |0.0000|
|                                               |       |em_with_normalize_gold&normalize_pred&type_exact_match|0.2151|±  |0.0253|
|helm:mmlu:college_biology:3                    |       |em                                                    |0.0000|±  |0.0000|
|                                               |       |em_with_normalize_gold&normalize_pred                 |0.0000|±  |0.0000|
|                                               |       |em_with_type_exact_match                              |0.0000|±  |0.0000|
|                                               |       |em_with_normalize_gold&normalize_pred&type_exact_match|0.2569|±  |0.0365|
|helm:mmlu:college_chemistry:3                  |       |em                                                    |0.0000|±  |0.0000|
|                                               |       |em_with_normalize_gold&normalize_pred                 |0.0000|±  |0.0000|
|                                               |       |em_with_type_exact_match                              |0.0000|±  |0.0000|
|                                               |       |em_with_normalize_gold&normalize_pred&type_exact_match|0.2000|±  |0.0402|
|helm:mmlu:college_computer_science:3           |       |em                                                    |0.0000|±  |0.0000|
|                                               |       |em_with_normalize_gold&normalize_pred                 |0.0000|±  |0.0000|
|                                               |       |em_with_type_exact_match                              |0.0000|±  |0.0000|
|                                               |       |em_with_normalize_gold&normalize_pred&type_exact_match|0.2600|±  |0.0441|
|helm:mmlu:college_mathematics:3                |       |em                                                    |0.0000|±  |0.0000|
|                                               |       |em_with_normalize_gold&normalize_pred                 |0.0000|±  |0.0000|
|                                               |       |em_with_type_exact_match                              |0.0000|±  |0.0000|
|                                               |       |em_with_normalize_gold&normalize_pred&type_exact_match|0.2100|±  |0.0409|
|helm:mmlu:college_medicine:3                   |       |em                                                    |0.0000|±  |0.0000|
|                                               |       |em_with_normalize_gold&normalize_pred                 |0.0000|±  |0.0000|
|                                               |       |em_with_type_exact_match                              |0.0000|±  |0.0000|
|                                               |       |em_with_normalize_gold&normalize_pred&type_exact_match|0.2081|±  |0.0310|
|helm:mmlu:college_physics:3                    |       |em                                                    |0.0000|±  |0.0000|
|                                               |       |em_with_normalize_gold&normalize_pred                 |0.0000|±  |0.0000|
|                                               |       |em_with_type_exact_match                              |0.0000|±  |0.0000|
|                                               |       |em_with_normalize_gold&normalize_pred&type_exact_match|0.2157|±  |0.0409|
|helm:mmlu:computer_security:3                  |       |em                                                    |0.0000|±  |0.0000|
|                                               |       |em_with_normalize_gold&normalize_pred                 |0.0000|±  |0.0000|
|                                               |       |em_with_type_exact_match                              |0.0000|±  |0.0000|
|                                               |       |em_with_normalize_gold&normalize_pred&type_exact_match|0.2800|±  |0.0451|
|helm:mmlu:conceptual_physics:3                 |       |em                                                    |0.0000|±  |0.0000|
|                                               |       |em_with_normalize_gold&normalize_pred                 |0.0000|±  |0.0000|
|                                               |       |em_with_type_exact_match                              |0.0000|±  |0.0000|
|                                               |       |em_with_normalize_gold&normalize_pred&type_exact_match|0.2638|±  |0.0288|
|helm:mmlu:econometrics:3                       |       |em                                                    |0.0000|±  |0.0000|
|                                               |       |em_with_normalize_gold&normalize_pred                 |0.0000|±  |0.0000|
|                                               |       |em_with_type_exact_match                              |0.0000|±  |0.0000|
|                                               |       |em_with_normalize_gold&normalize_pred&type_exact_match|0.2368|±  |0.0400|
|helm:mmlu:electrical_engineering:3             |       |em                                                    |0.0000|±  |0.0000|
|                                               |       |em_with_normalize_gold&normalize_pred                 |0.0000|±  |0.0000|
|                                               |       |em_with_type_exact_match                              |0.0000|±  |0.0000|
|                                               |       |em_with_normalize_gold&normalize_pred&type_exact_match|0.2414|±  |0.0357|
|helm:mmlu:elementary_mathematics:3             |       |em                                                    |0.0000|±  |0.0000|
|                                               |       |em_with_normalize_gold&normalize_pred                 |0.0000|±  |0.0000|
|                                               |       |em_with_type_exact_match                              |0.0000|±  |0.0000|
|                                               |       |em_with_normalize_gold&normalize_pred&type_exact_match|0.2090|±  |0.0209|
|helm:mmlu:formal_logic:3                       |       |em                                                    |0.0000|±  |0.0000|
|                                               |       |em_with_normalize_gold&normalize_pred                 |0.0000|±  |0.0000|
|                                               |       |em_with_type_exact_match                              |0.0000|±  |0.0000|
|                                               |       |em_with_normalize_gold&normalize_pred&type_exact_match|0.2857|±  |0.0404|
|helm:mmlu:global_facts:3                       |       |em                                                    |0.0000|±  |0.0000|
|                                               |       |em_with_normalize_gold&normalize_pred                 |0.0000|±  |0.0000|
|                                               |       |em_with_type_exact_match                              |0.0000|±  |0.0000|
|                                               |       |em_with_normalize_gold&normalize_pred&type_exact_match|0.1800|±  |0.0386|
|helm:mmlu:high_school_biology:3                |       |em                                                    |0.0000|±  |0.0000|
|                                               |       |em_with_normalize_gold&normalize_pred                 |0.0000|±  |0.0000|
|                                               |       |em_with_type_exact_match                              |0.0000|±  |0.0000|
|                                               |       |em_with_normalize_gold&normalize_pred&type_exact_match|0.1774|±  |0.0217|
|helm:mmlu:high_school_chemistry:3              |       |em                                                    |0.0000|±  |0.0000|
|                                               |       |em_with_normalize_gold&normalize_pred                 |0.0000|±  |0.0000|
|                                               |       |em_with_type_exact_match                              |0.0000|±  |0.0000|
|                                               |       |em_with_normalize_gold&normalize_pred&type_exact_match|0.1527|±  |0.0253|
|helm:mmlu:high_school_computer_science:3       |       |em                                                    |0.0000|±  |0.0000|
|                                               |       |em_with_normalize_gold&normalize_pred                 |0.0000|±  |0.0000|
|                                               |       |em_with_type_exact_match                              |0.0000|±  |0.0000|
|                                               |       |em_with_normalize_gold&normalize_pred&type_exact_match|0.2500|±  |0.0435|
|helm:mmlu:high_school_european_history:3       |       |em                                                    |0.0000|±  |0.0000|
|                                               |       |em_with_normalize_gold&normalize_pred                 |0.0000|±  |0.0000|
|                                               |       |em_with_type_exact_match                              |0.0000|±  |0.0000|
|                                               |       |em_with_normalize_gold&normalize_pred&type_exact_match|0.2182|±  |0.0323|
|helm:mmlu:high_school_geography:3              |       |em                                                    |0.0000|±  |0.0000|
|                                               |       |em_with_normalize_gold&normalize_pred                 |0.0000|±  |0.0000|
|                                               |       |em_with_type_exact_match                              |0.0000|±  |0.0000|
|                                               |       |em_with_normalize_gold&normalize_pred&type_exact_match|0.1768|±  |0.0272|
|helm:mmlu:high_school_government_and_politics:3|       |em                                                    |0.0000|±  |0.0000|
|                                               |       |em_with_normalize_gold&normalize_pred                 |0.0000|±  |0.0000|
|                                               |       |em_with_type_exact_match                              |0.0000|±  |0.0000|
|                                               |       |em_with_normalize_gold&normalize_pred&type_exact_match|0.1969|±  |0.0287|
|helm:mmlu:high_school_macroeconomics:3         |       |em                                                    |0.0000|±  |0.0000|
|                                               |       |em_with_normalize_gold&normalize_pred                 |0.0000|±  |0.0000|
|                                               |       |em_with_type_exact_match                              |0.0000|±  |0.0000|
|                                               |       |em_with_normalize_gold&normalize_pred&type_exact_match|0.2026|±  |0.0204|
|helm:mmlu:high_school_mathematics:3            |       |em                                                    |0.0000|±  |0.0000|
|                                               |       |em_with_normalize_gold&normalize_pred                 |0.0000|±  |0.0000|
|                                               |       |em_with_type_exact_match                              |0.0000|±  |0.0000|
|                                               |       |em_with_normalize_gold&normalize_pred&type_exact_match|0.2111|±  |0.0249|
|helm:mmlu:high_school_microeconomics:3         |       |em                                                    |0.0000|±  |0.0000|
|                                               |       |em_with_normalize_gold&normalize_pred                 |0.0000|±  |0.0000|
|                                               |       |em_with_type_exact_match                              |0.0000|±  |0.0000|
|                                               |       |em_with_normalize_gold&normalize_pred&type_exact_match|0.2101|±  |0.0265|
|helm:mmlu:high_school_physics:3                |       |em                                                    |0.0000|±  |0.0000|
|                                               |       |em_with_normalize_gold&normalize_pred                 |0.0000|±  |0.0000|
|                                               |       |em_with_type_exact_match                              |0.0000|±  |0.0000|
|                                               |       |em_with_normalize_gold&normalize_pred&type_exact_match|0.1987|±  |0.0326|
|helm:mmlu:high_school_psychology:3             |       |em                                                    |0.0000|±  |0.0000|
|                                               |       |em_with_normalize_gold&normalize_pred                 |0.0000|±  |0.0000|
|                                               |       |em_with_type_exact_match                              |0.0000|±  |0.0000|
|                                               |       |em_with_normalize_gold&normalize_pred&type_exact_match|0.1927|±  |0.0169|
|helm:mmlu:high_school_statistics:3             |       |em                                                    |0.0000|±  |0.0000|
|                                               |       |em_with_normalize_gold&normalize_pred                 |0.0000|±  |0.0000|
|                                               |       |em_with_type_exact_match                              |0.0000|±  |0.0000|
|                                               |       |em_with_normalize_gold&normalize_pred&type_exact_match|0.1528|±  |0.0245|
|helm:mmlu:high_school_us_history:3             |       |em                                                    |0.0000|±  |0.0000|
|                                               |       |em_with_normalize_gold&normalize_pred                 |0.0000|±  |0.0000|
|                                               |       |em_with_type_exact_match                              |0.0000|±  |0.0000|
|                                               |       |em_with_normalize_gold&normalize_pred&type_exact_match|0.2500|±  |0.0304|
|helm:mmlu:high_school_world_history:3          |       |em                                                    |0.0000|±  |0.0000|
|                                               |       |em_with_normalize_gold&normalize_pred                 |0.0000|±  |0.0000|
|                                               |       |em_with_type_exact_match                              |0.0000|±  |0.0000|
|                                               |       |em_with_normalize_gold&normalize_pred&type_exact_match|0.2700|±  |0.0289|
|helm:mmlu:human_aging:3                        |       |em                                                    |0.0000|±  |0.0000|
|                                               |       |em_with_normalize_gold&normalize_pred                 |0.0000|±  |0.0000|
|                                               |       |em_with_type_exact_match                              |0.0000|±  |0.0000|
|                                               |       |em_with_normalize_gold&normalize_pred&type_exact_match|0.3139|±  |0.0311|
|helm:mmlu:human_sexuality:3                    |       |em                                                    |0.0000|±  |0.0000|
|                                               |       |em_with_normalize_gold&normalize_pred                 |0.0000|±  |0.0000|
|                                               |       |em_with_type_exact_match                              |0.0000|±  |0.0000|
|                                               |       |em_with_normalize_gold&normalize_pred&type_exact_match|0.2595|±  |0.0384|
|helm:mmlu:international_law:3                  |       |em                                                    |0.0000|±  |0.0000|
|                                               |       |em_with_normalize_gold&normalize_pred                 |0.0000|±  |0.0000|
|                                               |       |em_with_type_exact_match                              |0.0000|±  |0.0000|
|                                               |       |em_with_normalize_gold&normalize_pred&type_exact_match|0.2397|±  |0.0390|
|helm:mmlu:jurisprudence:3                      |       |em                                                    |0.0000|±  |0.0000|
|                                               |       |em_with_normalize_gold&normalize_pred                 |0.0000|±  |0.0000|
|                                               |       |em_with_type_exact_match                              |0.0000|±  |0.0000|
|                                               |       |em_with_normalize_gold&normalize_pred&type_exact_match|0.2593|±  |0.0424|
|helm:mmlu:logical_fallacies:3                  |       |em                                                    |0.0000|±  |0.0000|
|                                               |       |em_with_normalize_gold&normalize_pred                 |0.0000|±  |0.0000|
|                                               |       |em_with_type_exact_match                              |0.0000|±  |0.0000|
|                                               |       |em_with_normalize_gold&normalize_pred&type_exact_match|0.2209|±  |0.0326|
|helm:mmlu:machine_learning:3                   |       |em                                                    |0.0000|±  |0.0000|
|                                               |       |em_with_normalize_gold&normalize_pred                 |0.0000|±  |0.0000|
|                                               |       |em_with_type_exact_match                              |0.0000|±  |0.0000|
|                                               |       |em_with_normalize_gold&normalize_pred&type_exact_match|0.3125|±  |0.0440|
|helm:mmlu:management:3                         |       |em                                                    |0.0000|±  |0.0000|
|                                               |       |em_with_normalize_gold&normalize_pred                 |0.0000|±  |0.0000|
|                                               |       |em_with_type_exact_match                              |0.0000|±  |0.0000|
|                                               |       |em_with_normalize_gold&normalize_pred&type_exact_match|0.1748|±  |0.0376|
|helm:mmlu:marketing:3                          |       |em                                                    |0.0000|±  |0.0000|
|                                               |       |em_with_normalize_gold&normalize_pred                 |0.0000|±  |0.0000|
|                                               |       |em_with_type_exact_match                              |0.0000|±  |0.0000|
|                                               |       |em_with_normalize_gold&normalize_pred&type_exact_match|0.2906|±  |0.0297|
|helm:mmlu:medical_genetics:3                   |       |em                                                    |0.0000|±  |0.0000|
|                                               |       |em_with_normalize_gold&normalize_pred                 |0.0000|±  |0.0000|
|                                               |       |em_with_type_exact_match                              |0.0000|±  |0.0000|
|                                               |       |em_with_normalize_gold&normalize_pred&type_exact_match|0.3000|±  |0.0461|
|helm:mmlu:miscellaneous:3                      |       |em                                                    |0.0000|±  |0.0000|
|                                               |       |em_with_normalize_gold&normalize_pred                 |0.0000|±  |0.0000|
|                                               |       |em_with_type_exact_match                              |0.0000|±  |0.0000|
|                                               |       |em_with_normalize_gold&normalize_pred&type_exact_match|0.2375|±  |0.0152|
|helm:mmlu:moral_disputes:3                     |       |em                                                    |0.0000|±  |0.0000|
|                                               |       |em_with_normalize_gold&normalize_pred                 |0.0000|±  |0.0000|
|                                               |       |em_with_type_exact_match                              |0.0000|±  |0.0000|
|                                               |       |em_with_normalize_gold&normalize_pred&type_exact_match|0.2486|±  |0.0233|
|helm:mmlu:moral_scenarios:3                    |       |em                                                    |0.0000|±  |0.0000|
|                                               |       |em_with_normalize_gold&normalize_pred                 |0.0000|±  |0.0000|
|                                               |       |em_with_type_exact_match                              |0.0000|±  |0.0000|
|                                               |       |em_with_normalize_gold&normalize_pred&type_exact_match|0.2380|±  |0.0142|
|helm:mmlu:nutrition:3                          |       |em                                                    |0.0000|±  |0.0000|
|                                               |       |em_with_normalize_gold&normalize_pred                 |0.0000|±  |0.0000|
|                                               |       |em_with_type_exact_match                              |0.0000|±  |0.0000|
|                                               |       |em_with_normalize_gold&normalize_pred&type_exact_match|0.2255|±  |0.0239|
|helm:mmlu:philosophy:3                         |       |em                                                    |0.0000|±  |0.0000|
|                                               |       |em_with_normalize_gold&normalize_pred                 |0.0000|±  |0.0000|
|                                               |       |em_with_type_exact_match                              |0.0000|±  |0.0000|
|                                               |       |em_with_normalize_gold&normalize_pred&type_exact_match|0.1865|±  |0.0221|
|helm:mmlu:prehistory:3                         |       |em                                                    |0.0000|±  |0.0000|
|                                               |       |em_with_normalize_gold&normalize_pred                 |0.0000|±  |0.0000|
|                                               |       |em_with_type_exact_match                              |0.0000|±  |0.0000|
|                                               |       |em_with_normalize_gold&normalize_pred&type_exact_match|0.2160|±  |0.0229|
|helm:mmlu:professional_accounting:3            |       |em                                                    |0.0000|±  |0.0000|
|                                               |       |em_with_normalize_gold&normalize_pred                 |0.0000|±  |0.0000|
|                                               |       |em_with_type_exact_match                              |0.0000|±  |0.0000|
|                                               |       |em_with_normalize_gold&normalize_pred&type_exact_match|0.2340|±  |0.0253|
|helm:mmlu:professional_law:3                   |       |em                                                    |0.0000|±  |0.0000|
|                                               |       |em_with_normalize_gold&normalize_pred                 |0.0000|±  |0.0000|
|                                               |       |em_with_type_exact_match                              |0.0000|±  |0.0000|
|                                               |       |em_with_normalize_gold&normalize_pred&type_exact_match|0.2458|±  |0.0110|
|helm:mmlu:professional_medicine:3              |       |em                                                    |0.0000|±  |0.0000|
|                                               |       |em_with_normalize_gold&normalize_pred                 |0.0000|±  |0.0000|
|                                               |       |em_with_type_exact_match                              |0.0000|±  |0.0000|
|                                               |       |em_with_normalize_gold&normalize_pred&type_exact_match|0.1838|±  |0.0235|
|helm:mmlu:professional_psychology:3            |       |em                                                    |0.0000|±  |0.0000|
|                                               |       |em_with_normalize_gold&normalize_pred                 |0.0000|±  |0.0000|
|                                               |       |em_with_type_exact_match                              |0.0000|±  |0.0000|
|                                               |       |em_with_normalize_gold&normalize_pred&type_exact_match|0.2500|±  |0.0175|
|helm:mmlu:public_relations:3                   |       |em                                                    |0.0000|±  |0.0000|
|                                               |       |em_with_normalize_gold&normalize_pred                 |0.0000|±  |0.0000|
|                                               |       |em_with_type_exact_match                              |0.0000|±  |0.0000|
|                                               |       |em_with_normalize_gold&normalize_pred&type_exact_match|0.2182|±  |0.0396|
|helm:mmlu:security_studies:3                   |       |em                                                    |0.0000|±  |0.0000|
|                                               |       |em_with_normalize_gold&normalize_pred                 |0.0000|±  |0.0000|
|                                               |       |em_with_type_exact_match                              |0.0000|±  |0.0000|
|                                               |       |em_with_normalize_gold&normalize_pred&type_exact_match|0.1878|±  |0.0250|
|helm:mmlu:sociology:3                          |       |em                                                    |0.0000|±  |0.0000|
|                                               |       |em_with_normalize_gold&normalize_pred                 |0.0000|±  |0.0000|
|                                               |       |em_with_type_exact_match                              |0.0000|±  |0.0000|
|                                               |       |em_with_normalize_gold&normalize_pred&type_exact_match|0.2438|±  |0.0304|
|helm:mmlu:us_foreign_policy:3                  |       |em                                                    |0.0000|±  |0.0000|
|                                               |       |em_with_normalize_gold&normalize_pred                 |0.0000|±  |0.0000|
|                                               |       |em_with_type_exact_match                              |0.0000|±  |0.0000|
|                                               |       |em_with_normalize_gold&normalize_pred&type_exact_match|0.2800|±  |0.0451|
|helm:mmlu:virology:3                           |       |em                                                    |0.0000|±  |0.0000|
|                                               |       |em_with_normalize_gold&normalize_pred                 |0.0000|±  |0.0000|
|                                               |       |em_with_type_exact_match                              |0.0000|±  |0.0000|
|                                               |       |em_with_normalize_gold&normalize_pred&type_exact_match|0.2831|±  |0.0351|
|helm:mmlu:world_religions:3                    |       |em                                                    |0.0000|±  |0.0000|
|                                               |       |em_with_normalize_gold&normalize_pred                 |0.0000|±  |0.0000|
|                                               |       |em_with_type_exact_match                              |0.0000|±  |0.0000|
|                                               |       |em_with_normalize_gold&normalize_pred&type_exact_match|0.3216|±  |0.0358|

vince62s avatar Sep 19 '25 12:09 vince62s

In my previous experiments, whether you take the pretrained or the instruct model of a given class, the results are similar (±5%), and the log-likelihood scores do not diverge that much from the HELM (generative) ones.
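The pattern in the table above (raw `em` at 0.0000 everywhere, but `em_with_normalize_gold&normalize_pred&type_exact_match` near chance level) suggests the raw predictions never match the gold strings character-for-character, and only start matching once both sides are normalized. A minimal sketch of what such normalization might do (hypothetical example strings; this is not lighteval's actual metric code):

```python
# Hypothetical illustration of raw exact match vs. normalized exact match.
# Shows why raw em can be 0.0 while the normalized variant is non-zero.

def normalize(text: str) -> str:
    # Strip surrounding whitespace and trailing periods, then lowercase.
    return text.strip().rstrip(".").lower()

gold = "B. Paris"
pred = " b. paris.\n"  # model output with extra whitespace/punctuation/case

raw_em = float(pred == gold)                         # strings differ -> 0.0
norm_em = float(normalize(pred) == normalize(gold))  # match after cleanup -> 1.0

print(raw_em, norm_em)  # -> 0.0 1.0
```

If the raw generations carry extra tokens (e.g. chat-template artifacts or reasoning prefixes from gpt-oss-20b), that would explain raw `em` collapsing to zero while the normalized score sits at ~0.25, i.e. random-choice level on 4-way MMLU.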

vince62s avatar Sep 19 '25 12:09 vince62s