Reproducibility issue
Hello, I am trying to reproduce the results of gemma3-4b-it. Could you share the validation script, or more details on how to get similar results? Any example is welcome. Thank you.
Thank you! Indeed, we should add examples showing how to run evaluations on common datasets.
Hi @sorobedio,
You might find the following references helpful for reproducing the benchmark results. Please take a look:
- Official Gemma Paper: "Gemma: Open Models Based on Gemini Research and Technology"
- Hugging Face Leaderboard: for comparing your results with published benchmarks
- Eleuther AI Evaluation Harness: the framework used in the code
Thank you.
Thank you for your answer. I used this command:

```shell
lm_eval --model hf \
  --model_args pretrained=google/gemma-3-4b-it \
  --tasks winogrande \
  --device "cuda:0" \
  --num_fewshot 5 \
  --apply_chat_template \
  --batch_size 4 \
  --fewshot_as_multiturn
```
and got this error:

```
  File "ing_gemma3.py", line 889, in __init__
    self.model = Gemma3TextModel(config)
                 ^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/miniconda3/envs/llm/lib/python3.12/site-packages/transformers/models/gemma3/modeling_gemma3.py", line 622, in __init__
    self.vocab_size = config.vocab_size
                      ^^^^^^^^^^^^^^^^^
  File "/home/user/miniconda3/envs/llm/lib/python3.12/site-packages/transformers/configuration_utils.py", line 214, in __getattribute__
    return super().__getattribute__(key)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
AttributeError: 'Gemma3Config' object has no attribute 'vocab_size'
```

so I still do not know how you run the vLLM model on the text datasets with the lm-harness framework.
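For what it's worth, this `AttributeError` usually means the code received the multimodal `Gemma3Config`, which nests the text settings under `text_config`, where a text-only config (with a top-level `vocab_size`) was expected. A minimal sketch of that shape, using hypothetical stand-in dataclasses rather than the real transformers classes:

```python
from dataclasses import dataclass, field

# Hypothetical stand-ins for transformers' Gemma3TextConfig / Gemma3Config.
# Names and the vocab size value are illustrative, not the real API.

@dataclass
class TextConfig:
    vocab_size: int = 262_144  # illustrative value

@dataclass
class MultimodalConfig:
    # Text settings live one level down; there is no top-level vocab_size,
    # mirroring the AttributeError in the traceback above.
    text_config: TextConfig = field(default_factory=TextConfig)

config = MultimodalConfig()

# What the failing code path effectively does:
try:
    _ = config.vocab_size
except AttributeError as e:
    print(f"lookup failed: {e}")

# The nested lookup that works:
print(config.text_config.vocab_size)
```

If that is indeed the cause, upgrading `transformers` and `lm-eval` to versions that handle the nested Gemma 3 config (or loading via a text-only entry point) is typically the fix.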
@sorobedio I think if you refer to the Hugging Face docs you will find your answer. The platform has multiple benchmarking templates covering various models, including the Gemma 3 family. Hope that helps.
Could you please confirm whether this issue is resolved for you by the above comment? Please feel free to close the issue if it is resolved.
Thank you.
Is the actual inference code for the multimodal benchmarks available anywhere? Trying to run any visual benchmark yields very different results from the paper.