Reproducibility issue
Hello, I am trying to reproduce the results of gemma3-4b-it. Could you share the validation script, or more details on how to get similar results? Any example is welcome. Thank you.
Thank you! Indeed, we should add examples showing how to run evaluations on common datasets.
Hi @sorobedio,
You might find the following references helpful for reproducing the benchmark results. Please take a look:
- Official Gemma Paper: "Gemma: Open Models Based on Gemini Research and Technology"
- Hugging Face Leaderboard: for comparing your results with published benchmarks
- Eleuther AI Evaluation Harness: the framework used in the code
Thank you.
Thank you for your answer. I used this command:

```shell
lm_eval --model hf \
  --model_args pretrained=google/gemma-3-4b-it \
  --tasks winogrande \
  --device "cuda:0" \
  --num_fewshot 5 \
  --apply_chat_template \
  --batch_size 4 \
  --fewshot_as_multiturn
```
and got this error:

```
  File "ing_gemma3.py", line 889, in __init__
    self.model = Gemma3TextModel(config)
                 ^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/miniconda3/envs/llm/lib/python3.12/site-packages/transformers/models/gemma3/modeling_gemma3.py", line 622, in __init__
    self.vocab_size = config.vocab_size
                      ^^^^^^^^^^^^^^^^^
  File "/home/user/miniconda3/envs/llm/lib/python3.12/site-packages/transformers/configuration_utils.py", line 214, in __getattribute__
    return super().__getattribute__(key)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
AttributeError: 'Gemma3Config' object has no attribute 'vocab_size'
```

so I still do not know how you run the vLLM model on the text datasets with the lm-harness framework.
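For what it's worth, this `AttributeError` usually means the code received the multimodal `Gemma3Config`, which nests the text settings under `text_config`, where a text-only config (with a top-level `vocab_size`) was expected. A minimal sketch of that shape, using hypothetical stand-in dataclasses rather than the real transformers classes:

```python
from dataclasses import dataclass, field

# Hypothetical stand-ins for transformers' Gemma3TextConfig / Gemma3Config.
# Names and the vocab size value are illustrative, not the real API.

@dataclass
class TextConfig:
    vocab_size: int = 262_144  # illustrative value

@dataclass
class MultimodalConfig:
    # Text settings live one level down; there is no top-level vocab_size,
    # mirroring the AttributeError in the traceback above.
    text_config: TextConfig = field(default_factory=TextConfig)

config = MultimodalConfig()

# What the failing code path effectively does:
try:
    _ = config.vocab_size
except AttributeError as e:
    print(f"lookup failed: {e}")

# The nested lookup that works:
print(config.text_config.vocab_size)
```

If that is indeed the cause, upgrading `transformers` and `lm-eval` to versions that handle the nested Gemma 3 config (or loading via a text-only entry point) is typically the fix.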
@sorobedio I think if you refer to the Hugging Face docs you will find your answer. The platform has multiple benchmarking templates covering various models, including the Gemma 3 family. Hope that helps.
Could you please confirm whether this issue is resolved for you by the above comment? Please feel free to close the issue if it is resolved.
Thank you.
Is the actual inference code for the multimodal benchmarks available anywhere? Trying to run any visual benchmark yields very different results from the paper.