[Demo] Benchmark Suite for Evaluating Gemma Models
Hey Gemma community and developers!
I'm happy to share a project I've been pouring effort into, partly as a showcase for my Google Summer of Code (GSoC) application: an open-source benchmarking suite for Gemma models! I really hope it can be a helpful tool for all of us exploring these models.
What's the idea?
My main goal was to create an easy-to-use tool to systematically check how Gemma models (different sizes, variants) perform on common academic tasks and see how they stack up against others like Llama 2 or Mistral.
What's inside?
- The Core Engine: A Python framework handling model/dataset loading, running tests, and gathering results.
- Benchmark Tasks: Implementations for key tests like MMLU (knowledge/reasoning) and efficiency checks (speed/memory). Structured for adding more!
- Easy Comparisons: Designed to compare different Gemma versions and other open models side-by-side.
- Helper Scripts: Command-line tools for downloading datasets and running benchmarks.
- Visualizations: Auto-generates charts (heatmaps, comparisons, efficiency graphs) using Matplotlib.
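To make the engine and task pieces above a bit more concrete, here is a minimal sketch of how an MMLU-style multiple-choice task plus a simple speed check could run against a stand-in model. Every name in it (`Question`, `MockModel`, `format_prompt`, `run_multiple_choice`) is hypothetical and chosen for illustration; it is not taken from the repository and the actual code may be organized differently.

```python
import time
from dataclasses import dataclass


@dataclass
class Question:
    """One multiple-choice item (hypothetical structure, MMLU-like)."""
    prompt: str
    choices: list[str]
    answer: str  # correct letter, "A" through "D"


class MockModel:
    """Stand-in model that always replies with the same letter."""

    def __init__(self, fixed_answer: str = "A"):
        self.fixed_answer = fixed_answer

    def generate(self, prompt: str) -> str:
        return self.fixed_answer


def format_prompt(q: Question) -> str:
    """Render the question and lettered choices as a single prompt."""
    letters = "ABCD"
    lines = [q.prompt]
    lines += [f"{letters[i]}. {choice}" for i, choice in enumerate(q.choices)]
    lines.append("Answer:")
    return "\n".join(lines)


def run_multiple_choice(model, questions: list[Question]) -> dict:
    """Return accuracy and average wall-clock seconds per question."""
    if not questions:
        raise ValueError("questions must not be empty")
    correct = 0
    start = time.perf_counter()
    for q in questions:
        reply = model.generate(format_prompt(q)).strip().upper()
        # Count the item as correct if the reply starts with the right letter.
        if reply[:1] == q.answer:
            correct += 1
    elapsed = time.perf_counter() - start
    return {
        "accuracy": correct / len(questions),
        "seconds_per_question": elapsed / len(questions),
    }


questions = [
    Question("2 + 2 = ?", ["4", "3", "5", "22"], "A"),
    Question("Capital of France?", ["Rome", "Paris", "Madrid", "Berlin"], "B"),
]
results = run_multiple_choice(MockModel("A"), questions)
print(results["accuracy"])  # 0.5: the mock gets the first item right, the second wrong
```

Because the runner only relies on a `generate(prompt)` method, any model wrapper exposing that method could be passed in without changing the task code.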
Where it's at Right Now (Progress & Next Steps)
I'd estimate that the core framework, the structure for benchmarks like MMLU/efficiency, data handling, and the visualization tools are largely complete, which puts the suite itself at roughly 60% of its planned features.
The next step, and the main piece remaining, is replacing the current 'mock' model interfaces (which were useful for building the structure) with code that loads and runs the actual language models (e.g., using Hugging Face Transformers or Keras).
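One way the mock-to-real swap could look is sketched below, assuming the Hugging Face Transformers route with a PyTorch backend. The class and function names (`TextModel`, `MockModel`, `HFModel`, `load_model`) are hypothetical and not taken from the repository. The `transformers` import is deferred to the constructor so that the mock path keeps working even when the library is not installed.

```python
from typing import Protocol


class TextModel(Protocol):
    """Interface both the mock and the real wrapper satisfy (hypothetical)."""

    def generate(self, prompt: str, max_new_tokens: int = 32) -> str: ...


class MockModel:
    """Returns a canned reply; useful while building the framework."""

    def __init__(self, canned: str = "A"):
        self.canned = canned

    def generate(self, prompt: str, max_new_tokens: int = 32) -> str:
        return self.canned


class HFModel:
    """Wraps a causal language model loaded through Hugging Face Transformers."""

    def __init__(self, model_id: str):
        # Imported lazily so the mock path has no hard dependency on transformers.
        from transformers import AutoModelForCausalLM, AutoTokenizer

        self.tokenizer = AutoTokenizer.from_pretrained(model_id)
        self.model = AutoModelForCausalLM.from_pretrained(model_id)

    def generate(self, prompt: str, max_new_tokens: int = 32) -> str:
        inputs = self.tokenizer(prompt, return_tensors="pt")
        output_ids = self.model.generate(**inputs, max_new_tokens=max_new_tokens)
        # Drop the prompt tokens so only the newly generated text is returned.
        new_tokens = output_ids[0][inputs["input_ids"].shape[1]:]
        return self.tokenizer.decode(new_tokens, skip_special_tokens=True)


def load_model(spec: str) -> TextModel:
    """Return the mock for "mock"; otherwise treat spec as a Hub model id."""
    if spec == "mock":
        return MockModel()
    return HFModel(spec)


model = load_model("mock")
print(model.generate("Which letter comes first?"))  # A
```

Passing a Hub model id instead of "mock" (for example `google/gemma-2b`, which requires accepting the model license on the Hub) would download the weights and run real generation, while the benchmark code calling `generate` stays unchanged.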
Check it out! The code is on GitHub: https://github.com/heilcheng/gemma-benchmark/
Since this is also part of my GSoC application showcase, I'd be especially grateful for any feedback or insights you might have! Ideas for features, suggestions, or just your general thoughts are all welcome.
Thanks for taking a look!
Hi,
This is a highly valuable and well-structured project for the Gemma community!
The creation of an open-source benchmarking suite designed for systematic evaluation is an essential contribution. We particularly appreciate the structured Core Engine supporting easy addition of benchmarks like MMLU and efficiency checks, alongside the automatic generation of visualizations.
This effort, especially as a GSoC showcase, is commendable. We look forward to the successful integration of the actual language model interfaces. Thank you for sharing your work; it promises to be a powerful asset.
Thanks.