
How can a fine-tuned model be exported to torch/transformers/gguf?

Open sirfz opened this issue 7 months ago • 3 comments

Is there a simple way to export the model to run on transformers or other inference engines?

sirfz • Jul 08 '25

Hi @sirfz ,

There are several targets you can export the Gemma models to, such as Hugging Face Transformers, llama.cpp (for CPU-centric or highly optimized local inference), and ONNX Runtime (for cross-platform deployment and hardware acceleration).

Hugging Face Transformers is arguably the simplest for Python users, especially if you have sufficient GPU memory. You just pip install the library and load the checkpoint with from_pretrained.
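
As a minimal sketch of that path, assuming the fine-tuned weights are already in Transformers format (the local directory name and dtype below are placeholders, not something from this thread):

```python
# Minimal sketch: load a Gemma checkpoint that is already in HF Transformers format.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_dir = "./gemma-finetuned"  # hypothetical local checkpoint directory

tokenizer = AutoTokenizer.from_pretrained(model_dir)
model = AutoModelForCausalLM.from_pretrained(
    model_dir,
    torch_dtype=torch.bfloat16,  # assumption: a single GPU with enough memory for bf16
    device_map="auto",
)

inputs = tokenizer("Explain GGUF in one sentence.", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```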

llama.cpp is simple if you're comfortable with command-line tools and want highly optimized CPU inference or highly quantized GPU inference. The initial conversion step can be a slight hurdle.
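
For llama.cpp, the usual route is to start from an HF-format checkpoint, convert it to GGUF, and optionally quantize it. A rough sketch driven from Python is below; the script and binary paths depend on your llama.cpp checkout and build, so treat them as assumptions:

```python
# Sketch: convert an HF-format checkpoint to GGUF with llama.cpp's conversion
# script, then quantize it. Paths are placeholders for your local setup.
import subprocess

subprocess.run(
    [
        "python", "llama.cpp/convert_hf_to_gguf.py",
        "path/to/hf-checkpoint",              # directory with config.json + safetensors
        "--outfile", "gemma-finetuned-f16.gguf",
        "--outtype", "f16",
    ],
    check=True,
)

# Quantize for smaller memory footprint (binary name/location varies by llama.cpp build).
subprocess.run(
    [
        "llama.cpp/build/bin/llama-quantize",
        "gemma-finetuned-f16.gguf",
        "gemma-finetuned-q4_k_m.gguf",
        "Q4_K_M",
    ],
    check=True,
)
```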

ONNX Runtime is simple for deployment once the model is exported, but the export process itself can sometimes require troubleshooting for very large or complex models. It offers great long-term flexibility.
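
One way to do the ONNX export from a Transformers-format checkpoint is via Hugging Face Optimum; the sketch below assumes optimum[onnxruntime] is installed and that the model architecture is supported by the exporter (support for the newest Gemma variants may lag):

```python
# Sketch: export an HF-format checkpoint to ONNX with Optimum and run it on ONNX Runtime.
from optimum.onnxruntime import ORTModelForCausalLM
from transformers import AutoTokenizer

model_dir = "./gemma-finetuned"  # hypothetical local checkpoint

tokenizer = AutoTokenizer.from_pretrained(model_dir)
ort_model = ORTModelForCausalLM.from_pretrained(model_dir, export=True)
ort_model.save_pretrained("./gemma-finetuned-onnx")  # reusable ONNX artifacts

inputs = tokenizer("Hello", return_tensors="pt")
outputs = ort_model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```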

Thanks.

My first approach was to check whether I could simply map the weights from my model (fine-tuned with jax gemma) to hf Gemma3, but I quickly found that the model architectures differ between the two, so the weights cannot be mapped 1-to-1. I understand it's still possible (and I did develop a plan with the help of Gemini but never had the time to actually try it). I was wondering if someone has already done it (I didn't find anything from hf or llama.cpp when I last checked).
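
For what it's worth, the general shape of such a conversion is a flatten-rename-transpose pass over the parameter tree. The sketch below is purely illustrative: every parameter name in it is hypothetical, and the real JAX Gemma and HF Gemma3 layouts (fused QKV, einsum kernels, norm scales, etc.) have to be checked layer by layer:

```python
# Illustrative sketch of mapping a Flax/JAX param tree into a PyTorch state_dict.
# All parameter names here are hypothetical examples, not the actual Gemma names.
import numpy as np
import torch
from flax.traverse_util import flatten_dict

def jax_params_to_state_dict(jax_params, name_map):
    """Flatten a JAX param tree and rename/transpose entries per an explicit mapping."""
    flat = flatten_dict(jax_params, sep="/")
    state_dict = {}
    for jax_name, array in flat.items():
        if jax_name not in name_map:
            raise KeyError(f"No mapping defined for {jax_name}")
        hf_name, needs_transpose = name_map[jax_name]
        tensor = torch.from_numpy(np.asarray(array))
        # JAX dense kernels are typically (in, out); torch.nn.Linear stores (out, in).
        if needs_transpose:
            tensor = tensor.T.contiguous()
        state_dict[hf_name] = tensor
    return state_dict

# Example mapping entry (names are made up for illustration only):
# name_map = {
#     "transformer/layer_0/attn/q_proj/kernel":
#         ("model.layers.0.self_attn.q_proj.weight", True),
# }
```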

I found jax gemma's fine-tuning approach to be very convenient and frankly quite performant, but I'm surprised there's no out-of-the-box portability for serving on different inference engines (any practical pointers would be appreciated!)

sirfz • Jul 23 '25

Yes, that's correct: the difference in architecture causes compatibility issues when saving/loading the model and exporting it to different engines. Please see the following reference to learn more about exporting models through ONNX. If you would like to export for an on-device use case, it's recommended to use TFLite exports.

Thanks.