
Model: Granite MoE shared

Open · gabe-l-hart opened this pull request 9 months ago • 1 comment

Description

This PR adds support for the GraniteMoEShared architecture, matching the implementation in transformers. The model is an iteration on top of GraniteMoE and adds a shared expert to each MoE layer.

NOTE: There is not yet a public model with this architecture available for testing, but it is a key building block for the just-released Granite 4 architecture.
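
For context, here is a minimal standalone sketch of what the shared expert adds to an MoE block. The dimensions, weights, and toy FFN are placeholders, and the routed-experts step is collapsed to a single call; this illustrates the idea, not llama.cpp's actual ggml graph code:

```cpp
// Toy illustration: the shared expert runs on every token and its output is
// added to the routed-experts output. All names and values are placeholders.
#include <cstddef>
#include <cstdio>
#include <vector>

// A tiny dense FFN standing in for one expert (or the shared expert).
static std::vector<float> ffn(const std::vector<float> & x, float scale) {
    std::vector<float> y(x.size());
    for (std::size_t i = 0; i < x.size(); ++i) {
        y[i] = scale * x[i];  // placeholder for the up/gate/down projections
    }
    return y;
}

int main() {
    std::vector<float> hidden = {0.1f, 0.2f, 0.3f, 0.4f};

    // Routed experts: in the real model a router picks top-k experts per token;
    // here that whole step is faked with a single "selected" expert.
    std::vector<float> moe_out = ffn(hidden, 2.0f);

    // Shared expert: always applied, independent of the router.
    std::vector<float> shexp_out = ffn(hidden, 0.5f);

    // GraniteMoEShared layer output = routed-experts output + shared-expert output.
    for (std::size_t i = 0; i < hidden.size(); ++i) {
        printf("%f\n", moe_out[i] + shexp_out[i]);
    }
    return 0;
}
```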

gabe-l-hart avatar May 02 '25 16:05 gabe-l-hart

@ngxson @ggerganov I've rebased this on master and made the following changes based on your suggestions:

  1. Used checks based on hparams.n_ff_shexp rather than the architecture string (this is definitely cleaner and more extensible; see the sketch after this list)
  2. Moved all granite model construction to llm_build_granite and removed granite-specific conditionals from llm_build_llama
    • NOTE: I also removed granite-specific conditionals from llm_build_deci, which seem to have been copy-pasted from llm_build_llama. I checked against the HF model config, and it doesn't appear that the Deci models use these scale factors.
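
A minimal sketch of the pattern in point 1, assuming a hyperparameter struct with an n_ff_shexp field (the struct and function names below are illustrative stand-ins, not the actual llama.cpp definitions). The shared-expert branch is gated on the hyperparameter being non-zero rather than on the architecture name:

```cpp
#include <cstdint>
#include <cstdio>
#include <string>

// Illustrative stand-in for the model hyperparameters.
struct toy_hparams {
    std::string arch;        // e.g. "granitemoe"
    uint32_t    n_ff_shexp;  // shared-expert FFN width; 0 means "no shared expert"
};

// Decide whether to build the shared-expert tensors for a layer.
static bool has_shared_expert(const toy_hparams & hp) {
    // Gate on the hyperparameter, not on a string compare like
    // hp.arch == "granitemoeshared": any architecture that sets
    // n_ff_shexp > 0 gets the shared expert.
    return hp.n_ff_shexp > 0;
}

int main() {
    toy_hparams plain  = { "granitemoe", 0 };
    toy_hparams shared = { "granitemoe", 1024 };

    printf("plain : shared expert = %s\n", has_shared_expert(plain)  ? "yes" : "no");
    printf("shared: shared expert = %s\n", has_shared_expert(shared) ? "yes" : "no");
    return 0;
}
```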

gabe-l-hart avatar May 09 '25 18:05 gabe-l-hart

I agree with @ngxson that adding a separate arch for MoE-shared is a bit redundant, but it's OK either way. The separate build function for Granite models is good refactoring.

I think we are ready to merge. @gabe-l-hart Are the models out?

ggerganov avatar May 12 '25 08:05 ggerganov

@ggerganov @ngxson Now that I think more about it, I agree that we should not use a separate architecture name for this. We're not currently planning to release models with this architecture by itself, but we will be using it for the attention layers in the Granite 4.0 models which are a hybrid of mamba2 and this architecture (Granite MoE w/ shared expert).

I'll consolidate the changes to remove the extra enum later today.

gabe-l-hart avatar May 12 '25 13:05 gabe-l-hart

@ggerganov @ngxson I've now removed GRANITE_MOE_SHARED as a standalone architecture and consolidated it into GRANITE_MOE. I've verified that conversion and inference work as expected with both a GraniteMoeForCausalLM and a GraniteMoeSharedForCausalLM model.
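
As a rough illustration of the consolidation (a standalone sketch; the mapping below is hypothetical, not the conversion script's actual table), both HF classes end up under a single architecture name, with the shared-expert variant distinguished only by its hyperparameters:

```cpp
#include <cstdio>
#include <map>
#include <string>

int main() {
    // Hypothetical mapping from HF model class to the GGUF architecture name:
    // after the consolidation there is no separate "granitemoeshared" entry.
    const std::map<std::string, std::string> hf_to_arch = {
        { "GraniteMoeForCausalLM",       "granitemoe" },
        { "GraniteMoeSharedForCausalLM", "granitemoe" },  // same arch, n_ff_shexp > 0
    };

    for (const auto & [hf_class, arch] : hf_to_arch) {
        printf("%-30s -> %s\n", hf_class.c_str(), arch.c_str());
    }
    return 0;
}
```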

gabe-l-hart avatar May 13 '25 03:05 gabe-l-hart