[Feature Request] Deterministic Base-Architecture Preference for Cross-Family Merges
Summary
When merging models from different architectural families (e.g., Llama + Qwen, Llama + Gemma), mergekit's architecture selection is non-deterministic, sometimes choosing the merge model's architecture instead of the base model's. This causes errors when the selected architecture requires tensors (like k_norm) that don't exist in the base model.
Current Behavior
Problem 1: Non-deterministic Architecture Selection
When merging cross-family models, MergeConfiguration.referenced_models() uses an unordered set, making architecture selection unpredictable. The output model may require tensors from the merge model's architecture that aren't present in the base model.
Example Error:

```
RuntimeError: Tensor model.layers.0.self_attn.k_norm.weight required but not present in model TsinghuaC3I/Llama-3-8B-UltraMedical
```
This occurs when merging Llama-based models (no k_norm/q_norm) with Qwen/Gemma models (have k_norm/q_norm).
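For illustration, the ordering issue and a common fix can be sketched outside of mergekit. The helper name `referenced_models_ordered` is hypothetical, not mergekit's API; the point is that an order-preserving dedup gives stable "first model" semantics where a `set` does not:

```python
def referenced_models_ordered(models):
    """Deduplicate while preserving the order models appear in the config.

    Unlike a set (whose iteration order depends on hashing), dict keys
    preserve insertion order in Python 3.7+, so the "first" model is
    always the first one referenced in the config.
    """
    return list(dict.fromkeys(models))

print(referenced_models_ordered(
    ["meta-llama/Llama-3-8B", "Qwen/Qwen2.5-7B", "meta-llama/Llama-3-8B"]
))
# ['meta-llama/Llama-3-8B', 'Qwen/Qwen2.5-7B']
```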
Problem 2: Required t Parameter Causes Issues
When using filtered slices for selective component merging, SLERP requires a t value for every tensor. This makes it difficult to merge only specific components while leaving others unchanged, as there's no graceful fallback mechanism.
Proposed Solution
1. Prefer Base Model Architecture (Primary Fix)
In mergekit/architecture/__init__.py, when multiple known architectures are present, explicitly prefer the base_model's architecture:
```python
# In get_architecture_info() function
if config.base_model is not None:
    try:
        idx = models.index(config.base_model)
        return model_arch_info[idx]
    except ValueError:
        # base_model not in referenced models; fall back to first
        pass
return model_arch_info[0]
```
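As a standalone, testable sketch of the same selection rule (the function name and signature here are illustrative, not mergekit's actual API):

```python
def select_architecture(model_arch_info, models, base_model):
    """Prefer the architecture of base_model when it is among the
    referenced models; otherwise fall back to the first entry."""
    if base_model is not None:
        try:
            return model_arch_info[models.index(base_model)]
        except ValueError:
            pass  # base_model not referenced; use the first architecture
    return model_arch_info[0]

# With a Llama base and a Qwen merge model, the Llama architecture is
# selected regardless of the order the models were enumerated in.
archs = ["QwenArch", "LlamaArch"]
models = ["Qwen/Qwen2.5-7B", "meta-llama/Llama-3-8B"]
print(select_architecture(archs, models, "meta-llama/Llama-3-8B"))  # LlamaArch
```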
Benefits:
- Deterministic: Same config always produces same architecture
- Logical: Output structure matches the base model
- Safe: Avoids requiring tensors that don't exist in base
- Backward compatible: Only affects multi-architecture merges
2. Make t Parameter Optional in SLERP (Secondary Enhancement)
In mergekit/merge_methods/slerp.py, allow t to be optional and gracefully fall back to base model weights:
```python
class SlerpTask(Task[torch.Tensor]):
    t: Optional[float]  # Changed from float

    def execute(self, **kwargs) -> torch.Tensor:
        # ... existing validation ...

        # Graceful fallback when t is None
        if self.t is None:
            return tensors[self.base_model]

        # ... rest of SLERP logic ...


class SlerpMerge(MergeMethod):
    def parameters(self) -> List[ConfigParameterDef]:
        return [ConfigParameterDef(name="t", required=False, default_value=None)]
```
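To make the fallback concrete, here is a minimal, framework-free SLERP on plain Python vectors. This is only a sketch of the `t=None` behavior; mergekit's real implementation operates on torch tensors and handles normalization and degenerate cases differently:

```python
import math

def slerp(t, v0, v1, eps=1e-8):
    """Spherical linear interpolation between v0 and v1 (as unit vectors).

    t=None means "keep the base model weight" and returns v0 untouched.
    """
    if t is None:
        return v0
    n0 = math.sqrt(sum(x * x for x in v0))
    n1 = math.sqrt(sum(x * x for x in v1))
    u0 = [x / n0 for x in v0]
    u1 = [x / n1 for x in v1]
    dot = max(-1.0, min(1.0, sum(a * b for a, b in zip(u0, u1))))
    omega = math.acos(dot)
    if omega < eps:
        # Nearly parallel vectors: fall back to linear interpolation
        return [(1 - t) * a + t * b for a, b in zip(v0, v1)]
    s0 = math.sin((1 - t) * omega) / math.sin(omega)
    s1 = math.sin(t * omega) / math.sin(omega)
    return [s0 * a + s1 * b for a, b in zip(u0, u1)]
```

With `t=None` the base tensor passes through unchanged even when the two models' tensors would otherwise be incompatible, which is exactly the behavior the filtered-slice use case needs.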
Benefits:
- Enables selective component merging without explicit t for every tensor
- Avoids shape/broadcast errors when models have incompatible tensor shapes
- Backward compatible: Existing configs with t values continue working
- Non-destructive: Returns base model weight when t=None
Use Case
This enhancement enables selective cross-family component merging, useful for:
- Knowledge Transfer: Merge specific capabilities (e.g., math reasoning) from one model family to another
- Cross-Layer Merging: Map different layer indices between models (e.g., base layer 5 ← merge layer 10)
- Architectural Safety: Preserve base model's architecture while selectively incorporating components from different families
Example Config:
```yaml
merge_method: slerp
base_model: meta-llama/Llama-3-8B
slices:
  - sources:
      - model: meta-llama/Llama-3-8B
        layer_range: [5, 6]
      - model: Qwen/Qwen2.5-7B
        layer_range: [10, 11]
parameters:
  t:
    - filter: mlp.gate_proj.weight
      value: 0.5
    - filter: mlp.up_proj.weight
      value: 0.5
    - filter: mlp.down_proj.weight
      value: 0.5
    # All other tensors (attention, norms) use base via t=None
```
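The filter semantics assumed above can be sketched as a simple substring match over the tensor name, returning `None` when nothing matches (`resolve_t` is a hypothetical helper; mergekit's actual parameter resolution is more involved):

```python
def resolve_t(tensor_name, t_filters, default=None):
    """Return the value of the first filter whose pattern occurs in the
    tensor name, or `default` (None => keep the base weight) otherwise."""
    for entry in t_filters:
        if entry["filter"] in tensor_name:
            return entry["value"]
    return default

t_filters = [
    {"filter": "mlp.gate_proj.weight", "value": 0.5},
    {"filter": "mlp.up_proj.weight", "value": 0.5},
    {"filter": "mlp.down_proj.weight", "value": 0.5},
]
print(resolve_t("model.layers.5.mlp.gate_proj.weight", t_filters))   # 0.5
print(resolve_t("model.layers.5.self_attn.q_proj.weight", t_filters))  # None
```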
Testing
I've validated this approach with a successful cross-family merge:
- Base: Llama-3-8B-UltraMedical (32 layers, Llama architecture)
- Merge: II-Medical-8B (36 layers, Qwen architecture)
- Result: Successfully merged attention components from selected layers, output maintains Llama architecture
The merged model (~18GB):
- Has the correct architecture (`LlamaForCausalLM`)
- Contains no `k_norm` tensors (Qwen-specific)
- Loads and works correctly
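A post-merge sanity check along these lines can be expressed over the output's parsed `config.json` and the `weight_map` from its safetensors index (an illustrative script, not part of the proposal):

```python
def check_merge_output(config, weight_map):
    """Verify the Llama architecture survived the merge and no
    Qwen-specific norm tensors leaked into the output.

    config: parsed config.json as a dict
    weight_map: the "weight_map" dict from model.safetensors.index.json
    """
    assert config["architectures"] == ["LlamaForCausalLM"]
    leaked = [n for n in weight_map if "k_norm" in n or "q_norm" in n]
    assert not leaked, f"Qwen-specific tensors present: {leaked}"
    return True
```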
Implementation Impact
Minimal Changes Required:
- `mergekit/architecture/__init__.py`: ~9 lines added
- `mergekit/merge_methods/slerp.py`: ~12 lines modified
- Total: ~21 lines of code
Backward Compatibility:
- ✅ Same-family merges: No change in behavior
- ✅ Existing configs: Continue working as before
- ✅ Cross-family merges: Now deterministic and base-aligned
Alternative Approaches Considered:
- Config-level architecture override: Would require more extensive changes and API additions
- Fix referenced_models() ordering: Unreliable due to set semantics
- Post-merge tensor pruning: Treats symptoms, not root cause; risky
Related Issues
This relates to cross-architecture merging challenges discussed in the community. While --allow-crimes permits mixing architectures, it doesn't control which architecture is selected.
Questions for Maintainers
- Is preferring base_model's architecture acceptable default behavior?
- Would you prefer an explicit config option (e.g., `architecture_source: base`) instead?
- Are there use cases where the merge model's architecture should be preferred?
Additional Context
I have a working implementation of these changes in a fork and can provide:
- Complete diff/patch
- Additional test cases
- Example configs demonstrating the feature
Happy to discuss implementation details or adjust the approach based on maintainer preferences.
Environment:
- Use case: Medical/scientific layer-wise model merging with different architectures
- Hardware: Multi-GPU setup for large model merging