[Feature Request] Deterministic Base-Architecture Preference for Cross-Family Merges

Open ikhyunAn opened this issue 5 months ago • 0 comments

Summary

When merging models from different architectural families (e.g., Llama + Qwen, Llama + Gemma), mergekit's architecture selection is non-deterministic, sometimes choosing the merge model's architecture instead of the base model's. This causes errors when the selected architecture requires tensors (like k_norm) that don't exist in the base model.

Current Behavior

Problem 1: Non-deterministic Architecture Selection

When merging cross-family models, architecture selection depends on the ordering of MergeConfiguration.referenced_models(), which is built from an unordered set, so the architecture that gets picked is unpredictable. The output model may then require tensors from the merge model's architecture that aren't present in the base model.
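A minimal illustration of why set ordering can't be relied on (plain Python, not mergekit code): with hash randomization enabled, which is the interpreter default, the element that comes out of a set of model names first can change between runs.

# Run this twice in separate interpreter sessions: with Python's default hash
# randomization, the set's iteration order (and therefore which model "wins"
# architecture selection) can differ from run to run.
referenced = {"TsinghuaC3I/Llama-3-8B-UltraMedical", "Qwen/Qwen2.5-7B"}
print(next(iter(referenced)))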

Example Error:

RuntimeError: Tensor model.layers.0.self_attn.k_norm.weight required but not present in model TsinghuaC3I/Llama-3-8B-UltraMedical

This occurs when merging Llama-based models (which have no k_norm/q_norm tensors) with Qwen or Gemma models (which do).

Problem 2: Mandatory t Parameter Blocks Selective Merging

When using filtered slices for selective component merging, SLERP requires a t value for every tensor. This makes it difficult to merge only specific components while leaving others unchanged, as there's no graceful fallback mechanism.
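As a rough sketch (a hypothetical helper, not mergekit's actual resolution code), the behavior can be modeled as a lookup that matches tensor names against filters: today a non-matching tensor still needs a t value, whereas the proposal would let it resolve to None and fall back to the base model's weights.

from typing import List, Optional, Tuple

# Hypothetical, simplified model of filter-based parameter resolution.
def resolve_t(tensor_name: str, filters: List[Tuple[str, float]],
              default: Optional[float] = None) -> Optional[float]:
    for pattern, value in filters:
        if pattern in tensor_name:
            return value
    return default  # proposed: None means "keep the base model's tensor"

filters = [("mlp.gate_proj.weight", 0.5), ("mlp.up_proj.weight", 0.5)]
print(resolve_t("model.layers.5.mlp.gate_proj.weight", filters))      # 0.5
print(resolve_t("model.layers.5.self_attn.q_proj.weight", filters))   # None -> base weights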

Proposed Solution

1. Prefer Base Model Architecture (Primary Fix)

In mergekit/architecture/__init__.py, when multiple known architectures are present, explicitly prefer the base_model's architecture:

# In get_architecture_info(); `models` is the list of referenced models and
# `model_arch_info` is the parallel list of architecture infos resolved for them.
if config.base_model is not None:
    try:
        idx = models.index(config.base_model)
        return model_arch_info[idx]
    except ValueError:
        # base_model not among the referenced models; fall back to the first entry
        pass
return model_arch_info[0]
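
For illustration only (plain strings stand in for mergekit's ModelReference and architecture-info objects), the rule always resolves to the architecture aligned with base_model, regardless of how the referenced models happen to be ordered:

# Hypothetical stand-ins for the real ModelReference / architecture-info objects.
models = ["Qwen/Qwen2.5-7B", "meta-llama/Llama-3-8B"]        # referenced models, arbitrary order
model_arch_info = ["Qwen2ForCausalLM", "LlamaForCausalLM"]   # parallel architecture names
base_model = "meta-llama/Llama-3-8B"

idx = models.index(base_model)
print(model_arch_info[idx])  # LlamaForCausalLM, independent of ordering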

Benefits:

  • Deterministic: Same config always produces same architecture
  • Logical: Output structure matches the base model
  • Safe: Avoids requiring tensors that don't exist in base
  • Backward compatible: Only affects multi-architecture merges

2. Make t Parameter Optional in SLERP (Secondary Enhancement)

In mergekit/merge_methods/slerp.py, allow t to be optional and gracefully fall back to base model weights:

from typing import List, Optional

import torch

# Task, MergeMethod, and ConfigParameterDef are mergekit's existing internals.

class SlerpTask(Task[torch.Tensor]):
    t: Optional[float]  # Changed from float

    def execute(self, **kwargs) -> torch.Tensor:
        # ... existing validation; `tensors` (model -> weight) is built from kwargs ...

        # Graceful fallback when t is None: return the base model's tensor unchanged
        if self.t is None:
            return tensors[self.base_model]

        # ... rest of SLERP logic ...

class SlerpMerge(MergeMethod):
    def parameters(self) -> List[ConfigParameterDef]:
        return [ConfigParameterDef(name="t", required=False, default_value=None)]

Benefits:

  • Enables selective component merging without explicit t for every tensor
  • Avoids shape/broadcast errors when models have incompatible tensor shapes
  • Backward compatible: Existing configs with t values continue working
  • Non-destructive: Returns base model weight when t=None

Use Case

This enhancement enables selective cross-family component merging, useful for:

  1. Knowledge Transfer: Merge specific capabilities (e.g., math reasoning) from one model family to another
  2. Cross-Layer Merging: Map different layer indices between models (e.g., base layer 5 ← merge layer 10)
  3. Architectural Safety: Preserve base model's architecture while selectively incorporating components from different families

Example Config:

merge_method: slerp
base_model: meta-llama/Llama-3-8B
slices:
  - sources:
      - model: meta-llama/Llama-3-8B
        layer_range: [5, 6]
      - model: Qwen/Qwen2.5-7B
        layer_range: [10, 11]
    parameters:
      t:
        - filter: mlp.gate_proj.weight
          value: 0.5
        - filter: mlp.up_proj.weight
          value: 0.5
        - filter: mlp.down_proj.weight
          value: 0.5
        # All other tensors (attention, norms) use base via t=None

Testing

I've validated this approach with a successful cross-family merge:

  • Base: Llama-3-8B-UltraMedical (32 layers, Llama architecture)
  • Merge: II-Medical-8B (36 layers, Qwen architecture)
  • Result: Successfully merged attention components from selected layers, output maintains Llama architecture

The merged model (~18GB):

  • Has correct architecture (LlamaForCausalLM)
  • Contains no k_norm tensors (Qwen-specific)
  • Loads and works correctly (a quick sanity check is sketched below)
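
A post-merge sanity check along these lines can confirm both points above (assuming the merged model was written to ./merged as sharded safetensors; the path is illustrative):

import json
from transformers import AutoConfig

cfg = AutoConfig.from_pretrained("./merged")
assert cfg.architectures == ["LlamaForCausalLM"]

# Sharded safetensors output includes an index listing every tensor name.
with open("./merged/model.safetensors.index.json") as f:
    names = json.load(f)["weight_map"].keys()
assert not any("k_norm" in n or "q_norm" in n for n in names)
print("Architecture and tensor names match the Llama base.")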

Implementation Impact

Minimal Changes Required:

  • mergekit/architecture/__init__.py: ~9 lines added
  • mergekit/merge_methods/slerp.py: ~12 lines modified
  • Total: ~21 lines of code

Backward Compatibility:

  • ✅ Same-family merges: No change in behavior
  • ✅ Existing configs: Continue working as before
  • ✅ Cross-family merges: Now deterministic and base-aligned

Alternative Approaches Considered:

  1. Config-level architecture override: Would require more extensive changes and API additions
  2. Fix referenced_models() ordering: Unreliable due to set semantics
  3. Post-merge tensor pruning: Treats symptoms, not root cause; risky

Related Issues

This relates to cross-architecture merging challenges discussed in the community. While --allow-crimes permits mixing architectures, it doesn't control which architecture is selected.

Questions for Maintainers

  1. Is preferring base_model's architecture acceptable default behavior?
  2. Would you prefer an explicit config option (e.g., architecture_source: base) instead?
  3. Are there use cases where the merge model's architecture should be preferred?

Additional Context

I have a working implementation of these changes in a fork and can provide:

  • Complete diff/patch
  • Additional test cases
  • Example configs demonstrating the feature

Happy to discuss implementation details or adjust the approach based on maintainer preferences.


Environment:

  • Use case: Medical/scientific layer-wise model merging with different architectures
  • Hardware: Multi-GPU setup for large model merging

ikhyunAn · Nov 04 '25 18:11