
Wrong answer in the merged model weights

tianshuocong opened this issue 2 years ago · 12 comments

Hi! Thanks for your great work!

I have two questions.

(1) When I use the following setting

models:
  - model: /data2/model/Quantize/llama2-chat_normal
    parameters:
      weight: 0.1
  - model: /data2/model/Quantize/llama2-chat_normal
    parameters:
      weight: 1.0
merge_method: linear
dtype: float32

and I print the detailed weights as

print("model1:")
print(model_1.state_dict()['model.embed_tokens.weight'][0,0:3])

print("model2:")
print(model_2.state_dict()['model.embed_tokens.weight'][0,0:3])

print("target:")
print(model_1.state_dict()['model.embed_tokens.weight'][0,0:3]*0.1 + 1.0*model_2.state_dict()['model.embed_tokens.weight'][0,0:3])

print("result:")
print(merged_model.state_dict()['model.embed_tokens.weight'][0,0:3])

The merged model weights differ from my target, which confuses me (I also tried normalize: false).

model1:  tensor([ 1.1921e-06, -1.7881e-06, -4.2915e-06])
model2:  tensor([ 1.1921e-06, -1.7881e-06, -4.2915e-06])
target:  tensor([ 1.3113e-06, -1.9670e-06, -4.7207e-06])
result:  tensor([ 1.1921e-06, -1.7881e-06, -4.2915e-06])

(2) My next question is how to merge llama-2-7b-chat and wizardmath-7b-v1.0. Although they are both fine-tuned from llama-2-7b, their architectures differ:

  • wizardmath-7b-v1.0:
LlamaForCausalLM(
  (model): LlamaModel(
    (embed_tokens): Embedding(32001, 4096, padding_idx=0)
  • llama-2-7b-chat:
LlamaForCausalLM(
  (model): LlamaModel(
    (embed_tokens): Embedding(32000, 4096, padding_idx=0)

Therefore, when I run inference with the merged model, it raises an error:

ValueError: Trying to set a tensor of shape torch.Size([32000, 4096]) in "weight" (which has shape torch.Size([32001, 4096])), this look incorrect.

However, I found that sometimes the merge succeeds, which seems random. Does that mean the merging process is unstable?

tianshuocong avatar Mar 26 '24 14:03 tianshuocong

Read about tokenizer merging here.

Or, as a quick fix, try tokenizer_source: union.
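For reference, tokenizer_source is a top-level key in the merge config, alongside merge_method and dtype (a minimal fragment, not a complete config):

merge_method: linear
dtype: float32
tokenizer_source: union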

NeonBohdan avatar Mar 26 '24 16:03 NeonBohdan


Hi! Thanks for your help.

I printed the detailed weights. With linear the answer is wrong, but with task_arithmetic it does not raise errors like ValueError: Trying to set a tensor of shape torch.Size([32000, 4096]) in "weight" (which has shape torch.Size([32001, 4096])), and the detailed weights are correct. So I suspect there is a bug in the implementation of the linear method.

tianshuocong avatar Mar 26 '24 16:03 tianshuocong

I cannot reproduce the first issue. Please double-check that normalize is set to false; with normalization disabled, linear should produce exactly the tensor you expect.
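For reference, here is a minimal sketch of the arithmetic (assuming linear normalizes by the sum of the weights by default, which is what the numbers above suggest):

import torch

# Both inputs are the same model, so w1 == w2; values are the first
# three embedding entries from the report above.
w = torch.tensor([1.1921e-06, -1.7881e-06, -4.2915e-06])
weights = [0.1, 1.0]
weighted_sum = weights[0] * w + weights[1] * w

# normalize: true divides by the sum of the weights, which here
# gives back w itself -- the "result" tensor above.
print(weighted_sum / sum(weights))  # tensor([ 1.1921e-06, -1.7881e-06, -4.2915e-06])

# normalize: false keeps the raw weighted sum -- the "target" tensor above.
print(weighted_sum)                 # tensor([ 1.3113e-06, -1.9670e-06, -4.7207e-06])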

Regarding the second issue, when tokenizer_source is empty, it results in the legacy behavior:

  • The merged model will always use the first (base) model’s vocab_size, which is 32001 if wizardmath-7b-v1.0 is your first model. https://github.com/arcee-ai/mergekit/blob/4ecb205d191a9d76c50ab166ae05712619709277/mergekit/merge.py#L154-L157
  • The embedding layers will be truncated to the smallest size present in the merge (i.e., 32000).

The inconsistency between vocab_size and the shape of the embedding layers would prevent you from loading the merged model.

To solve this, you can either:

  • Change vocab_size in the config.json of your merged model to 32000 (see the sketch after this list), or
  • Specify tokenizer_source: model:meta/llama-2-7b-chat and merge your models again. (using union will not work because the length of the unioned tokenizer is 32001, while linear merging will always truncate the embedding layers to 32000) https://github.com/arcee-ai/mergekit/blob/4ecb205d191a9d76c50ab166ae05712619709277/mergekit/merge.py#L94-L95
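A minimal sketch of the first option (the merged-model directory name is a placeholder):

import json

config_path = "merged-model/config.json"

with open(config_path) as f:
    config = json.load(f)

# Shrink vocab_size to match the truncated embedding layers.
config["vocab_size"] = 32000

with open(config_path, "w") as f:
    json.dump(config, f, indent=2)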

eggry avatar Mar 29 '24 00:03 eggry

@eggry I am facing the same issue and already tried what you suggested, changing the vocab_size in config.json, but that gives another error:

self.embed_tokens = nn.Embedding(config.vocab_size, config.hidden_size, self.padding_idx)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/sparse.py", line 134, in __init__
    assert padding_idx < self.num_embeddings, 'Padding_idx must be within num_embeddings'
AssertionError: Padding_idx must be within num_embeddings

monk1337 avatar Apr 19 '24 07:04 monk1337

Hello @monk1337, after looking into the LLaMA-2 model, I've noticed that your merged tokenizer might explicitly specify a pad_token. To confirm this, please check whether your tokenizer_config.json sets pad_token to a non-null value AND this token is present in your added_tokens.json. If that is the case, resetting it to null could resolve your problem. If this pad token is necessary for your workflow, you can manually add it back after the tokenizer is loaded (see the sketch below), as suggested in the LLaMA-2 documentation.
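A minimal sketch of adding the pad token back after loading (the model path and the "[PAD]" token string are placeholders):

from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("merged-model")
model = AutoModelForCausalLM.from_pretrained("merged-model")

# Register a pad token, then resize the embeddings to match the new vocab size.
tokenizer.add_special_tokens({"pad_token": "[PAD]"})
model.resize_token_embeddings(len(tokenizer))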

eggry avatar Apr 19 '24 08:04 eggry

@eggry Yes, I just checked and

tokenizer_config.json sets the "pad_token": "<|end_of_turn|>",

and the added_tokens.json looks like this

{
  "<|end_of_turn|>": 32000,
  "<|pad_0|>": 32001
}

Now tokenizer_config.json sets "pad_token": null, config.json sets "vocab_size": 32000, and I deleted the contents of added_tokens.json.

The error is the same.

monk1337 avatar Apr 19 '24 08:04 monk1337

Hello @monk1337, maybe your model's config.json also specifies a pad_token_id. If that is the case, removing this entry or changing its value to -1 may resolve the error.

eggry avatar Apr 19 '24 08:04 eggry

@eggry, it worked. Which solution is better?

  1. Changing the vocab size and padding settings in an already merged model, or
  2. Specifying a tokenizer with the larger vocab in tokenizer_source during merging?

monk1337 avatar Apr 19 '24 08:04 monk1337

@monk1337, Personally, I prefer to:

  1. Carefully specify the model order in base_model and models so that the model with the smallest embedding size comes first (the merged model's configuration will then be based on that model), and
  2. Set tokenizer_source to that same model.

However, I suspect there is no universal solution in the existing implementation for merging models whose tokenizers/embedding layers differ: the model configuration, embedding layer, tokenizer, and lm_head all matter.
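As an illustration, a config sketch of this recipe for the models discussed earlier in the thread (the public Hugging Face IDs are assumptions; llama-2-7b-chat goes first because its 32000-token vocab is the smallest):

models:
  - model: meta-llama/Llama-2-7b-chat-hf   # smallest embedding (32000) first
    parameters:
      weight: 0.5
  - model: WizardLM/WizardMath-7B-V1.0
    parameters:
      weight: 0.5
merge_method: linear
tokenizer_source: model:meta-llama/Llama-2-7b-chat-hf
dtype: float32

Per the legacy behavior described above, the merged model's configuration will then be based on the first entry.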

eggry avatar Apr 19 '24 09:04 eggry

@eggry This recipe makes sense. I'm going to try it. Thank you, it was quite helpful!

monk1337 avatar Apr 19 '24 09:04 monk1337

@eggry Sorry to bug you again. I am trying to merge a Llama-3 model and a Starling model, following what you suggested, but I'm getting an error.

slices:
  - sources:
      - model: Nexusflow/Starling-LM-7B-beta
        layer_range: [0, 32]
      - model: meta-llama/Meta-Llama-3-8B
        layer_range: [0, 32]

merge_method: slerp
base_model: Nexusflow/Starling-LM-7B-beta
parameters:
  t:
    - filter: self_attn
      value: [0, 0.5, 0.3, 0.7, 1]
    - filter: mlp
      value: [1, 0.5, 0.7, 0.3, 0]
    - value: 0.5
dtype: float16
tokenizer_source: model:Nexusflow/Starling-LM-7B-beta

This is the error:

  File "/workspace/axolotl/out/mergekit/mergekit/merge_methods/tokenizer_permute.py", line 88, in execute
    torch.tensor(weights, dtype=expanded.dtype, device=expanded.device)
TypeError: must be real number, not NoneType

monk1337 avatar Apr 20 '24 03:04 monk1337

I was facing the same problem with linear merge. I solved it with the tokenizer merge suggested by @NeonBohdan, and it works now.

uygarkurt avatar May 05 '24 21:05 uygarkurt