
bart-large-xsum model: There were missing keys in the checkpoint model loaded: ['model.encoder.embed_tokens.weight', 'model.decoder.embed_tokens.weight', 'lm_head.weight'].

Open Aisuko opened this issue 1 year ago • 20 comments

System Info

  • transformers version: 4.37.2
  • Platform: Linux-5.15.133+-x86_64-with-glibc2.31
  • Python version: 3.10.13
  • Huggingface_hub version: 0.20.3
  • Safetensors version: 0.4.2
  • Accelerate version: 0.26.1
  • Accelerate config: not found
  • PyTorch version (GPU?): 2.1.2 (True)
  • Tensorflow version (GPU?): 2.15.0 (True)
  • Flax version (CPU?/GPU?/TPU?): 0.7.5 (gpu)
  • Jax version: 0.4.23
  • JaxLib version: 0.4.23.dev20240116
  • Using GPU in script?:
  • Using distributed or parallel set-up in script?:

Who can help?

No response

Information

  • [ ] The official example scripts
  • [X] My own modified scripts

Tasks

  • [X] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • [ ] My own task or dataset (give details below)

Reproduction

Thanks for your great work.

Please take a look at the notebook below in Kaggle. https://www.kaggle.com/code/aisuko/text-summarization-with-bart-series-llm/notebook

After the training process finishes, it shows the warning messages below:

Some non-default generation parameters are set in the model config. These should go into a GenerationConfig file (https://huggingface.co/docs/transformers/generation_strategies#save-a-custom-decoding-strategy-with-your-model) instead. This warning will be raised to an exception in v4.41.
Non-default generation parameters: {'max_length': 62, 'min_length': 11, 'early_stopping': True, 'num_beams': 6, 'no_repeat_ngram_size': 3, 'forced_eos_token_id': 2}
There were missing keys in the checkpoint model loaded: ['model.encoder.embed_tokens.weight', 'model.decoder.embed_tokens.weight', 'lm_head.weight'].
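
For reference, the first warning can be addressed by moving those parameters into a GenerationConfig saved next to the model, as the linked docs describe (a minimal sketch using the values from the warning above; the output directory name is just the trainer's output_dir from the notebook):

from transformers import GenerationConfig

# Values copied from the warning above; save into the trainer's output_dir
gen_config = GenerationConfig(
    max_length=62,
    min_length=11,
    early_stopping=True,
    num_beams=6,
    no_repeat_ngram_size=3,
    forced_eos_token_id=2,
)
gen_config.save_pretrained("ft-facebook-bart-large-xsum-on-samsum")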

The fine-tuned model also cannot be used for inference. I saw a similar issue: https://github.com/huggingface/transformers/issues/27972

Expected behavior

No warning is raised, and I can use the fine-tuned model for inference.

Aisuko avatar Feb 20 '24 07:02 Aisuko

cc @ArthurZucker @younesbelkada

amyeroberts avatar Feb 20 '24 12:02 amyeroberts

Hey @Aisuko, could you provide a minimal reproducer? That would help us! Also note that the generation parameters issue can probably be safely ignored. The missing keys are, however, a bit more problematic! It might be tied weights that are not tied properly; is tie_word_embeddings used?

ArthurZucker avatar Feb 20 '24 13:02 ArthurZucker
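
A quick way to check whether the embeddings are tied, as asked above (a minimal sketch, assuming the facebook/bart-large-xsum checkpoint; tied weights share the same underlying storage):

from transformers import BartForConditionalGeneration

model = BartForConditionalGeneration.from_pretrained("facebook/bart-large-xsum")
print(model.config.tie_word_embeddings)  # expected to be True for BART checkpoints
# If the input embeddings and lm_head point at the same storage, they are tied
print(model.get_input_embeddings().weight.data_ptr() == model.get_output_embeddings().weight.data_ptr())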

Hi, guys. Thanks for your quick response.

The minimal code is below; it only includes the data processing and training steps, and we can get the same result from it. https://www.kaggle.com/code/aisuko/minimal-reproducer-for-issue-29128/notebook

The embedding process does not use the tie_word_embeddings parameter.

libraries

!pip install transformers==4.37.2
!pip install datasets==2.17.0
!pip install evaluate==0.4.1
!pip install rouge-score==0.1.2

Code

# Import libraries
import os
import re
import nltk
import pandas as pd
import numpy as np
import warnings
from datasets import Dataset
from datasets import load_metric
from transformers import BartTokenizer, BartForConditionalGeneration
from transformers import DataCollatorForSeq2Seq
from transformers import Seq2SeqTrainingArguments, Seq2SeqTrainer

os.environ['MODEL']='facebook/bart-large-xsum'
os.environ["WANDB_NAME"] = "ft-facebook-bart-large-xsum-on-samsum"

warnings.filterwarnings('ignore')

# Loading and preprocessing data from https://www.kaggle.com/datasets/nileshmalode1/samsum-dataset-text-summarization
train=pd.read_csv('/kaggle/input/samsum-dataset-text-summarization/samsum-train.csv')
test=pd.read_csv('/kaggle/input/samsum-dataset-text-summarization/samsum-test.csv')
val=pd.read_csv('/kaggle/input/samsum-dataset-text-summarization/samsum-validation.csv')

def clean_tags(text):
    clean=re.compile('<.*?>') # compile a regex matching HTML-style tags
    clean=re.sub(clean, '', text) # replace the tags with an empty string
    
    # removing empty dialogues (lines that contain only a speaker name followed by a colon)
    clean='\n'.join([line for line in clean.split('\n') if not re.match(r'.*:\s*$', line)])
    return clean

def clean_df(df, cols):
    for col in cols:
        df[col]=df[col].fillna('').apply(clean_tags)
    return df

train=clean_df(train, ['dialogue','summary'])
test=clean_df(test, ['dialogue', 'summary'])
val=clean_df(val, ['dialogue', 'summary'])

train_ds=Dataset.from_pandas(train)
test_ds=Dataset.from_pandas(test)
val_ds=Dataset.from_pandas(val)

# Tokenizer
tokenizer=BartTokenizer.from_pretrained(os.getenv('MODEL'))

def preprocess_func(example):
    # Iterate over every `dialogue` in the dataset and use it as input to the model
    inputs=[doc for doc in example['dialogue']]
    # The tokenizer converts the input dialogues into tokens the BART model understands.
    # truncation=True caps every dialogue at max_length=1024 tokens
    model_inputs=tokenizer(inputs, max_length=1024, truncation=True)
    
    # Setup the tokenizer for targets
    with tokenizer.as_target_tokenizer():
        # Tokenize the target summaries; summaries are expected to be much shorter than the dialogues, hence max_length=128
        labels=tokenizer(example['summary'], max_length=128, truncation=True)
    
    # Add the tokenized labels to the preprocessed dataset, alongside the tokenized inputs
    model_inputs['labels']=labels['input_ids']
    return model_inputs


tokenized_train= train_ds.map(preprocess_func, batched=True, remove_columns=['id', 'dialogue', 'summary'])
tokenized_test=test_ds.map(preprocess_func, batched=True, remove_columns=['id', 'dialogue', 'summary'])
tokenized_val=val_ds.map(preprocess_func, batched=True, remove_columns=['id', 'dialogue', 'summary'])

# Loading the model
model=BartForConditionalGeneration.from_pretrained(os.getenv('MODEL'))

# Loading DataCollator
data_collator= DataCollatorForSeq2Seq(tokenizer=tokenizer, model=model)

# Customizing metrics
metric=load_metric('rouge')

nltk.download('punkt')  # download the punkt tokenizer data used by nltk.sent_tokenize to split text into sentences

def compute_metrics(eval_pred):
    predictions, labels=eval_pred # obtaining predictions and true labels
    
    # decoding predictions
    decoded_preds=tokenizer.batch_decode(predictions, skip_special_tokens=True)
    
    # obtaining the true labels tokens, while eliminating any possible masked token (i.e: label=-100)
    labels=np.where(labels!=-100, labels, tokenizer.pad_token_id)
    decoded_labels=tokenizer.batch_decode(labels, skip_special_tokens=True)
    
    # rouge expects a newline after each sentence
    decoded_preds=['\n'.join(nltk.sent_tokenize(pred.strip())) for pred in decoded_preds]
    decoded_labels=['\n'.join(nltk.sent_tokenize(label.strip())) for label in decoded_labels]
    
    # computing rouge score
    result=metric.compute(predictions=decoded_preds, references=decoded_labels, use_stemmer=True)
    result={key: value.mid.fmeasure*100 for key, value in result.items()} # extracting some results
    
    # add mean generated length
    prediction_lens=[np.count_nonzero(pred!=tokenizer.pad_token_id) for pred in predictions]
    result['gen_len']=np.mean(prediction_lens)
    return {k: round(v,4) for k,v in result.items()}


# Training
training_args=Seq2SeqTrainingArguments(
    output_dir=os.getenv('WANDB_NAME'),
    evaluation_strategy='epoch',
    save_strategy='epoch',
    load_best_model_at_end=True,
    metric_for_best_model='eval_loss',
    seed=42,
    learning_rate=2e-5,
    max_steps=100,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    gradient_accumulation_steps=4,
    weight_decay=0.01,
    save_total_limit=2,
    num_train_epochs=1, # only for testing
    predict_with_generate=True,
    fp16=True,
    report_to='none',
)

trainer=Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_train,
    eval_dataset=tokenized_test,
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics
)

trainer.train()

Aisuko avatar Feb 21 '24 03:02 Aisuko
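
As a sanity check after trainer.train() finishes, the saved checkpoint can be reloaded and used for a single generation (a sketch; the checkpoint-100 path is hypothetical and depends on save_strategy and max_steps):

ckpt = "ft-facebook-bart-large-xsum-on-samsum/checkpoint-100"  # hypothetical checkpoint path
reloaded = BartForConditionalGeneration.from_pretrained(ckpt)
sample = tokenizer("Amanda: I baked cookies. Do you want some?", return_tensors="pt")
summary_ids = reloaded.generate(**sample, max_length=62, num_beams=6)
print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))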

Sorry, could you just push a model to the hub, (the one you trained). No need for the full training loop

ArthurZucker avatar Feb 23 '24 08:02 ArthurZucker
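
Pushing the model trained by the reproducer above would look roughly like this (a sketch; assumes a valid Hugging Face token, and the repo name matches the model shared later in the thread):

model.push_to_hub("aisuko/ft-facebook-bart-large-xsum-on-samsum")
tokenizer.push_to_hub("aisuko/ft-facebook-bart-large-xsum-on-samsum")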

I solved this after downgrading Transformers to 4.2x

QishengL avatar Feb 27 '24 23:02 QishengL

Sorry, could you just push a model to the hub, (the one you trained). No need for the full training loop

Here you are: https://huggingface.co/aisuko/ft-facebook-bart-large-xsum-on-samsum. For the training parameters, please see the notebook above.

Aisuko avatar Feb 29 '24 04:02 Aisuko

I solved this after downgrading Transformers to 4.2x

Good for you

Aisuko avatar Feb 29 '24 04:02 Aisuko

Is there any solution to this?

humanely avatar Mar 09 '24 08:03 humanely

@humanely do you have the exact same issue? If not, then open a separate issue. 1. The checkpoint you have did not save ['lm_head.weight', 'model.decoder.embed_tokens.weight']. Now if you use tie_word_embeddings or anything like that, then when saving the safetensors file you would have gotten a warning saying that duplicated memory is removed. 2. When I loaded, here is what I got:


In [1]: from transformers import AutoModelForCausalLM
In [2]: model = AutoModelForCausalLM.from_pretrained("aisuko/ft-facebook-bart-large-xsum-on-samsum")
config.json: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1.59k/1.59k [00:00<00:00, 14.2MB/s]
model.safetensors: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▉| 1.63G/1.63G [00:04<00:00, 349MB/s]
Some weights of BartForCausalLM were not initialized from the model checkpoint at aisuko/ft-facebook-bart-large-xsum-on-samsum and are newly initialized: ['lm_head.weight', 'model.decoder.embed_tokens.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
generation_config.json: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 274/274 [00:00<00:00, 2.64MB/s]

So these are already not the same missing weights. I cannot debug this for you; when you save the checkpoint, make sure the versions of the code are the same. It might be a save_pretrained / from_pretrained issue, but I would need a very simple reproducer, without the whole training pipeline, which should not influence this.

ArthurZucker avatar Mar 27 '24 03:03 ArthurZucker
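
One way to see which tensors actually ended up in a saved checkpoint is to list the keys of model.safetensors directly (a sketch; assumes the file has been downloaded locally, e.g. from the hub repo above):

from safetensors import safe_open

with safe_open("model.safetensors", framework="pt") as f:
    keys = list(f.keys())
# Tied tensors that were deduplicated at save time will be missing from this list
print([k for k in keys if "embed_tokens" in k or "lm_head" in k or "shared" in k])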

I have run into this exact same issue today (with the exact same list of missing fields) but with a different model facebook/nllb-200-distilled-600M. I took the time to create a test repository for someone at HF to clone: https://github.com/vanguardapps/cs224u-exploration/tree/hf-test (please use the hf-test branch). See all version specifications here: https://github.com/vanguardapps/cs224u-exploration/blob/hf-test/requirements.txt.

To run:

git clone git@github.com:vanguardapps/cs224u-exploration.git
cd cs224u-exploration
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
python main.py

Here is the output I received after running only once, which is peculiar: I assumed this was an issue when loading a checkpoint, but somehow it appears to be happening when saving as well. Anyway, if needed, you can run the code a second time to see what happens when it tries to load the checkpoint it saved the first time (it will adjust; no need to change the code when running twice in a row):

(.env) roy@FRIDAY: python main.py
CUDA is available. Using CUDA.
Map (num_proc=8): 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 350/350 [00:02<00:00, 148.06 examples/s]
/home/roy/.local/lib/python3.10/site-packages/datasets/table.py:1421: FutureWarning: promote has been superseded by promote_options='default'.
  table = cls._concat_blocks(blocks, axis=0)
Map (num_proc=8): 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 75/75 [00:02<00:00, 34.06 examples/s]
Map (num_proc=8): 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 75/75 [00:02<00:00, 34.35 examples/s]
/home/roy/.local/lib/python3.10/site-packages/accelerate/accelerator.py:432: FutureWarning: Passing the following arguments to `Accelerator` is deprecated and will be removed in version 1.0 of Accelerate: dict_keys(['dispatch_batches', 'split_batches']). Please pass an `accelerate.DataLoaderConfiguration` instead: 
dataloader_config = DataLoaderConfiguration(dispatch_batches=None, split_batches=False)
  warnings.warn(
  0%|                                                                                                                                                               | 0/11 [00:00<?, ?it/s]/home/roy/.local/lib/python3.10/site-packages/torch/utils/checkpoint.py:460: UserWarning: torch.utils.checkpoint: please pass in use_reentrant=True or use_reentrant=False explicitly. The default value of use_reentrant will be updated to be False in the future. To maintain current behavior, pass use_reentrant=True. It is recommended that you use use_reentrant=False. Refer to docs for more details on the differences between the two variants.
  warnings.warn(
`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
{'loss': 8.7823, 'learning_rate': 0.0, 'epoch': 1.0}                                                                                                                                       
{'eval_loss': 8.63387680053711, 'eval_score': 5.109115280491206, 'eval_counts': [372, 125, 69, 42], 'eval_totals': [2223, 2148, 2073, 1998], 'eval_precisions': [16.734143049932523, 5.8193668528864055, 3.3285094066570187, 2.1021021021021022], 'eval_bp': 1.0, 'eval_sys_len': 2223, 'eval_ref_len': 1919, 'eval_runtime': 11.9198, 'eval_samples_per_second': 6.292, 'eval_steps_per_second': 0.839, 'epoch': 1.0}
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 11/11 [00:17<00:00,  2.25it/s]
Some non-default generation parameters are set in the model config. These should go into a GenerationConfig file (https://huggingface.co/docs/transformers/generation_strategies#save-a-custom-decoding-strategy-with-your-model) instead. This warning will be raised to an exception in v4.41.
Non-default generation parameters: {'max_length': 200}
There were missing keys in the checkpoint model loaded: ['model.encoder.embed_tokens.weight', 'model.decoder.embed_tokens.weight', 'lm_head.weight'].
{'train_runtime': 23.8632, 'train_samples_per_second': 14.667, 'train_steps_per_second': 0.461, 'train_loss': 8.782309792258523, 'epoch': 1.0}                                             
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 11/11 [00:23<00:00,  2.17s/it]

For context, I am in the current cohort of a Stanford course on NLU, and this is some code I am using to explore project ideas. I was hoping to use the Trainer API to save and load checkpoints, but I feel uncertain about whether it is safe to do so given the errors. I may end up downgrading to resolve it for now, but I have not tried that yet. Let me know if you have questions and if there's anything else I can provide. I ran the sample fine-tuning above locally on an RTX 4090 GPU.

vanguardapps avatar Mar 30 '24 18:03 vanguardapps

@vanguardapps it is usually very safe to use the Trainer, and pretty rare to have a bug. I don't know which version of transformers you are using, but it's better to update to the latest, same for safetensors! Other than that, try to push your model to the hub before and after and share the links here.

If you can find a smaller reproducer, that would be a lot better. Something like 20 lines of code with a small Trainer setup, that way we can help better!

ArthurZucker avatar Mar 30 '24 19:03 ArthurZucker

@ArthurZucker I upgraded per your recommendation to the latest versions of safetensors 0.4.2 and transformers 4.39.1 and the issue continues to happen. I will try to find time to boil the code down to a smaller reproducible code example. However, I think the essential thing that's happening is:

  • save_strategy="epoch"
  • when Trainer goes to save the model after one epoch even for the first time, this warning occurs (and worries me that the checkpoint is missing data).

I'll post here when I have a smaller code block. Thank you for your quick response.

vanguardapps avatar Mar 30 '24 19:03 vanguardapps

Thanks, that would help me know whether I should ping someone else, whether this is ignorable, or whether this is core modeling!

ArthurZucker avatar Mar 30 '24 19:03 ArthurZucker

@ArthurZucker I noticed something just now that probably explains why this is a rarer behavior to see. I am using load_best_model_at_end=True for my Seq2SeqTrainingArguments. When I switch this to load_best_model_at_end=False, the warning I was seeing disappears.

So it appears what is happening is, once it is done training an epoch (based on save_strategy="epoch"), it is then attempting to "load_best_model_at_end", and when it does that, some parameters are always said to be missing (since it is loading something that has not been fully saved yet). Something along these lines--I know that is fuzzy.

I would not sound the alarm. It appears to be localized to this one parameter to TrainingArguments, and may just be a minor related bug in Trainer. I am using Seq2SeqTrainer specifically.

I still plan to circle back with smaller code. Will need to hammer out this assignment for class first.

vanguardapps avatar Mar 30 '24 19:03 vanguardapps
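
The isolation above boils down to a single toggle in the training arguments (a sketch; all other arguments are as in the reproducer earlier in the thread):

args = Seq2SeqTrainingArguments(
    output_dir="repro",
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,    # warning appears when the best checkpoint is reloaded
    # load_best_model_at_end=False  # per the report above, the warning disappears
)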

Awesome, that is already a good isolation. cc @pacman100, @muellerzr and @SunMarc: when @vanguardapps shares the reproducer, please have a look! 🤗

ArthurZucker avatar Apr 01 '24 07:04 ArthurZucker

It happens even when load_best_model_at_end is not set at all, when resuming tuning from a checkpoint. Strangely, the error doesn't show when loading the checkpoint for inference.

MarcoDiMarek avatar Apr 08 '24 08:04 MarcoDiMarek

cc @muellerzr @SunMarc

amyeroberts avatar Jun 03 '24 14:06 amyeroberts

I don't have a reproducer to share unfortunately, but just wanted to mention that setting save_safetensors to False removed the warning for me.

mgrenander avatar Jun 04 '24 15:06 mgrenander
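
For reference, that workaround corresponds to a single training argument (a sketch; save_safetensors defaults to True in recent versions, and setting it to False falls back to torch .bin serialization):

args = Seq2SeqTrainingArguments(
    output_dir="repro",
    save_safetensors=False,  # reported above to remove the missing-keys warning
)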

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

github-actions[bot] avatar Jun 29 '24 08:06 github-actions[bot]

Sure

Aisuko avatar Jun 29 '24 11:06 Aisuko

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

github-actions[bot] avatar Jul 24 '24 08:07 github-actions[bot]