
ModernBertPreTrainedModel for word-level prediction tasks

Open QXGeraldMo opened this issue 1 year ago • 8 comments

Hi there! I am trying to fine-tune a model based on ModernBertPreTrainedModel for the task of predicting each word in a sentence.

I customized a model, but I found that when the model class inherits from ModernBertPreTrainedModel and loads weights with the from_pretrained method, its performance is completely different from when the model class inherits directly from nn.Module (the latter is better).

However, I did not see this problem with other categories of BERT-based models on the same task. This has troubled me for some time, and I still can't figure out why it happens.

Thanks!

QXGeraldMo avatar Jan 15 '25 03:01 QXGeraldMo

Hello,

If I get it right, you should be able to use ModernBertForMaskedLM for your task. I do not really know how you are doing the loading, but maybe you are not properly loading the decoding head when you use ModernBertPreTrainedModel, whereas you do when using nn.Module?

NohTow avatar Jan 15 '25 15:01 NohTow

@NohTow Thanks for your reply! Sorry for not making it clear. Predicting the label of each word in the whole sentence is just like sequence labeling; that's why I am using ModernBertPreTrainedModel. I'll check whether the decoder head is loaded correctly then.

QXGeraldMo avatar Jan 16 '25 03:01 QXGeraldMo
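
For a sequence-labeling setup like this, a built-in token-classification head may already work out of the box. A minimal sketch, assuming the installed transformers version ships ModernBertForTokenClassification (the checkpoint name and num_labels below are placeholders):

from transformers import AutoTokenizer, ModernBertForTokenClassification

model_path = "answerdotai/ModernBERT-base"  # placeholder checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = ModernBertForTokenClassification.from_pretrained(model_path, num_labels=9)

inputs = tokenizer("ModernBERT predicts a label per token.", return_tensors="pt")
logits = model(**inputs).logits  # shape: (batch_size, seq_len, num_labels)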

@NohTow

# Way 1: the custom class inherits ModernBertPreTrainedModel
model = ModernBert.from_pretrained(args.model_path, config=model_config, attn_implementation="flash_attention_2")

class CustomizedModel(ModernBertPreTrainedModel):
    def __init__(self, config):
        super(CustomizedModel, self).__init__(config)
        self.encoder = ModernBertModel(config=config)
        # ...
        self.init_weights()

# Way 2: the custom class inherits nn.Module
model = CustomizedModel(args.model_path, config=model_config, ...)

class CustomizedModel(nn.Module):
    def __init__(self, model_path, config, ...):
        super(CustomizedModel, self).__init__()
        self.encoder = ModernBertModel.from_pretrained(
            model_path, config=config, attn_implementation="flash_attention_2")

I found that after loading the model in these two ways, the weights of all layers except the norm layers are almost completely different. I don't know if there is some problem with my loading method. Thanks!

QXGeraldMo avatar Jan 16 '25 06:01 QXGeraldMo
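
One quick way to narrow this down is to compare the encoder state dicts produced by the two loading paths directly. A minimal sketch, assuming model_a is the ModernBertPreTrainedModel-based instance and model_b the nn.Module-based one (both variable names are hypothetical):

import torch

sd_a = model_a.encoder.state_dict()
sd_b = model_b.encoder.state_dict()

# Print every encoder parameter whose values differ between the two loads.
for name, tensor_a in sd_a.items():
    tensor_b = sd_b[name].to(dtype=tensor_a.dtype, device=tensor_a.device)
    if not torch.allclose(tensor_a, tensor_b, atol=1e-6):
        print(f"mismatch: {name}")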

From the code I am reading, you are loading the weights correctly when doing

self.encoder = ModernBertModel.from_pretrained(
            model_path, config=config, attn_implementation="flash_attention_2")

The self.init_weights() method is used to randomly initialize the weights, for example if you want to do pretraining. The following code should work fine while having a class inheriting ModernBertPreTrainedModel:

class CustomizedModel(ModernBertPreTrainedModel):
    def __init__(self, model_path, config, ...):
        super(CustomizedModel, self).__init__(config)
        self.encoder = ModernBertModel.from_pretrained(
            model_path, config=config, attn_implementation="flash_attention_2")

I think you can even do self.encoder = super().from_pretrained(model_path, config=config, attn_implementation="flash_attention_2") in the class inheriting from ModernBertPreTrainedModel. Also, just keep in mind that it won't load the decoding head, just the encoder itself.

NohTow avatar Jan 16 '25 08:01 NohTow
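
For reference, the built-in heads (e.g. ModernBertForSequenceClassification) use a slightly different pattern: build the backbone from the config inside __init__, call the post-init hook, and load the checkpoint through the custom class itself. A minimal sketch, assuming the backbone attribute is named model so its parameter names line up with the checkpoint keys (the class name, checkpoint, and num_labels are placeholders):

import torch.nn as nn
from transformers import ModernBertModel, ModernBertPreTrainedModel

class TokenTagger(ModernBertPreTrainedModel):
    def __init__(self, config):
        super().__init__(config)
        # Naming the backbone "model" matches the checkpoint keys, so
        # from_pretrained can fill it with the pretrained weights.
        self.model = ModernBertModel(config)
        # New head; see the later comments about how from_pretrained
        # handles layers that are not in the checkpoint.
        self.classifier = nn.Linear(config.hidden_size, config.num_labels)
        self.post_init()

    def forward(self, input_ids, attention_mask=None):
        hidden = self.model(input_ids=input_ids, attention_mask=attention_mask).last_hidden_state
        return self.classifier(hidden)

model = TokenTagger.from_pretrained("answerdotai/ModernBERT-base", num_labels=9)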

Sorry, the code I gave above had some flaws; it should be:

model = CustomizedModel.from_pretrained(args.model_path, config=model_config, attn_implementation="flash_attention_2")

class CustomizedModel(ModernBertPreTrainedModel):
    def __init__(self, config):
        super(CustomizedModel, self).__init__(config)
        self.encoder = ModernBertModel(config=config)
        # ...
        self.init_weights()

Because I usually use self.encoder = ModernBertModel(config=config) in the custom class and do model = xxx.from_pretrained() outside of the custom class. Maybe the randomly initialized weights (after setting the random seed) affect the training of the model? I'm not sure. Anyway, thank you very much for your answer, it really helped me a lot.

QXGeraldMo avatar Jan 16 '25 09:01 QXGeraldMo

@NohTow Hi there, I think I found the problem. When I loaded the model through ModernBertPreTrainedModel and added a custom classifier layer, I found that the weights that should have been randomly initialized were instead mostly zeros (or garbage values):

import torch.nn as nn
from torch.nn import CrossEntropyLoss
from transformers import ModernBertModel, ModernBertPreTrainedModel

class ModernBertSoftmax(ModernBertPreTrainedModel):
    def __init__(self, config):
        super(ModernBertSoftmax, self).__init__(config)
        self.model = ModernBertModel(config)
        self.num_labels = config.num_labels
        self.dropout = nn.Dropout(config.attention_dropout)
        self.classifier = nn.Sequential(
            nn.Linear(config.hidden_size * 2, config.hidden_size),
            nn.Linear(config.hidden_size, config.num_labels),
        )
        self.loss_fn = CrossEntropyLoss()  # ...
        self.init_weights()

param: classifier.0.weight
tensor([[0.0000e+00, 0.0000e+00, 1.8538e-40, ..., 0.0000e+00, 0.0000e+00, 0.0000e+00],
        [0.0000e+00, 0.0000e+00, 0.0000e+00, ..., 0.0000e+00, 0.0000e+00, 0.0000e+00],
        [0.0000e+00, 0.0000e+00, 0.0000e+00, ..., 0.0000e+00, 0.0000e+00, 0.0000e+00],
        ...,
        [0.0000e+00, 0.0000e+00, 0.0000e+00, ..., 0.0000e+00, 0.0000e+00, 0.0000e+00],
        [0.0000e+00, 0.0000e+00, 0.0000e+00, ..., 0.0000e+00, 0.0000e+00, 0.0000e+00],
        [0.0000e+00, 0.0000e+00, 0.0000e+00, ..., 0.0000e+00, 0.0000e+00, 0.0000e+00]],
       requires_grad=True)

param: classifier.0.bias
tensor([5.5825e-33, 0.0000e+00, 1.5203e-41, 0.0000e+00, 4.3774e+15, 4.5776e-41,
        4.3774e+15, 4.5776e-41, 5.2659e-33, 0.0000e+00, 5.2659e-33, 0.0000e+00,
        0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 5.5825e-33, 0.0000e+00,
        4.5464e-34, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00,
        ...])

But I don't understand what is causing this problem, as I haven't encountered it before.

QXGeraldMo avatar Feb 07 '25 07:02 QXGeraldMo

I am not sure you should call self.init_weights(). This is meant for initializing the weights to do the pretraining; just creating the layers should be enough.

NohTow avatar Feb 07 '25 08:02 NohTow

@NohTow Thank you for your answer. This problem is not related to self.init_weights(). Maybe self.classifier is not initialized correctly by .from_pretrained()?

QXGeraldMo avatar Feb 07 '25 09:02 QXGeraldMo
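
If the custom classifier really does end up with zero or garbage values after from_pretrained, one defensive workaround is to re-run PyTorch's default initialization on those layers right after loading. A minimal sketch, reusing ModernBertSoftmax and the arguments from the comments above (the helper function name is hypothetical):

import torch.nn as nn

def reset_plain_linears(module):
    # Re-run PyTorch's default init for plain Linear layers, in case
    # from_pretrained left them with uninitialized memory.
    if isinstance(module, nn.Linear):
        module.reset_parameters()

model = ModernBertSoftmax.from_pretrained(args.model_path, config=model_config)
model.classifier.apply(reset_plain_linears)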