ModernBertPreTrainedModel for word-level prediction tasks
Hi there! I am trying to fine-tune ModernBertPreTrainedModel for the task of predicting each word in a sentence.
I wrote a custom model class, but when it inherits from ModernBertPreTrainedModel and the weights are loaded with the from_pretrained method, the performance is completely different from when the class inherits directly from nn.Module (the latter is better).
However, I have not seen this problem with other BERT-based models on the same task. This has troubled me for some time, and I still can't figure out why it happens.
Thanks!
Hello,
If I get it right, you should be able to use ModernBertForMaskedLM for your task. I do not really know how you are doing the loading part, but maybe you are not properly loading the decoding head when you use ModernBertPreTrainedModel, whereas you do when using nn.Module?
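For example, something along these lines (just a sketch; I am assuming the public answerdotai/ModernBERT-base checkpoint here):

from transformers import AutoTokenizer, ModernBertForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("answerdotai/ModernBERT-base")
# from_pretrained loads the pretrained encoder together with its MLM decoding head
model = ModernBertForMaskedLM.from_pretrained("answerdotai/ModernBERT-base")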
@NohTow Thanks for your reply! Sorry for not making it clear. Predicting the label of each word in the sentence is essentially sequence labeling, which is why I am using ModernBertPreTrainedModel. I'll check whether the decoding head is loaded correctly then.
@NohTow
model = ModernBert.from_pretrained(args.model_path, config=model_config, attn_implementation="flash_attention_2")
class CustomizedModel(ModernBertPreTrainedModel):
    def __init__(self, config):
        super(CustomizedModel, self).__init__(config)
        self.encoder = ModernBertModel(config=config)
        ...........
        self.init_weights()
model = CustomizedModel(args.model_path, config=model_config..................)
class CustomizedModel(nn.Module):
    def __init__(self, model_path, config................):
        super(CustomizedModel, self).__init__()
        self.encoder = ModernBertModel.from_pretrained(
            model_path, config=config, attn_implementation="flash_attention_2")
I found that after loading the model in these two ways, the weights of all layers except the norm layers are almost completely different. I don't know if there is a problem with my loading method... Thanks!
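A rough way to see the difference (just a sketch; model_a / model_b are placeholder names for the ModernBertPreTrainedModel variant and the nn.Module variant above):

import torch

# compare the encoder weights produced by the two loading paths
sd_a = model_a.encoder.state_dict()
sd_b = model_b.encoder.state_dict()
for name in sd_a:
    if not torch.allclose(sd_a[name], sd_b[name]):
        print("mismatch:", name)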
From the code I am reading, you are loading the weights correctly when doing
self.encoder = ModernBertModel.from_pretrained(
    model_path, config=config, attn_implementation="flash_attention_2")
The self.init_weights() method is used to randomly initialize the weights, for example if you want to do pretraining.
The following code should work fine while having a class inheriting from ModernBertPreTrainedModel:
class CustomizedModel(ModernBertPreTrainedModel):
    def __init__(self, model_path, config................):
        super(CustomizedModel, self).__init__(config)
        self.encoder = ModernBertModel.from_pretrained(
            model_path, config=config, attn_implementation="flash_attention_2")
I think you can even do self.encoder = super().from_pretrained(model_path, config=config, attn_implementation="flash_attention_2") in the class inheriting from ModernBertPreTrainedModel. Also, just keep in mind that it won't load the decoding head, just the encoder itself.
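With that version, I think you would then instantiate the class directly rather than calling from_pretrained on the custom class, since the encoder weights are already loaded inside __init__. Roughly:

model = CustomizedModel(args.model_path, config=model_config)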
Sorry, the code I gave above has some flaws. It actually is:
model = CustomizedModel.from_pretrained(args.model_path, config=model_config, attn_implementation="flash_attention_2")
class CustomizedModel(ModernBertPreTrainedModel):
    def __init__(self, config):
        super(CustomizedModel, self).__init__(config)
        self.encoder = ModernBertModel(config=config)
        ...........
        self.init_weights()
That is because I usually use self.encoder = ModernBertModel(config=config) inside the custom class, and call model = xxx.from_pretrained() outside of the custom class. Maybe the randomly initialized weights (after setting the random seed) affect the training of the model? I'm not sure. Anyway, thank you very much for your answer, it really helped me a lot.
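To double-check which parameters end up randomly initialized rather than loaded from the checkpoint, I will try inspecting the loading info (a sketch, assuming output_loading_info works here as usual):

model, loading_info = CustomizedModel.from_pretrained(
    args.model_path, config=model_config, output_loading_info=True)
# parameters present in the class but missing from the checkpoint are randomly initialized
print(loading_info["missing_keys"])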
@NohTow Hi there, I think I found the problem. When I load the model through ModernBertPreTrainedModel and add a custom classifier layer, I find that its initial weights end up as zeros or uninitialized-looking values.
class ModernBertSoftmax(ModernBertPreTrainedModel):
    def __init__(self, config):
        super(ModernBertSoftmax, self).__init__(config)
        self.model = ModernBertModel(config)
        self.num_labels = config.num_labels
        self.dropout = nn.Dropout(config.attention_dropout)
        self.classifier = nn.Sequential(
            nn.Linear(config.hidden_size * 2, config.hidden_size),
            nn.Linear(config.hidden_size, config.num_labels),
        )
        self.loss_fn = CrossEntropyLoss......
        self.init_weights()
param:classifier.0.weight
tensor([[0.0000e+00, 0.0000e+00, 1.8538e-40, ..., 0.0000e+00, 0.0000e+00, 0.0000e+00],
        [0.0000e+00, 0.0000e+00, 0.0000e+00, ..., 0.0000e+00, 0.0000e+00, 0.0000e+00],
        [0.0000e+00, 0.0000e+00, 0.0000e+00, ..., 0.0000e+00, 0.0000e+00, 0.0000e+00],
        ...,
        [0.0000e+00, 0.0000e+00, 0.0000e+00, ..., 0.0000e+00, 0.0000e+00, 0.0000e+00],
        [0.0000e+00, 0.0000e+00, 0.0000e+00, ..., 0.0000e+00, 0.0000e+00, 0.0000e+00],
        [0.0000e+00, 0.0000e+00, 0.0000e+00, ..., 0.0000e+00, 0.0000e+00, 0.0000e+00]], requires_grad=True)

param:classifier.0.bias
tensor([5.5825e-33, 0.0000e+00, 1.5203e-41, 0.0000e+00, 4.3774e+15, 4.5776e-41,
        4.3774e+15, 4.5776e-41, 5.2659e-33, 0.0000e+00, 5.2659e-33, 0.0000e+00,
        0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 5.5825e-33, 0.0000e+00,
        4.5464e-34, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00,
        ................................................
But I don't understand what is causing this problem, as I haven't encountered it before.
I am not sure you should call self.init_weights().
This is meant for initializing the weights to do pretraining; just creating the layers should be enough.
@NohTow Thank you for your answer. This problem is not related to self.init_weights(). Maybe self.classifier is not being initialized correctly by .from_pretrained()?
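One thing I am going to try is inspecting the head right after loading and, if the values look uninitialized, re-initializing it by hand (just a rough sketch, and I am not sure this is the intended way; model here is the ModernBertSoftmax instance and std=0.02 is an arbitrary choice):

from torch import nn

# inspect the custom head right after from_pretrained
for name, param in model.classifier.named_parameters():
    print(name, param.abs().max().item())

# manually re-initialize the custom head as a workaround
for module in model.classifier:
    if isinstance(module, nn.Linear):
        nn.init.normal_(module.weight, mean=0.0, std=0.02)
        nn.init.zeros_(module.bias)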