
checkpoints for further pretraining

Open ZanezZephyrs opened this issue 1 year ago • 8 comments

Hi, congratulations on the awesome work.

I was planning to further pretrain your models in other languages, since your version focuses mainly on English. However, I am having trouble starting a training run (from main.py) that initializes the model weights with the ones from modernbert-base. Are the Composer checkpoints available somewhere? Or is it possible to start a Composer training run from the weights on Hugging Face in some way? I would greatly appreciate any guidance on this.

ZanezZephyrs avatar Dec 22 '24 01:12 ZanezZephyrs

Hello,

We are also planning to release intermediate checkpoints (which could be more appropriate for your needs, especially the pre-decay ones) in early January (right now the team is resting a bit and scattered due to vacations).

The HF checkpoints are derived from the Composer checkpoints through a conversion function (write_huggingface_pretrained_from_composer_checkpoint), but I am not sure this is properly doable the other way around (especially for training states). I will have a look at whether it is doable and, if it is not, also release the Composer checkpoints one way or another!
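To give an idea of what such a reverse conversion would involve, here is a rough sketch; it is purely illustrative and not something we have validated. It assumes the Composer wrapper stores the Hugging Face module's parameters under a "model." key prefix (the prefix the forward conversion presumably strips), and the file it writes would only carry weights, none of the optimizer, scheduler, dataloader, or RNG state a full Composer checkpoint contains.

# Hypothetical sketch: build a weights-only, Composer-style checkpoint from the HF weights.
# The "model." prefix and the {"state": {"model": ...}} layout are assumptions; verify them
# against a real Composer checkpoint before relying on this.
import torch
from transformers import AutoModelForMaskedLM

hf_model = AutoModelForMaskedLM.from_pretrained("answerdotai/ModernBERT-base")

# Re-prefix HF parameter names to match the (assumed) Composer wrapper layout.
composer_state = {f"model.{k}": v for k, v in hf_model.state_dict().items()}

# Minimal structure; real Composer checkpoints also store optimizer, scheduler,
# dataloader, and RNG state, which cannot be recovered from the HF weights alone.
torch.save({"state": {"model": composer_state}}, "modernbert-base-weights-only.pt")

Even if such a file loaded, it could at best serve as a weights-only starting point; none of the training state needed for a faithful resume would be present.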

NohTow avatar Dec 22 '24 10:12 NohTow

Hi @NohTow, thanks for the amazing contribution. Any updates on this?

amishparekh avatar Jan 02 '25 09:01 amishparekh

Hey,

Unfortunately no; as mentioned in the previous answer, the team took Christmas vacations after the release and most of us are still on vacation. But rest assured that releasing all the checkpoints is planned, and we'll make sure it is a priority when we come back. Sorry for the delay.

NohTow avatar Jan 02 '25 09:01 NohTow

@NohTow, that's perfectly fine. Thanks for the update.

amishparekh avatar Jan 02 '25 21:01 amishparekh

Hey @NohTow, thanks for the great work. Is there an update on the timeline for when the checkpoints will be released? :)

raphaelreimann avatar Jan 13 '25 13:01 raphaelreimann

Hello,

We just had a meeting where we discussed the different things to do to enable reproduction and further pre-training, which include releasing the Composer checkpoints and configs, making sure everything runs smoothly, and adding proper documentation for people to follow.

This should be done soon, I am once again sorry for the delay, we had a lot of things to do lately!

NohTow avatar Jan 13 '25 15:01 NohTow

Hi, it would be awesome if you could release (at least) the final checkpoints for the versions of the model already on Hugging Face, so I could use the code for further fine-tuning!

kamilelukosiute avatar Feb 12 '25 21:02 kamilelukosiute

We uploaded the Composer training checkpoints for both ModernBERT-base and ModernBERT-large to Hugging Face this week. Will add instructions on how to use them over the next few days.
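In the meantime, a checkpoint file can be fetched with huggingface_hub; here is a minimal sketch (the repo_id and filename are just placeholders, use the names from the actual checkpoint repositories):

# Sketch: download a single Composer training checkpoint file from the Hugging Face Hub.
# repo_id and filename are placeholders; substitute the actual repository and file names.
from huggingface_hub import hf_hub_download

ckpt_path = hf_hub_download(
    repo_id="answerdotai/ModernBERT-base",  # placeholder repo id
    filename="ep0-ba52988-rank0.pt",        # placeholder checkpoint filename
)
print(ckpt_path)  # local path to point load_path at in the training yaml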

warner-benjamin avatar Feb 13 '25 00:02 warner-benjamin

I'm attempting to use these checkpoints as Composer training checkpoints, but Composer hangs when loading them.

My yaml file has: load_path: /workspace/checkpoints/modernbert_large_context_extension/ep0-ba49552-rank0.pt

I downloaded that checkpoint from the link above for ModernBERT-large.

Then I run $ composer main.py my-config.yaml

I set the logging level of the Composer Trainer to DEBUG, and I see it hangs here during instantiation of the Trainer:

2025-12-04 16:53:52,619: rank0[10034][MainThread]: INFO: composer.trainer.trainer: Stepping schedulers every batch. To step schedulers every epoch, set `step_schedulers_every_batch=False`.
2025-12-04 16:53:52,620: rank0[10034][MainThread]: DEBUG: composer.utils.checkpoint: Loading checkpoint at /workspace/checkpoints/modernbert_large_context_extension/ep0-ba49552-rank0.pt

I set the timeout to 1 hr, and the model loading never completes. Eventually, I get a timeout error.

Is this file in the wrong format from which to load a checkpoint in Composer?

My goal is simply to continue pre-training the ModernBERT-large model with a domain-specific corpus. I thought the recommended approach for doing this was using the codebase in this repo.

jmcmanus15 avatar Dec 04 '25 22:12 jmcmanus15

I figured out my issue. The checkpoints cited above are not suitable for the composer Trainer, even if one sets load_weights_only to True.

There are two distinct code paths for restoring from a previous checkpoint, with corresponding options that enable them in the yaml configs (init_from_checkpoint and load_path).

One must use the init_from_checkpoint mechanism, although this requires fixing a small bug in the code. (The bug is a "<=" operator that should be "<" in an assertion.)

I found it necessary to inspect the contents of the checkpoint and to read the code carefully. What little documentation is available here is just as likely to mislead as it is to help.
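For reference, inspecting the checkpoint is just a matter of loading it with torch and looking at the top-level keys; a minimal sketch (using my checkpoint path from above) looks like this:

# Sketch: peek at the structure of a checkpoint before wiring it into a config.
import torch

ckpt = torch.load(
    "/workspace/checkpoints/modernbert_large_context_extension/ep0-ba49552-rank0.pt",
    map_location="cpu",
    weights_only=False,  # Composer checkpoints contain more than bare tensors
)
print(type(ckpt))
if isinstance(ckpt, dict):
    print(list(ckpt.keys()))
    # A full Composer checkpoint should have a "state" entry holding the model,
    # optimizer, scheduler, and dataloader state alongside the RNG state.
    if "state" in ckpt:
        print(list(ckpt["state"].keys()))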

jmcmanus15 avatar Dec 05 '25 23:12 jmcmanus15

> There are two distinct code paths for restoring from a previous checkpoint, with corresponding options that enable them in the yaml configs (init_from_checkpoint and load_path).

Hey,

Both options are for different things: init_from_checkpoint was created to initialize large from base with tiling, while load_path should be usable without issue. Your issue of idling is probably due to dataloader spinning; see #246.

As your goal is to continue training on another dataset, you do not have to spin the dataloader. You can use:

load_path: checkpoints/modernbert-base-context-extension/context-extension/ep0-ba52988-rank0.pt
autoresume: false
reset_time: true # restarts the scheduler, dataloaders, etc from step zero
restart_override: true # resets optimizer hyperparameters (LR, WD, etc), LR Scheduler, and training microbatch size from the checkpoint's values

This blog post gives a lot of detail on how to successfully continue training from those checkpoints.

NohTow avatar Dec 10 '25 09:12 NohTow