
Understanding contents of the final checkpoint file

man-shar opened this issue 2 years ago · 1 comment

Hello,

I managed to complete a Llama3 LoRA training run with torchtune, after which I got these logs:

INFO:torchtune.utils.logging:Model checkpoint of size 16.06 GB saved to /tmp/Meta-Llama-3-8B-Instruct/meta_model_0.pt
INFO:torchtune.utils.logging:Adapter checkpoint of size 0.21 GB saved to /tmp/Meta-Llama-3-8B-Instruct/adapter_0.pt

My ultimate goal is to be able to use this with HF Auto* classes. (I realise this was discussed in some depth in this issue.) I went through the docs and saw this:

The final trained weights are merged with the original model and split across two checkpoint files similar to the source checkpoints from the HF Hub (see the LoRA Tutorial for more details). In fact the keys will be identical between these checkpoints. We also have a third checkpoint file which is much smaller in size and contains the learnt LoRA adapter weights. For this tutorial, we’ll only use the model checkpoints and not the adapter weights.

I've only worked with the HF APIs so far. Does "The final trained weights are merged with the original model" mean the original model combined with the adapter weights, i.e. what we would get with HF's merge_and_unload? Or is it something else?

Thank you!

man-shar avatar Apr 26 '24 07:04 man-shar

Thanks for opening this issue!

> Does "The final trained weights are merged with the original model" mean the original model combined with the adapter weights?

Yes, this is right: before writing the checkpoint, we merge the LoRA weights back into the base model and then output the checkpoints. You shouldn't need the adapter weights for anything other than resuming training. We're working on adding compatibility with HF PEFT and will share more on this soon.
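For anyone else landing here, the merge described above follows the standard LoRA parametrization: the merged weight is the base weight plus the scaled product of the two low-rank adapter matrices, W = W0 + (alpha / rank) * (B @ A). A minimal sketch of that arithmetic (using NumPy for illustration; the function and variable names here are illustrative, not torchtune's internal API):

```python
import numpy as np

def merge_lora(w0, lora_a, lora_b, alpha, rank):
    """Merge a LoRA adapter back into a base weight matrix.

    Standard LoRA parametrization: W = W0 + (alpha / rank) * (B @ A).
    This is a toy sketch, not torchtune's actual checkpointing code.
    """
    return w0 + (alpha / rank) * (lora_b @ lora_a)

# Toy example: a 4x4 base weight with a rank-2 adapter.
rng = np.random.default_rng(0)
w0 = rng.standard_normal((4, 4))
a = rng.standard_normal((2, 4))   # lora_a: (rank, in_features)
b = rng.standard_normal((4, 2))   # lora_b: (out_features, rank)

merged = merge_lora(w0, a, b, alpha=16, rank=2)
```

Because the merged matrix has the same shape and key as the base weight, the resulting checkpoint can be loaded exactly like the original model, with no PEFT wrapper required. This is also why the adapter file is only needed for resuming training.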

kartikayk avatar Apr 26 '24 19:04 kartikayk

Understood, thanks much! Will close this now.

man-shar avatar Apr 28 '24 18:04 man-shar