Fine-tuning the ProtBert model for a masked language modelling task
I want to fine-tune the ProtBert model on some protein data for a masked language modelling task: given a protein sequence "ACDRFG", I replace "D" with the mask token and try to predict it. I am not able to figure this out from the GitHub code provided. Can anyone help with it?
Hm, there are a bunch of fine-tuning tutorials for BERT on Hugging Face which should work for you: https://huggingface.co/course/chapter7/3?fw=tf
What you probably need to do: split your sequence into our notion of tokens, e.g. "ACDRFG" should become "A C D R F G". In a next step you probably want to replace some input tokens with the [MASK] special token (check the original BERT publication for details: 15% of the tokens are selected; most are replaced by [MASK], some are kept unchanged, and some are replaced by a random other amino acid; you probably want to stick to this logic). So you should end up with input that looks like this: "A C [MASK] R F G". Then you retrieve the output of the LM head, feed it to the loss function, and backprop the loss as shown in the tutorial linked above.
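A minimal sketch of what this could look like with the Hugging Face transformers API, assuming the Rostlab/prot_bert checkpoint and the built-in DataCollatorForLanguageModeling (which implements the BERT-style masking scheme with a 15% probability); this is not the original training code, just an illustration:

```python
# MLM fine-tuning sketch for ProtBert -- assumes the Rostlab/prot_bert checkpoint
# and transformers' built-in masking collator; adapt to your own data pipeline.
import torch
from transformers import BertForMaskedLM, BertTokenizer, DataCollatorForLanguageModeling

tokenizer = BertTokenizer.from_pretrained("Rostlab/prot_bert", do_lower_case=False)
model = BertForMaskedLM.from_pretrained("Rostlab/prot_bert")

# ProtBert expects single amino acids separated by whitespace.
sequence = "A C D R F G"
encoding = tokenizer(sequence, return_tensors="pt")

# The collator masks ~15% of tokens following the original BERT recipe
# (mostly [MASK], some random tokens, some left unchanged).
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True, mlm_probability=0.15)
batch = collator([{"input_ids": encoding["input_ids"][0]}])

# The LM head output is scored against the masked positions only.
outputs = model(input_ids=batch["input_ids"], labels=batch["labels"])
loss = outputs.loss
loss.backward()  # backprop the loss as in the linked tutorial
```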
@mheinzinger Thanks for the super quick reply.
- I could not find the ProtBert model at the above-mentioned URL.
- Actually, I checked the performance of the pretrained ProtBert model on some protein sequences and it looks to be working great. So I want to fine-tune ProtBert in particular on different protein data. Everywhere, fine-tuning of the ProtBert model is only described for downstream tasks (which is not my objective).
- Alternatively, where can I find the architecture of the ProtBert model in detail so that I can tweak it for my case? Please advise.
- Yup, that's a general Hugging Face tutorial, but you can use it with the minor modifications I mentioned above without any problems.
- I am not aware of many people fine-tuning ProtBERT. It gives good performance when used solely as a feature extractor. But sure: give it a shot.
- The architecture is exactly as described in ProtTrans. For details on the specifics of BERT, please check the corresponding paper.
Last but not least: I would always use ProtT5 rather than ProtBERT. In our hands, ProtT5 significantly outperformed ProtBERT in all benchmarks.
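For the feature-extraction route mentioned above, a minimal embedding-extraction sketch with the ProtT5 encoder might look like the following (assuming the Rostlab/prot_t5_xl_half_uniref50-enc checkpoint; substitute whichever checkpoint you actually use):

```python
# Per-residue embedding extraction with the ProtT5 encoder -- a sketch, assuming
# the Rostlab/prot_t5_xl_half_uniref50-enc checkpoint; adapt names to your setup.
import re
import torch
from transformers import T5EncoderModel, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("Rostlab/prot_t5_xl_half_uniref50-enc", do_lower_case=False)
model = T5EncoderModel.from_pretrained("Rostlab/prot_t5_xl_half_uniref50-enc").eval()

sequence = "ACDRFG"
# Map rare/ambiguous amino acids to X and insert whitespace between residues.
spaced = " ".join(re.sub(r"[UZOB]", "X", sequence))
encoding = tokenizer(spaced, return_tensors="pt")

with torch.no_grad():
    embeddings = model(**encoding).last_hidden_state  # (1, len(sequence)+1, hidden) incl. </s>
per_residue = embeddings[0, :len(sequence)]            # drop the trailing special token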
@mheinzinger I was not able to find the training code for the ProtBert model. For example, it is mentioned that a total of 30 layers are used in the ProtBert architecture, but what are those layers? That is what I want to look at. Can you help me with that? Mainly, I want to fine-tune that BERT architecture on my protein data. Please advise.
Those are attention layers. We just used the standard BERT architecture and did not modify it at all. All we did was set hyperparameters (such as the number of layers). Beyond that, I think this fine-tuning notebook should have all you need. You just need to replace the task with your use case (and probably modify/mask the input accordingly if you really want to go for MLM):
https://github.com/agemagician/ProtTrans/issues/74#issuecomment-1120174837
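If you just want to see the layers concretely, you can inspect the published checkpoint's config and module tree directly; a small sketch, again assuming the Rostlab/prot_bert checkpoint:

```python
# Inspect the ProtBert architecture via its Hugging Face config and module tree.
from transformers import BertConfig, BertForMaskedLM

config = BertConfig.from_pretrained("Rostlab/prot_bert")
print(config.num_hidden_layers)                         # the 30 attention layers mentioned above
print(config.hidden_size, config.num_attention_heads)   # remaining hyperparameters

model = BertForMaskedLM.from_pretrained("Rostlab/prot_bert")
print(model)  # full module tree: embeddings, the encoder layers, and the MLM head
```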
@mheinzinger While fine-tuning the ProtBert model on my dataset, I am facing some small issues. Can I get the training code so that my confusion is cleared up?
Sorry, I cannot provide any more details beyond redirecting you to the existing/published notebooks/tutorials that show how to fine-tune ProtBERT. Nevertheless, good luck with your project! I am sure you'll be able to resolve these small issues yourself :)