methylbert Issue with Tumor Fraction Prediction Using Pre-trained MethylBERT

As described in the data preparation tutorial, fine-tuning the MethylBERT model with pure tumor and normal samples is optional. So I used the pre-trained model from https://huggingface.co/hanyangii/methylbert_hg19_12l to predict plasma samples directly, but it failed to detect any tumor signal: the estimated tumor fraction was 0 in every sample. Did I choose the wrong pre-trained model, or do I have to fine-tune it with cancer and normal tissue samples to detect the tumor fraction in plasma samples?

Mar 27 '25 03:03 LiJingqi7

Dear @LiJingqi7

Thank you for your interest in MethylBERT.

Although the tutorial describes it as optional, in your case you need pure tumour and normal samples for fine-tuning. Reading our paper should help you understand the pipeline; it is described in more detail in the Method section.

Please let me know if the model still does not work for you after fine-tuning.

Mar 27 '25 09:03 hanyangii

Thank you for getting back to me. I have followed the fine-tuning process using pure tumor and normal samples as suggested. Specifically, I fine-tuned the model using liver cancer tissue samples and their matched normal tissue samples and then used it to predict the tumor fraction in plasma samples. However, after fine-tuning, the model still fails to detect tumor signals, with the predicted tumor fraction remaining at 0. Could you provide any insights into what might be causing this issue?

Apr 02 '25 03:04 LiJingqi7

Hello @LiJingqi7

Sorry for my late reply. This sounds weird to me. Can you share more information about your fine-tuned model?

- Training and validation accuracy
- Approximate number of reads in the training data and in a plasma sample

You can try the estimation without the adjustment option and see whether the result looks better. Depending on the quality of the selected DMRs, the adjustment option may hinder an accurate estimation.
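For the read counts asked about above, a minimal sketch in plain Python could look like the following. It is not part of MethylBERT; it assumes each input file holds one read per line with a single header line, and the helper name `count_reads` is hypothetical.

```python
from pathlib import Path


def count_reads(path, has_header=True):
    """Count reads in a text file, assuming one read per non-empty line.

    Assumption: the file has exactly one header line when has_header is True.
    """
    with Path(path).open() as handle:
        # Blank lines are skipped so trailing newlines do not inflate the count.
        n_lines = sum(1 for line in handle if line.strip())
    return max(n_lines - 1, 0) if has_header else n_lines


# Example usage (file names are placeholders for your own data):
# print(count_reads("train_seq.csv"))
# print(count_reads("plasma_sample.csv"))
```

Reporting these numbers per file makes it easy to compare the size of the training set with the number of reads available in each plasma sample.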

Apr 17 '25 06:04 hanyangii

Dear @hanyangii,

A total of 23 paired normal and liver cancer tissue samples were used to fine-tune the model. test_seq.csv contains 240,000 reads, and train_seq.csv contains 800,000 reads. Details of the trained model are provided in the files listed below. Approximately 3265 reads from plasma samples (~3x) were used as input for prediction. Could you please help me check what might be causing the inaccuracy in tumor fraction prediction? The results are shown in the attached merged_deconvolution.csv file.

fine-tuned model eval.csv
fine-tuned model train.csv
train_param.txt

merged_deconvolution.csv

Jun 03 '25 04:06 LiJingqi7