
About read classification of methylbert

Open guanghaoli opened this issue 9 months ago • 8 comments

Hi,

I'm trying to apply MethylBERT to my WGBS data, but some problems occurred. I fine-tuned the methylbert_hg19_4l model and used it to distinguish tumor from normal sequences in a test dataset through the read_classification function. However, the prediction results (column name "pred") are all 0, even though the numbers of real tumor and normal sequences are both more than 20,000.

The details of the fine-tuned model and test results are shown below. Could you please help me solve the problem?

train.csv of the fine-tuned model:

step  loss                  ctype_acc  lr
0     0.5453637838363647    0.93       3.3333333333333335e-05
1     0.5532763600349426    0.95       6.666666666666667e-05
2     0.13309021294116974   1.0        0.0001
3     0.06435760110616684   1.0        0.0001
4     0.046859435737133026  1.0        0.0001
5     0.0403900146484375    1.0        0.0001
6     0.03647613525390625   1.0        0.0001
7     0.03456985577940941   1.0        0.0001
8     0.03329452499747276   1.0        0.0001
9     0.03195953369140625   1.0        0.0001

The loss of the model is small, and the cell-type accuracy is 1.0.

Tests for the fine-tuned model:

test_pre, logit = trainer.read_classification(test_loader, tokenizer, logit=True)
test_pre["ctype"].value_counts()

ctype
T    75747
N    24252
Name: count, dtype: int64

test_pre['ctype_label'].value_counts()

ctype_label
0    99999
Name: count, dtype: int64

test_pre['pred'].value_counts()

pred
0    99999
Name: count, dtype: int64

The numbers of tumor and normal sequences are 75,747 and 24,252, respectively. However, the two columns generated by read_classification, "pred" and "ctype_label", are both all 0.

guanghaoli avatar Apr 29 '25 00:04 guanghaoli

Hi, I am also facing the same issue. I was wondering if you have found a solution to this problem. I would greatly appreciate it if you could share any insights or suggestions. Thank you very much!

Zhaooooooooooo avatar May 06 '25 03:05 Zhaooooooooooo

Hi, I just identified the source of the problem. You can check the content of each batch after running MethylBertFinetuneDataset — you’ll notice that there is an additional column called ctype_label. This label is generated by the _parse_line function inside the dataset class, specifically through the line:

l["ctype_label"] = int(l["ctype"] == l["dmr_ctype"])

This means ctype_label will be 1 only when ctype and dmr_ctype are the same; otherwise, it will be 0. So if the dmr_ctype field is empty or inconsistent in your training data, ctype_label will always be 0, which renders it meaningless during training.

A reasonable workaround would be to assign "T" to all dmr_ctype values in your training data. This way, the generated binary ctype_label will correctly correspond to the original ctype field.
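As a minimal sketch of this workaround (the column names ctype / dmr_ctype follow the label rule quoted above; the toy DataFrame stands in for your actual fine-tuning input file):

```python
import pandas as pd

# Toy reads table standing in for the fine-tuning input file.
df = pd.DataFrame({
    "ctype": ["T", "T", "N", "N"],
    "dmr_ctype": ["", "", "", ""],  # empty dmr_ctype, as in the broken setup
})

# With an empty dmr_ctype, the label from _parse_line is always 0:
df["ctype_label"] = (df["ctype"] == df["dmr_ctype"]).astype(int)
assert df["ctype_label"].sum() == 0

# Workaround: set dmr_ctype to "T" everywhere, so that
# int(ctype == dmr_ctype) reproduces the original ctype column as 1/0.
df["dmr_ctype"] = "T"
df["ctype_label"] = (df["ctype"] == df["dmr_ctype"]).astype(int)
print(df["ctype_label"].tolist())  # [1, 1, 0, 0]
```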

That’s my current understanding, and I’m still testing it — please feel free to correct me if I’m wrong.

Zhaooooooooooo avatar May 06 '25 04:05 Zhaooooooooooo

Hi, I checked the code of the MethylBertFinetuneDataset class and came to the same conclusion! I had defined the dmr_ctype column myself before. I'm now modifying the dmr_ctype of my training data and fine-tuning the model again to test the result.
Thank you very much!

guanghaoli avatar May 07 '25 10:05 guanghaoli

Hi, I just got the results of the fine-tuned model and read classification. The ctype and ctype_label columns now match, but the model's predictions are still all 0. The details of the fine-tuned model and test results are shown below.

step  loss                ctype_acc  lr
0     0.8156836032867432  0.2        3.3333333333333335e-05
1     0.8056689500808716  0.26       6.666666666666667e-05
2     0.4977148473262787  0.81       0.0001
3     0.5285412669181824  0.78       0.0001
4     0.5302502512931824  0.78       0.0001
5     0.45050048828125    0.85       0.0001
6     0.6046484112739563  0.72       0.0001
7     0.558911144733429   0.75       0.0001
8     0.624957263469696   0.7        0.0001
9     0.4960571229457855  0.82       0.0001

test_pre, logit = trainer.read_classification(test_loader, tokenizer, logit=True)
test_pre["ctype"].value_counts()

ctype
T    75748
N    24252
Name: count, dtype: int64

test_pre['ctype_label'].value_counts()

ctype_label
1    75748
0    24252
Name: count, dtype: int64

test_pre['pred'].value_counts()

pred
0    100000
Name: count, dtype: int64

The values in the 'pred' column are all 0, which is quite different from the 'ctype' column.

guanghaoli avatar May 08 '25 08:05 guanghaoli

Hello @guanghaoli, thank you very much for your interest and sorry for my late reply. I am not working at DKFZ anymore, so it was a bit hard to find time to properly go through your issue.

For me, your model training does not seem successful despite the high accuracy. This can happen when your training data has a class imbalance (meaning there are many more reads from normal than from tumour, or vice versa). I can already see the class imbalance in your test data: there are almost four times more tumour reads.

If this is the case, I can suggest two options:

  1. You can balance tumour and normal reads by randomly selecting a similar number of reads from the tumour cell type. This is not an elegant solution, but it works as a sanity check to see whether training succeeds once the imbalance is removed.
  2. Afterwards, you can set --loss focal_bce for methylbert finetune. This makes the model use the focal loss function, which is known to mitigate the class imbalance problem, and lets you use the full training data set.
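Option 1 could be sketched like this with pandas (the "ctype" column name and the toy read set are assumptions; in practice you would load your own fine-tuning input file):

```python
import pandas as pd

# Toy imbalanced read set, ~3:1 tumour:normal, standing in for real data.
df = pd.concat([
    pd.DataFrame({"ctype": "T", "read_id": range(75)}),
    pd.DataFrame({"ctype": "N", "read_id": range(25)}),
], ignore_index=True)

# Downsample every cell type to the minority class size.
n_min = df["ctype"].value_counts().min()
balanced = (df.groupby("ctype", group_keys=False)
              .sample(n=n_min, random_state=42))

print(sorted(balanced["ctype"].value_counts().to_dict().items()))
# [('N', 25), ('T', 25)]
```

For option 2, the flag is passed on the command line, e.g. `methylbert finetune ... --loss focal_bce`.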

If these solutions do not work for you, please get back to me again!

Kind regards, Yunhee

hanyangii avatar May 15 '25 09:05 hanyangii

Hello @hanyangii, thanks for your suggestions. I have tried option 1, and the problem was partially solved: the predictions are no longer all identical. The results are shown below.

I fine-tuned the model using 10,000,000 tumor reads and 10,000,000 normal reads in the training set, with 4,000,000 reads in the test set.

step  loss                ctype_acc  lr
0     0.717211902141571   0.44       3.3333333333333335e-05
1     0.6981591582298279  0.5        6.666666666666667e-05
2     0.8779943585395813  0.55       0.0001
3     0.9688439965248108  0.44       0.0001
4     0.8317468166351318  0.54       0.0001
5     0.6844872832298279  0.62       0.0001
6     0.6762402057647705  0.59       0.0001
7     0.6993408203125     0.49       0.0001
8     0.7033789157867432  0.44       0.0001
9     0.6993700861930847  0.49       0.0001
10    0.6838232278823853  0.58       0.0001
11    0.7040331959724426  0.44       0.0001
12    0.6949706673622131  0.47       0.0001
13    0.6886376738548279  0.57       0.0001
14    0.6899462938308716  0.56       0.0001
15    0.6890478134155273  0.51       0.0001

The additional test dataset contained 100,000 reads, including 51,018 tumor and 48,982 normal reads.

test_pre, logit = trainer.read_classification(test_loader, tokenizer, logit=True)
test_pre["ctype"].value_counts()

ctype
T    51018
N    48982
Name: count, dtype: int64

test_pre['ctype_label'].value_counts()

ctype_label
1    51018
0    48982
Name: count, dtype: int64

test_pre['pred'].value_counts()

pred
1    96567
0    3433
Name: count, dtype: int64

The accuracy is 0.52457.

We still cannot fine-tune the model appropriately with this training set. I prepared the dataset by following the Methods section of the paper and the tutorial on GitHub. I'm wondering what I can do to improve it. There are some options: (1) using more tumor and normal reads in the training set; (2) filtering DMRs more strictly according to the methylation level or areaStat. Which do you think is feasible, or do you have other suggestions? Thanks!
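Option (2) might be sketched like this; the column names (areaStat, diff.Methy) and the thresholds are assumptions based on typical DSS callDMR output, not values from this thread:

```python
import pandas as pd

# Toy DMR table standing in for a DSS callDMR result.
dmrs = pd.DataFrame({
    "chr": ["chr1", "chr1", "chr2", "chr3"],
    "areaStat": [250.0, 12.0, -300.0, 40.0],
    "diff.Methy": [0.45, 0.05, -0.50, 0.15],
})

# Stricter filter: keep DMRs with both a large test statistic and a clear
# methylation-level difference between tumour and normal.
strict = dmrs[(dmrs["areaStat"].abs() >= 100)
              & (dmrs["diff.Methy"].abs() >= 0.3)]
print(len(strict))  # 2
```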

guanghaoli avatar May 21 '25 14:05 guanghaoli

Hello @guanghaoli

Glad to hear that my suggestion slightly improved your situation. How many steps did you use for training? I have a feeling that the training process may not have gone through your data set sufficiently, as you have many more reads than I had for the MethylBERT data. If you used the same number of steps as the MethylBERT paper, it must be too few for your data set.

If you want to quickly validate that MethylBERT runs without any problem, you can run MethylBERT with a much smaller set (e.g., choosing only the top-10 DMRs) and check the result. Then, at least, you are sure that some DMRs in your data set have a clear methylation difference between tumour and normal for MethylBERT to learn the patterns.
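Selecting a top-10 subset could look like this (ranking by the absolute areaStat column is an assumption; any DMR score from your caller would do):

```python
import pandas as pd

# Toy DMR table with a score column; 30 candidate regions.
dmrs = pd.DataFrame({
    "dmr_id": range(30),
    "areaStat": list(range(-15, 15)),  # mix of hyper-/hypomethylated scores
})

# Keep the 10 DMRs with the largest absolute score.
top10 = dmrs.loc[dmrs["areaStat"].abs().nlargest(10).index]
print(len(top10))  # 10
```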

Kind regards, Yunhee

hanyangii avatar May 27 '25 09:05 hanyangii

Hello @hanyangii

Thanks for your suggestions! Due to the large number of reads, I had only fine-tuned the model for 100 steps previously. Recently, I updated some parameters to accelerate training and fine-tuned the model for more steps. The log file shows better accuracy. I'm monitoring this run to see the results, and thanks again!

Best, Guanghao

guanghaoli avatar May 29 '25 14:05 guanghaoli

Hello @guanghaoli

Glad to hear that you got better results! :) Please feel free to reach out to me by opening an issue when you have any further questions/problems.

Kind regards, Yunhee

hanyangii avatar May 30 '25 14:05 hanyangii