methylbert Specification about DMR.csv

Hi @hanyangii, Could you share more details about the preparation of the DMR.csv? I tried to fine-tune based on a dorado called BAM file, but the fine-tune doesn't seem to work:

 methylbert preprocess_finetune --methylcaller dorado --input_file  with_cell_type.labeled.bam --f_dmr $BED --f_ref $REF --split_ratio 0.8 --n_cores 23 -o methylbert.test/
MethylBERT v2.0.2
Could not find any statistics to sort DMRs
Number of DMRs to extract sequence reads: 220
Collecting reads from .bam files: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:01<00:00,  1.53s/it]Fine-tuning data generated:                                    name flag ref_name   ref_pos map_quality cigar  ...                   CT                                            dna_seq                                         methyl_seq dmr_ctype dmr_label ctype
0  23afdc5b-12f3-4e16-8ba1-bb8c42a21a51    0     chr6  50851183          60  713M  ...  prostate_epithelial  AAA AAC ACG CGT GTT TTT TTC TCA CAA AAG AGG GG...  2202222222222222222222222222222222222220222222...         T       170    NA

[1 rows x 46 columns]
Total sequences per cell type
ctype
NA    1
Name: count, dtype: int64
Traceback (most recent call last):
  File "/home/4470655/.conda/envs/methylbert/bin/methylbert", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/home/4470655/.conda/envs/methylbert/lib/python3.11/site-packages/methylbert/cli.py", line 313, in main
    run_preprocess(args)
  File "/home/4470655/.conda/envs/methylbert/lib/python3.11/site-packages/methylbert/cli.py", line 238, in run_preprocess
    finetune_data_generate(f_dmr=args.f_dmr,
  File "/home/4470655/.conda/envs/methylbert/lib/python3.11/site-packages/methylbert/data/finetune_data_generate.py", line 384, in finetune_data_generate
    train_files, test_files = train_test_split(
                              ^^^^^^^^^^^^^^^^^
  File "/home/4470655/.conda/envs/methylbert/lib/python3.11/site-packages/sklearn/utils/_param_validation.py", line 213, in wrapper
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/4470655/.conda/envs/methylbert/lib/python3.11/site-packages/sklearn/model_selection/_split.py", line 2780, in train_test_split
    n_train, n_test = _validate_shuffle_split(
                      ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/4470655/.conda/envs/methylbert/lib/python3.11/site-packages/sklearn/model_selection/_split.py", line 2410, in _validate_shuffle_split
    raise ValueError(
ValueError: With n_samples=1, test_size=0.19999999999999996 and train_size=None, the resulting train set will be empty. Adjust any of the aforementioned parameters.
Collecting reads from .bam files: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:01<00:00,  1.66s/it]

Here is my $BED. It was a DMR based on tumor normal paired comparison. Does ctype means cell type as your example shows?

head $BED
chr     start   end     ctype
chr1    828727  829648  T
chr1    19361106        19361591        T
chr1    37734952        37735434        T
chr1    74543166        74544241        T
chr1    87151545        87152349        T
chr1    91736103        91737407        T
chr1    106081022       106081218       T
chr1    121019929       121021641       T
chr1    156845668       156845999       T

Thanks!

Apr 23 '25 13:04 Yijun-Tian

Hello @Yijun-Tian

Thank you very much for your interest in MethylBERT.

It seems like there's only one sequence read found overlapping with given DMRs in your BAM file. This is why n_samples=1 occurred in the error message. The overlapping one read is shown in the error message as well:

0  23afdc5b-12f3-4e16-8ba1-bb8c42a21a51    0     chr6  50851183          60  713M  ...  prostate_epithelial  AAA AAC ACG CGT GTT TTT TTC TCA CAA AAG AGG GG...  2202222222222222222222222222222222222220222222...         T       170    NA

[1 rows x 46 columns]
Total sequences per cell type
ctype
NA    1
Name: count, dtype: int64

I think you need a better quality of DMRs/BAM file to find more overlapping reads.

Kind regards, Yunhee

Apr 24 '25 13:04 hanyangii

Thanks for the feedback @hanyangii . The BAM file provided to the MethylBERT program was generated by overlapping between a larger WGS BAM and the BED file, so I don't think it will be the case that the alignments are too few in the DMR region. By the way, except for the region overlapping, does MethylBERT set other alignment filtering procedures, such as MAPQ, CIGAR operator? As far as I know, most of my nanopore alignments have lots of indels, which could make them filtered out?

Apr 24 '25 14:04 Yijun-Tian

Hello, @Yijun-Tian. MethylBERT does not have any further filtering steps in the algorithm. However, it does select only reads fully-overlapping with DMRs. Therefore, there's a possibility that the length of most long-reads exceeds the region size (several hundreds bps) thus have been disregarded. Could you please increase the region size by adding 1000 bps (or more depending on your read length) at the start and the end of each region, and then try again?

Apr 28 '25 11:04 hanyangii