Specification about DMR.csv
Hi @hanyangii, Could you share more details about the preparation of the DMR.csv? I tried to fine-tune based on a dorado called BAM file, but the fine-tune doesn't seem to work:
methylbert preprocess_finetune --methylcaller dorado --input_file with_cell_type.labeled.bam --f_dmr $BED --f_ref $REF --split_ratio 0.8 --n_cores 23 -o methylbert.test/
MethylBERT v2.0.2
Could not find any statistics to sort DMRs
Number of DMRs to extract sequence reads: 220
Collecting reads from .bam files: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:01<00:00, 1.53s/it]Fine-tuning data generated: name flag ref_name ref_pos map_quality cigar ... CT dna_seq methyl_seq dmr_ctype dmr_label ctype
0 23afdc5b-12f3-4e16-8ba1-bb8c42a21a51 0 chr6 50851183 60 713M ... prostate_epithelial AAA AAC ACG CGT GTT TTT TTC TCA CAA AAG AGG GG... 2202222222222222222222222222222222222220222222... T 170 NA
[1 rows x 46 columns]
Total sequences per cell type
ctype
NA 1
Name: count, dtype: int64
Traceback (most recent call last):
File "/home/4470655/.conda/envs/methylbert/bin/methylbert", line 8, in <module>
sys.exit(main())
^^^^^^
File "/home/4470655/.conda/envs/methylbert/lib/python3.11/site-packages/methylbert/cli.py", line 313, in main
run_preprocess(args)
File "/home/4470655/.conda/envs/methylbert/lib/python3.11/site-packages/methylbert/cli.py", line 238, in run_preprocess
finetune_data_generate(f_dmr=args.f_dmr,
File "/home/4470655/.conda/envs/methylbert/lib/python3.11/site-packages/methylbert/data/finetune_data_generate.py", line 384, in finetune_data_generate
train_files, test_files = train_test_split(
^^^^^^^^^^^^^^^^^
File "/home/4470655/.conda/envs/methylbert/lib/python3.11/site-packages/sklearn/utils/_param_validation.py", line 213, in wrapper
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/home/4470655/.conda/envs/methylbert/lib/python3.11/site-packages/sklearn/model_selection/_split.py", line 2780, in train_test_split
n_train, n_test = _validate_shuffle_split(
^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/4470655/.conda/envs/methylbert/lib/python3.11/site-packages/sklearn/model_selection/_split.py", line 2410, in _validate_shuffle_split
raise ValueError(
ValueError: With n_samples=1, test_size=0.19999999999999996 and train_size=None, the resulting train set will be empty. Adjust any of the aforementioned parameters.
Collecting reads from .bam files: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:01<00:00, 1.66s/it]
Here is my $BED. It was a DMR based on tumor normal paired comparison. Does ctype means cell type as your example shows?
head $BED
chr start end ctype
chr1 828727 829648 T
chr1 19361106 19361591 T
chr1 37734952 37735434 T
chr1 74543166 74544241 T
chr1 87151545 87152349 T
chr1 91736103 91737407 T
chr1 106081022 106081218 T
chr1 121019929 121021641 T
chr1 156845668 156845999 T
Thanks!
Hello @Yijun-Tian
Thank you very much for your interest in MethylBERT.
It seems like there's only one sequence read found overlapping with given DMRs in your BAM file. This is why n_samples=1 occurred in the error message. The overlapping one read is shown in the error message as well:
0 23afdc5b-12f3-4e16-8ba1-bb8c42a21a51 0 chr6 50851183 60 713M ... prostate_epithelial AAA AAC ACG CGT GTT TTT TTC TCA CAA AAG AGG GG... 2202222222222222222222222222222222222220222222... T 170 NA
[1 rows x 46 columns]
Total sequences per cell type
ctype
NA 1
Name: count, dtype: int64
I think you need a better quality of DMRs/BAM file to find more overlapping reads.
Kind regards, Yunhee
Thanks for the feedback @hanyangii . The BAM file provided to the MethylBERT program was generated by overlapping between a larger WGS BAM and the BED file, so I don't think it will be the case that the alignments are too few in the DMR region. By the way, except for the region overlapping, does MethylBERT set other alignment filtering procedures, such as MAPQ, CIGAR operator? As far as I know, most of my nanopore alignments have lots of indels, which could make them filtered out?
Hello, @Yijun-Tian. MethylBERT does not have any further filtering steps in the algorithm. However, it does select only reads fully-overlapping with DMRs. Therefore, there's a possibility that the length of most long-reads exceeds the region size (several hundreds bps) thus have been disregarded. Could you please increase the region size by adding 1000 bps (or more depending on your read length) at the start and the end of each region, and then try again?