modkit Calling 5hmC

Hi!

I'm interested in calling 5hmC throughout CpGs across the genome, and I wanted to clarify how best to do that. At this stage, I'd like to focus solely on 5hmC to the exclusion of 5mC. I'd thought the best way to do that would be to include '--ignore m', such that bases could be called only as 5hmC or unmethylated. I think that I've perhaps misunderstood this, particularly as all CpGs in my results are now called as 'h'.

What would be the best way to do go about this? Should I run without the --ignore flag entirely? When focussing only on 5hmC, I'd rather 5mC calls were either identified as such, or as 'unmethylated'.

Many thanks, Kevin

May 22 '24 14:05 KevDonn

Hello @KevDonn,

Here are 2 ways to aggregate 5hmC calls: (both ways require you to run modkit pileup first to generate a bedMethyl file)

Filter to positions only reporting 5hmC:

$ awk '$4=="h"' ${pileup_bed}

You can then decide on a threshold value, say 50%, and decide that positions with >50% 5hmC are called as 5hmC. Keep in mind that rows with 5hmC %-modified less than 50% are not canonical, they could be 5mC or canonical. You may also decide to filter on the valid-coverage column (because one read calling a 5hmC will result in that position being 100% 5hmC). Here I've decided to only keep bedMethyl records with valid-coverage greater than 5.

$ awk '($4=="h")&&($5>5)' ${pileup_bed}

You can use awk to tabulate the number of 5hmC calls in the sample (you could do something similar with pandas or polars):


$ awk '$4=="h" {can+=$13; mod+=$12; valid+=$10} END{print (can/valid) " CpG canonical\n" (mod/valid) " 5hmCpG modified"}' ${pileup_bed}

This command will calculate the fraction of all calls that are 5hmC and canonical.

Hope this helps, if you give me more details maybe I can help you more.

May 22 '24 15:05 ArtRand

Hi @ArtRand,

Thanks for responding so quickly - yes, this is very helpful. My current command looks like this:

modkit extract \ --include-bed $TARGETBED \ --threads 4 \ --reference $REF \ --mapped-only --cpg \ --sampling-frac 1.0 \ --seed 12345 \ --edge-filter 40,0 \ --log-filepath $SAMPLE.5hmC.alpha.log \ --read-calls $SAMPLE.5hmC.readCalls.txt \ $INPUTBAM \ $SAMPLE.5hmC.alpha.txt

I've removed the '--ignore' flag entirely. The bed file contains a set of CpG sites from across the genome. The output file '$SAMPLE.5hmC.readCalls.txt' seems to contain something much more akin to what I'm looking for, in that the 'call_code' field now contains a mixture of '-', 'm', and 'h'. I'd be entirely happy to consider 5hmC vs the other groups. Does this seem like a sensible way to proceed?

Thanks again, Kevin

May 22 '24 16:05 KevDonn

Hello @KevDonn,

Your command looks fine, however something to consider is that modkit extract is intended for read-level interrogation, so you get one row per position per read. As a result, these tables can be very large. Given that you're using --reference, --cpg, and --include-bed, you may consider using modkit pileup instead. The pileup command supports the same functionality but will make more concise tables that are compatible with entropy, dmr etc. One way I like to work is perform modkit pileup, use dmr or just some basic exploratory data analysis on the table and find some regions that are interesting. If I want to interrogate the reads directly, I'll use modkit extract on regions I want to look at more closely.

May 22 '24 23:05 ArtRand

Thanks @ArtRand,

Read-level info is indeed what we're after, cumbersome as it is! Thanks again for your advice - we may run a few different approaches, and so it's helpful to know what our options are.

May 23 '24 10:05 KevDonn