d4-format icon indicating copy to clipboard operation
d4-format copied to clipboard

New stat function: % of positions in intervals with coverage >X

Open Jakob37 opened this issue 1 year ago • 9 comments

This is something we are looking at over at the https://github.com/Clinical-Genomics/chanjo2 repo.

The current approach is to read the output from the d4 file using d4tools show. It would be much quicker to do this directly in d4tools.

The execution could be something like:

$ d4tools stat --stat perc_cov "10,20,30" --region "intervals.bed"

The desired output would be something like:

chr     start           end             10x  20x   30x
19      31030023        32034023        98.3 96.3  94.1
19      49624990        49625000        94.2 90.3  85.3
19      49625010        49625020        88.5 80.3  75.3

Feedbacks on this are welcome. If I have the time, and no one else jumps on it, I'll give it a go and see if I can implement it.

Jakob37 avatar Jun 14 '24 15:06 Jakob37

Funnily enough I've been working on something related to this using the d4 crate, though I hadn't considered implementing it within d4tools show. Unfortunately I haven't had time lately to finish it up.

Your idea should be relatively straightforward to implement using d4's Tasks. I'd be happy to take a stab when I have some more free time in the coming weeks. Also can chip in if you've already started!

cademirch avatar Jun 17 '24 06:06 cademirch

Funnily enough I've been working on something related to this using the d4 crate, though I hadn't considered implementing it within d4tools show. Unfortunately I haven't had time lately to finish it up.

Your idea should be relatively straightforward to implement using d4's Tasks. I'd be happy to take a stab when I have some more free time in the coming weeks. Also can chip in if you've already started!

This is great to hear @cademirch I haven't gotten started with this yet. This would be extremely useful for us, as it would resolve our (hopefully) final performance bottleneck before we could use this in production.

I would be happy to help out with testing or in any other ways that might be helpful, just let me know!

Jakob37 avatar Jun 17 '24 07:06 Jakob37

@Jakob37 Ended up having some free time today and took a crack at this here: https://github.com/cademirch/d4-format/tree/perc_cov

I think it works as expected... I tested a little on some random d4 files:

d4tools show -R regions.bed input.d4:

JANIGQ010000002.1       0       5       0
JANIGQ010000002.1       5       29      1
JANIGQ010000002.1       29      80      2
JANIGQ010000002.1       80      100     3

Then, d4tools stat -H -s perc_cov=1,2,5 -r regions.bed input.d4:

#Chr    Start   End     1x      2x      5x
JANIGQ010000002.1       0       100     0.940   0.700   0.000

Currently this wont work with multi-track d4 files, so just fyi. Also, I'm super new to Rust so this is probably not merge worthy. Lmk what you think!

cademirch avatar Jun 18 '24 01:06 cademirch

Nice @cademirch , I'll take a look during the day!

Jakob37 avatar Jun 18 '24 05:06 Jakob37

I am trying this out. It looks great!! Our previous way of processing 74 intervals with three thresholds took ~15 seconds. With your function it takes only 0.5s.

The results looks identical to the ones we produces as well. So it looks like it is both quick and doing what it should. Nice job 💪

Jakob37 avatar Jun 19 '24 06:06 Jakob37

It would be great to get this one into d4tools in some way actually. I looked at the diff, and the code looks well organized, but I am unfortunately not experienced in Rust either.

Still, could you put this up as a PR? And then we could see if anyone else could take a look at it.

Jakob37 avatar Jun 19 '24 06:06 Jakob37

Hi again! What do you think about making a PR of this @cademirch ? It worked fine for me when testing it some weeks ago. I can help out with further testing as well.

This is something we really need, so I am thinking about how we could drive this forward.

Jakob37 avatar Jul 23 '24 04:07 Jakob37

@Jakob37 Thanks for the reminder. Sorry I fell off this - got caught up with some deadlines. I can open a PR tomorrow, would be great to add some tests!

cademirch avatar Jul 23 '24 04:07 cademirch

@Jakob37 Thanks for the reminder. Sorry I fell off this - got caught up with some deadlines. I can open a PR tomorrow, would be great to add some tests!

No worries, sounds great! Yes, I can do both some manual testing and look into how to unit test here .. Will look into it when the PR is up.

I am in office this week, away for two weeks, and then back again.

Jakob37 avatar Jul 23 '24 06:07 Jakob37

Thank you for the great work @cademirch . Much appreciated 🙏 Looks like we soon will have this in production. I'll close this one now.

Jakob37 avatar Sep 26 '24 12:09 Jakob37

Thank you for the great work @cademirch . Much appreciated 🙏 Looks like we soon will have this in production. I'll close this one now.

No problem glad we got it sorted. Thank you for helping figure out those bugs!!

cademirch avatar Sep 26 '24 15:09 cademirch