feat: Add percent coverage option to stat
This adds the feature requested in #81 by @Jakob37. I tried my best to emulate the existing code in stat, but would of course appreciate any feedback.
Not sure if you're accepting new features @brentp, happy to discuss more.
I agree tests are needed. @Jakob37 mentioned they can help with tests for this, and we can set up some actions to run them.
@Jakob37 could you possibly provide a small d4 file and regions.bed along with expected output?
Great, thank you for putting together this PR.
> @Jakob37 could you possibly provide a small d4 file and regions.bed along with expected output?
I'll look into it!
Doing some initial testing. I have prepared a small d4 file from the hg002 GIAB dataset where I have cut out one gene.
I run it like this:
$ <home>/d4-format-cademirch/target/debug/d4tools stat -H -s perc_cov=10,20,30,40,50 -r hg002_testdata/test_ranges.bed hg002_testdata/hg002_19_range.d4
#Chr Start End 10x 20x 30x 40x 50x
19 45769552 45769606 0.574 0.056 0.000 0.000 0.000
19 45769552 45770149 1.372 1.191 0.936 0.541 0.295
19 45769767 45769811 1.000 1.000 1.000 1.000 0.977
The content of test_ranges.bed is as follows. The first is a region with low coverage, the second is the full range, and the third is from a high-coverage region.
19 45769552 45770149
19 45769552 45769606
19 45769767 45769811
Looks like some strange values here. Coverage should never be above 1, should it?
19 45769552 45770149 1.372 1.191 0.936 0.541 0.295
I've attached the d4 file, the bed file, and a file with the coverage as seen by d4tools show.
Will continue digging around a bit, and calculate the real expected values.
I am a bit confused about how ranges are handled. When I run d4tools show, the output looks zero-indexed, and the end of each interval is the same as the start of the next, i.e. like ranges are handled in Python.
In the bed file, on the other hand, I would expect things to be one-indexed with "inclusive" ends.
I.e. a range of 1 - 3 in the bed file would mean the first three positions, and to show the same in Python it would be 0 - 3.
Is this correct?
Based on this, I ran two very narrow ranges (one-indexed bed file). The first:
19 45769557 45769559
And based on the coverage data (the first lines in the provided d4 file, zero-indexed with exclusive ends):
19 45769552 45769557 1
19 45769557 45769563 2
I would expect a coverage of 0.667, but I get 1.
Also, for the second range (one-indexed bed file):
19 45769557 45769562
I would expect 0.8 in my testing (i.e. 4 of them falling within the second range), but I get 1.4 when running the PR.
Have to run; I'll see if I can clarify this tomorrow.
Thank you for putting those datasets together and for the testing effort, @Jakob37! I've always understood bed coordinates to be 0-based and half-open.
I'm also going on vacation, but will be back on Tuesday and will take a look at this more then!
> Thank you for putting those datasets together and for the testing effort, @Jakob37! I've always understood bed coordinates to be 0-based and half-open.
Aha, then I learned something. It looks like you are right. I'll update the range coordinates above then.
> I'm also going on vacation, but will be back on Tuesday and will take a look at this more then!
Sounds great. Enjoy your vacation!
> Aha, then I learned something. It looks like you are right. I'll update the range coordinates above then.
0 vs 1 based genomic coordinates is the most annoying problem 🙄
> Sounds great. Enjoy your vacation!
Thank you, you too!
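To make the coordinate-convention point concrete, here is a minimal standalone sketch (not part of the PR; the function name is invented for illustration) of how a 0-based, half-open BED interval maps to 1-based, inclusive coordinates:

```rust
/// Illustration only: convert a 0-based, half-open [start, end) interval
/// (the BED convention) into 1-based, inclusive coordinates.
fn to_one_based_inclusive(start: u32, end: u32) -> (u32, u32) {
    (start + 1, end)
}

fn main() {
    // BED line "chr 3 5" covers the 4th and 5th bases of the contig.
    assert_eq!(to_one_based_inclusive(3, 5), (4, 5));
    // In the half-open convention, the interval length is simply end - start.
    assert_eq!(5 - 3, 2);
    println!("BED 3-5 == 1-based 4-5, length 2");
}
```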
I am back in the office now. I am happy to help out further with testing / providing testing data, @cademirch. Just let me know what you need. Excited to get this in, as it will mean that we can start using Chanjo2 in production 😄
Hi @cademirch, I really appreciate this new feature! Good stuff! 🥇 Perhaps in this PR you could update the README file to include the new command before merging, so the changes are reflected in the documentation?
Hi @Jakob37 and @northwestwitch, will try to work on this later this week!
There is some weird double counting happening in feed_range.
Added a print statement in feed_range to see what's happening:
```rust
fn feed_range(&mut self, left: u32, right: u32, value: &mut Once<i32>) -> bool {
    let value = value.next().unwrap();
    println!(
        "bed_region:({}, {}) feed_range:({}, {}), value:{:?}",
        self.region_begin, self.region_end, left, right, value
    );
    for (i, thresh) in self.thresholds.iter().enumerate() {
        if value as u32 >= *thresh {
            self.counts[i] += std::cmp::min(right, self.region_end) - left as u32
        }
    }
    true
}
```
Output with ranges.bed unmodified from above:
❯ ./target/release/d4tools stat -H -s perc_cov=1,2 -r ../perc-cov-testing/ranges.bed ../perc-cov-testing/out.d4
bed_region:(0, 5) feed_range:(0, 3), value:0
bed_region:(0, 5) feed_range:(3, 3), value:1
bed_region:(0, 5) feed_range:(0, 3), value:0
bed_region:(0, 5) feed_range:(3, 4), value:1
bed_region:(3, 5) feed_range:(0, 3), value:0
bed_region:(3, 5) feed_range:(3, 4), value:1
bed_region:(0, 5) feed_range:(0, 3), value:0
bed_region:(0, 5) feed_range:(3, 4), value:1
bed_region:(3, 5) feed_range:(0, 3), value:0
bed_region:(3, 5) feed_range:(3, 4), value:1
bed_region:(4, 6) feed_range:(0, 3), value:0
bed_region:(4, 6) feed_range:(3, 4), value:1
bed_region:(0, 5) feed_range:(4, 5), value:2
bed_region:(3, 5) feed_range:(4, 5), value:2
bed_region:(4, 6) feed_range:(4, 5), value:2
bed_region:(4, 6) feed_range:(4, 6), value:2
#Chr Start End 1x 2x
chr 0 5 0.600 0.200
chr 3 5 1.500 0.500
chr 4 6 2.000 1.500
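An aside on reading these numbers: assuming each threshold column is reported as counts[i] divided by the bed-region length (the printed values are consistent with that, though this is an inference, not a quote of the PR code), a range that gets fed twice pushes the fraction past 1.0. A tiny sketch of that arithmetic for bed region chr 3 5, under this assumption:

```rust
fn main() {
    // Hypothetical walk-through for bed region chr 3 5 (length 2) with thresholds 1x and 2x.
    let region_len = 5u32 - 3;
    // In the buggy run above, the [3, 4) row (depth 1) is fed twice and [4, 5) (depth 2) once,
    // so the >=1x base count becomes 3 instead of 2, while the >=2x count stays at 1.
    let counts = [3u32, 1u32];
    for (c, label) in counts.iter().zip(["1x", "2x"]) {
        // Reproduces the 1.500 / 0.500 row reported above.
        println!("{label}: {:.3}", *c as f64 / region_len as f64);
    }
}
```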
If we make ranges.bed just:
chr 0 5
chr 3 5
Output looks better:
❯ ./target/release/d4tools stat -H -s perc_cov=1,2 -r ../perc-cov-testing/ranges.bed ../perc-cov-testing/out.d4
bed_region:(0, 5) feed_range:(0, 3), value:0
bed_region:(0, 5) feed_range:(3, 3), value:1
bed_region:(0, 5) feed_range:(0, 3), value:0
bed_region:(0, 5) feed_range:(3, 4), value:1
bed_region:(3, 5) feed_range:(0, 3), value:0
bed_region:(3, 5) feed_range:(3, 4), value:1
bed_region:(0, 5) feed_range:(4, 5), value:2
bed_region:(3, 5) feed_range:(4, 5), value:2
#Chr Start End 1x 2x
chr 0 5 0.400 0.200
chr 3 5 1.000 0.500
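As a cross-check on the expected values, here is a small self-contained sketch, independent of d4 (function and variable names are made up), that computes the per-threshold fractions directly from the depth profile implied by the debug output above (depth 0 on [0, 3), 1 on [3, 4), 2 on [4, 6)). It gives 0.400/0.200 for chr 0 5, 1.000/0.500 for chr 3 5, and 1.000/1.000 for chr 4 6, matching the corrected outputs later in the thread:

```rust
/// Illustration only: fraction of bases in [start, end) with depth >= threshold,
/// given non-overlapping (begin, end, depth) segments.
fn fraction_at_least(segments: &[(u32, u32, u32)], start: u32, end: u32, threshold: u32) -> f64 {
    let covered: u32 = segments
        .iter()
        .filter(|&&(_, _, depth)| depth >= threshold)
        .map(|&(s, e, _)| e.min(end).saturating_sub(s.max(start)))
        .sum();
    covered as f64 / (end - start) as f64
}

fn main() {
    // Depth profile implied by the debug prints above.
    let segs = [(0u32, 3u32, 0u32), (3, 4, 1), (4, 6, 2)];
    for &(s, e) in &[(0u32, 5u32), (3, 5), (4, 6)] {
        println!(
            "chr {} {} {:.3} {:.3}",
            s,
            e,
            fraction_at_least(&segs, s, e, 1),
            fraction_at_least(&segs, s, e, 2)
        );
    }
}
```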
> There is some weird double counting happening in feed_range.
Hmmm. Yes, there indeed seems to be something strange going on.
I tried reducing the ranges further, to get something minimal to debug:
chr 3 5
chr 4 5
This gave me:
$ /home/jakob/src/d4-format-cademirch/target/debug/d4tools stat -H -s perc_cov=1,2 -r ranges_minimal_debug.bed out.d4
feed_rows 3 4
feed_range:(3, 4), value:1
feed_rows 3 4
feed_range:(3, 4), value:1
feed_rows 3 4
feed_range:(3, 4), value:1
feed_rows 4 5
feed_range:(4, 5), value:2
feed_rows 4 5
feed_range:(4, 5), value:2
#Chr Start End 1x 2x
chr 3 5 1.500 0.500
chr 4 5 2.000 1.000
Indeed, something is strange here. I'll see if I can narrow it down.
Hi! I gave this branch a try using chanjo2 and, as you say, it looks like the values are double the expected ones.
These are, for instance, the values from the chanjo2 main branch (completeness computed with an algorithm in chanjo2):
This is the result using this branch of d4tools:
It's not exactly double the values, because the stats are calculated in a different way, but kind of 🤔
I suspect something is going wrong in the scan_partition_impl call in the track.rs file.
scan_partition_impl is called multiple times (once per range), and active_handles.iter_mut() is also processing specific ranges multiple times (up to once per range).
https://github.com/38/d4-format/blob/68d49025c8a3e57e90486e6566e8f24a8e125f06/d4/src/d4file/track.rs#L181-L198
The issue seems to be triggered at the bounds between different coverage levels.
This is fine:
#Chr Start End 1x 2x
chr 2 5 0.667 0.333
chr 3 5 1.000 0.500
But if you add 1 to the 3-5 range:
#Chr Start End 1x 2x
chr 2 5 1.000 0.333
chr 4 5 2.000 1.000
If it is extended to 2 nt on the right side, it is still weird:
#Chr Start End 1x 2x
chr 2 5 1.000 0.333
chr 4 6 2.000 1.500
Edit: Running only the second range looks fine.
#Chr Start End 1x 2x
chr 4 6 1.000 1.000
Can you guys see what might be going wrong here?
I'll circle back to this later this week, to see it again with fresh eyes.
Thanks for taking a look at this @Jakob37 and @northwestwitch. I too ended up at the scan_partition_impl closure in my debugging. I think the issue stems from last_right being set to part_left at the beginning of each partition scan. When this happens with multiple bed ranges that span multiple partitions (I believe partitions are defined where coverage changes), double feeds occur. My solution is to add a global_last_right outside the closure to keep track of the last right position across partitions:
```rust
fn scan_partition<DS: DataScanner<Self::RowType>>(&mut self, handles: &mut [DS]) {
    let default_primary_value = self.primary.default_value();
    let mut decoder = self.primary.make_decoder();
    let mut global_last_right = None; // Track the last right across all partitions
    let mut part_num = 0;
    scan_partition_impl(handles, |part_left, part_right, active_handles| {
        if let Some(default_value) = default_primary_value {
            part_num += 1;
            let iter = self.secondary.seek_iter(part_left);
            let mut last_right = global_last_right.unwrap_or(part_left);
            // let mut last_right = part_left;
            println!("scan_partition {} part left: {} part right: {} last_right: {}, global: {:?}", part_num, part_left, part_right, last_right, global_last_right);
            for (mut left, mut right, value) in iter {
                left = left.max(part_left);
                right = right.min(part_right).max(left);
                println!("scan_partition {} left: {} right: {}, last_right: {}", part_num, left, right, last_right);
                if last_right >= left {
                    left = last_right;
                }
                ...
```
Outputs: (Print debugging saves the day 😄)
❯ RUST_BACKTRACE=1 ./target/release/d4tools stat -H -s perc_cov=1,2 -r ../perc-cov-testing/ranges.bed ../perc-cov-testing/out.d4
scan_partition 1 part left: 0 part right: 3 last_right: 0, global: None
scan_partition 1 left: 3 right: 3, last_right: 0
feed_rows 1 last_right:0, left:3
feedrange bed_region:(0, 5) feed_range:(0, 3), value:0
feed_rows 1 left:3, right:3
feedrange bed_region:(0, 5) feed_range:(3, 3), value:1
scan_partition 2 part left: 0 part right: 4 last_right: 3, global: Some(3)
scan_partition 2 left: 3 right: 4, last_right: 3
feed_rows 2 left:3, right:4
feedrange bed_region:(0, 5) feed_range:(3, 4), value:1
feed_rows 2 left:3, right:4
feedrange bed_region:(3, 5) feed_range:(3, 4), value:1
scan_partition 3 part left: 0 part right: 5 last_right: 4, global: Some(4)
scan_partition 3 left: 3 right: 4, last_right: 4
feed_rows 3 left:4, right:4
feedrange bed_region:(0, 5) feed_range:(4, 4), value:1
feed_rows 3 left:4, right:4
feedrange bed_region:(3, 5) feed_range:(4, 4), value:1
feed_rows 3 left:4, right:4
feedrange bed_region:(4, 6) feed_range:(4, 4), value:1
scan_partition 3 left: 4 right: 5, last_right: 4
feed_rows 3 left:4, right:5
feedrange bed_region:(0, 5) feed_range:(4, 5), value:2
feed_rows 3 left:4, right:5
feedrange bed_region:(3, 5) feed_range:(4, 5), value:2
feed_rows 3 left:4, right:5
feedrange bed_region:(4, 6) feed_range:(4, 5), value:2
scan_partition 4 part left: 4 part right: 6 last_right: 5, global: Some(5)
scan_partition 4 left: 4 right: 6, last_right: 5
feed_rows 4 left:5, right:6
feedrange bed_region:(4, 6) feed_range:(5, 6), value:2
#Chr Start End 1x 2x
chr 0 5 0.400 0.200
chr 3 5 1.000 0.500
chr 4 6 1.000 1.000
Note: for the bed range chr 3 5, I think this is the correct output, since base 3 has a depth of 1 and base 4 has a depth of 2; thus 100% of bases are >= 1x and 50% are >= 2x.
I will set up some more tests for this fix and try to make sure this doesn't break anything else. Should I make a separate PR for this? Maybe @brentp has a preference?
Great job debugging, @cademirch! Home with a sick kid today, but I will test and see if I can "break" your new code later this week 😅
My workaround seems to break the test in d4tools/test/test-results/stat/region-stat. Still trying to wrap my head around it. In the test, none of the bed regions overlap, whereas overlapping regions seem to be the crux of the issue in this PR. It seems like scan_partition_impl was designed with the expectation that it would be serving data to only one bed interval at a time (this is speculation). It may take a more clever solution to handle both cases.
I have done some more digging around. It takes some time to work through this code!
I have another candidate solution.
In this chunk:
https://github.com/38/d4-format/blob/68d49025c8a3e57e90486e6566e8f24a8e125f06/d4/src/d4file/track.rs#L66-L82
The active handles are the overlapping ranges.
func will retrieve values from the d4file within the range and perform the supplied stat calculation for each handle (active_heap). This is what we debugged above. I suspect the errors here lead to follow-up errors in those parts. Maybe you can see if your numbers make sense with this change, @cademirch.
I used a simple example with intervals 0 - 10 and 4 - 5.
Here I could see that something went wrong. Calculations were performed for these ranges:
- For range 0 - 4
- For range 4 - 5
- For range 0 - 10 (this is the error - should be 5 - 10, I think)
Here are my debug prints. The 0 - 10 result is wrong - we expect 6 bases above 1x and 3 above 2x.
func (2) called for range: 0 4
active_heap[0]: 0 10
func (1) called for range: 4 5 (last_end 4)
active_heap[0]: 4 5
active_heap[1]: 0 10
func (1) called for range: 0 10 (last_end 5)
active_heap[0]: 0 10
No more handles left
#Chr Start End 1x 2x
chr 0 10 0.800 0.400
chr 4 5 1.000 1.000
I updated it such that it only performs the calculation for each interval once. I needed to update this both on line 73 and line 86.
```rust
let this_begin;
if top.get_range().0 > last_end {
    this_begin = top.get_range().0;
} else {
    this_begin = last_end;
}
```
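As a side note (purely a style observation, not part of the fix above, and assuming both values are the same unsigned integer type), the if/else is simply a max of two positions. A tiny standalone check of that equivalence, with plain u32 values standing in for top.get_range().0 and last_end:

```rust
fn main() {
    // Stand-ins for (top.get_range().0, last_end) pairs.
    for &(range_start, last_end) in &[(0u32, 4u32), (5, 4), (4, 4)] {
        // The fix above: start at whichever is larger.
        let this_begin = if range_start > last_end { range_start } else { last_end };
        assert_eq!(this_begin, range_start.max(last_end));
    }
    println!("if/else and max() agree");
}
```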
Then it calculates for the expected ranges, and gives the expected results.
func (2) called for range: 0 4
active_heap[0]: 0 10
func (1) called for range: 4 5 (last_end 4)
active_heap[0]: 4 5
active_heap[1]: 0 10
func (1) called for range: 5 10 (last_end 5)
active_heap[0]: 0 10
No more handles left
#Chr Start End 1x 2x
chr 0 10 0.600 0.300
chr 4 5 1.000 1.000
Trying it out on the ranges you used:
func (2) called for range: 0 3
active_heap[0]: 0 5
func (2) called for range: 3 4
active_heap[0]: 0 5
active_heap[1]: 3 5
func (1) called for range: 4 5 (last_end 4)
active_heap[0]: 0 5
active_heap[1]: 3 5
active_heap[2]: 4 6
func (1) called for range: 5 6 (last_end 5)
active_heap[0]: 4 6
#Chr Start End 1x 2x
chr 0 5 0.400 0.200
chr 3 5 1.000 0.500
chr 4 6 1.000 1.000
I will do some further testing here as well, and see if this works in the full test suite.
Looks like the change above passes the tests (guessing the remove-view errors are unrelated).
Wow, awesome @Jakob37! I was planning on picking up where I left off on this tomorrow morning, but it looks like you've figured it out! And I think this is the best solution since this is fixing the issue in the scan_partition_impl definition, rather than the closure we were looking at before.
> And I think this is the best solution since this is fixing the issue in the scan_partition_impl definition, rather than the closure we were looking at before.
Yes, I agree!
But then it seems we can start wrapping this up by adding tests and updating the docs.
I plan to test some more edge cases later today. Will let you know how it goes.
If there is something else I can help out with, let me know!
Great. I'll do some testing as well. Will you make a new PR with your fix? Then if all is good @brentp can merge that and then this one.
> Great. I'll do some testing as well. Will you make a new PR with your fix? Then if all is good @brentp can merge that and then this one.
OK, I'll put up the PR!
PR up: #84
Merged! Let's wrap this one up now. Seems like the remaining steps are:
- Document the changes in the command line interface
- Document the changes in the documentation
- Add one or several tests based on the 10nt d4-file (now present in the test data folder)
Just let me know if you want help with any of this @cademirch
@Jakob37 great! I'll try to get this done asap.
Alright @Jakob37 and @brentp, I think this is ready to merge!
Alright I think this should be good to go. LMK if there is anything else!
Nice! If @brentp doesn't have further comments, then let's get it in 👌
Would you be able to merge and release @brentp ? Many thanks 🙏