feat: Add percent coverage option to stat
This adds the feature requested in #81 by @Jakob37. I tried my best to emulate the existing code in stat, but would of course appreciate any feedback.
Not sure if you're accepting new features @brentp, happy to discuss more.
I agree tests are needed. @Jakob37 mentioned they can help with tests for this, and we can set up some actions to run them.
@Jakob37 could you possibly provide a small d4 file and regions.bed along with expected output?
Great, thank you for putting together this PR.
> @Jakob37 could you possibly provide a small d4 file and regions.bed along with expected output?
I'll look into it!
Doing some initial testing. I have prepared a small d4 file from the hg002 GIAB dataset where I have cut out one gene.
I run it like this:
$ <home>/d4-format-cademirch/target/debug/d4tools stat -H -s perc_cov=10,20,30,40,50 -r hg002_testdata/test_ranges.bed hg002_testdata/hg002_19_range.d4
#Chr Start End 10x 20x 30x 40x 50x
19 45769552 45769606 0.574 0.056 0.000 0.000 0.000
19 45769552 45770149 1.372 1.191 0.936 0.541 0.295
19 45769767 45769811 1.000 1.000 1.000 1.000 0.977
The content of test_ranges.bed is as follows. The first is a region with low coverage, the second is the full range, and the third is from a high-coverage region.
19 45769552 45770149
19 45769552 45769606
19 45769767 45769811
Looks like some strange values here. Coverage should never be above 1, should it?
19 45769552 45770149 1.372 1.191 0.936 0.541 0.295
I've attached the d4 file, the bed file, and a file with the coverage as seen by d4tools show.
Will continue digging around a bit, and calculate the real expected values.
I am a bit confused about how ranges are handled. When I run d4tools show, the output looks zero-indexed, and the end of each interval is the same as the start of the next, i.e. like ranges are handled in Python.
In the bed file, on the other hand, I would expect things to be one-indexed with "inclusive" ends.
I.e. a range of 1 - 3 in the bed file would mean the first three positions, and to show the same in Python it would be 0 - 3.
Is this correct?
Based on this, I ran two very narrow ranges (one-indexed bed file). The first:
19 45769557 45769559
And based on the coverage data (the first lines in the provided d4 file, zero-indexed with exclusive ends):
19 45769552 45769557 1
19 45769557 45769563 2
I would expect a coverage of 0.667, but I get 1.
Also, for the second range (one-indexed bed file):
19 45769557 45769562
I would expect 0.8 in my testing (i.e. 4 of them falling within the second range), but I get 1.4 when running the PR.
Have to run; I'll see if I can clarify this tomorrow.
Thank you for putting those datasets together and for the testing effort, @Jakob37! I've always understood bed coordinates to be 0-based and half-open.
I'm also going on vacation, but will be back on Tuesday and will take a look at this more then!
> Thank you for putting those datasets together and for the testing effort, @Jakob37! I've always understood bed coordinates to be 0-based and half-open.
Aha, then I learned something. It looks like you are right. I'll update the range coordinates above then.
> I'm also going on vacation, but will be back on Tuesday and will take a look at this more then!
Sounds great. Enjoy your vacation!
> Aha, then I learned something. It looks like you are right. I'll update the range coordinates above then.
0 vs 1 based genomic coordinates is the most annoying problem 🙄
> Sounds great. Enjoy your vacation!
Thank you, you too!
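To make the coordinate-convention point concrete, here is a minimal standalone sketch (not part of the PR; the function name is invented for illustration) of how a 0-based, half-open BED interval maps to 1-based, inclusive coordinates:

```rust
/// Illustration only: convert a 0-based, half-open [start, end) interval
/// (the BED convention) into 1-based, inclusive coordinates.
fn to_one_based_inclusive(start: u32, end: u32) -> (u32, u32) {
    (start + 1, end)
}

fn main() {
    // BED line "chr 3 5" covers the 4th and 5th bases of the contig.
    assert_eq!(to_one_based_inclusive(3, 5), (4, 5));
    // In the half-open convention, the interval length is simply end - start.
    assert_eq!(5 - 3, 2);
    println!("BED 3-5 == 1-based 4-5, length 2");
}
```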
I am back in the office now. I am happy to help out further with testing / providing testing data, @cademirch. Just let me know what you need. Excited to get this in, as it will mean that we can start using Chanjo2 in production 😄
Hi @cademirch, I really appreciate this new feature! Good stuff! 🥇 Perhaps in this PR you could update the README file to include the new command before merging, so the changes are reflected in the documentation?
Hi @Jakob37 and @northwestwitch, will try to work on this later this week!
There is some weird double counting happening in feed_range.
Added a print statement in feed_range to see what's happening:
```rust
fn feed_range(&mut self, left: u32, right: u32, value: &mut Once<i32>) -> bool {
    let value = value.next().unwrap();
    println!(
        "bed_region:({}, {}) feed_range:({}, {}), value:{:?}",
        self.region_begin, self.region_end, left, right, value
    );
    for (i, thresh) in self.thresholds.iter().enumerate() {
        if value as u32 >= *thresh {
            self.counts[i] += std::cmp::min(right, self.region_end) - left as u32
        }
    }
    true
}
```
Output with ranges.bed unmodified from above:
❯ ./target/release/d4tools stat -H -s perc_cov=1,2 -r ../perc-cov-testing/ranges.bed ../perc-cov-testing/out.d4
bed_region:(0, 5) feed_range:(0, 3), value:0
bed_region:(0, 5) feed_range:(3, 3), value:1
bed_region:(0, 5) feed_range:(0, 3), value:0
bed_region:(0, 5) feed_range:(3, 4), value:1
bed_region:(3, 5) feed_range:(0, 3), value:0
bed_region:(3, 5) feed_range:(3, 4), value:1
bed_region:(0, 5) feed_range:(0, 3), value:0
bed_region:(0, 5) feed_range:(3, 4), value:1
bed_region:(3, 5) feed_range:(0, 3), value:0
bed_region:(3, 5) feed_range:(3, 4), value:1
bed_region:(4, 6) feed_range:(0, 3), value:0
bed_region:(4, 6) feed_range:(3, 4), value:1
bed_region:(0, 5) feed_range:(4, 5), value:2
bed_region:(3, 5) feed_range:(4, 5), value:2
bed_region:(4, 6) feed_range:(4, 5), value:2
bed_region:(4, 6) feed_range:(4, 6), value:2
#Chr Start End 1x 2x
chr 0 5 0.600 0.200
chr 3 5 1.500 0.500
chr 4 6 2.000 1.500
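An aside on reading these numbers: assuming each threshold column is reported as counts[i] divided by the bed-region length (the printed values are consistent with that, though this is an inference, not a quote of the PR code), a range that gets fed twice pushes the fraction past 1.0. A tiny sketch of that arithmetic for bed region chr 3 5, under this assumption:

```rust
fn main() {
    // Hypothetical walk-through for bed region chr 3 5 (length 2) with thresholds 1x and 2x.
    let region_len = 5u32 - 3;
    // In the buggy run above, the [3, 4) row (depth 1) is fed twice and [4, 5) (depth 2) once,
    // so the >=1x base count becomes 3 instead of 2, while the >=2x count stays at 1.
    let counts = [3u32, 1u32];
    for (c, label) in counts.iter().zip(["1x", "2x"]) {
        // Reproduces the 1.500 / 0.500 row reported above.
        println!("{label}: {:.3}", *c as f64 / region_len as f64);
    }
}
```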
If we make ranges.bed just:
chr 0 5
chr 3 5
Output looks better:
❯ ./target/release/d4tools stat -H -s perc_cov=1,2 -r ../perc-cov-testing/ranges.bed ../perc-cov-testing/out.d4
bed_region:(0, 5) feed_range:(0, 3), value:0
bed_region:(0, 5) feed_range:(3, 3), value:1
bed_region:(0, 5) feed_range:(0, 3), value:0
bed_region:(0, 5) feed_range:(3, 4), value:1
bed_region:(3, 5) feed_range:(0, 3), value:0
bed_region:(3, 5) feed_range:(3, 4), value:1
bed_region:(0, 5) feed_range:(4, 5), value:2
bed_region:(3, 5) feed_range:(4, 5), value:2
#Chr Start End 1x 2x
chr 0 5 0.400 0.200
chr 3 5 1.000 0.500
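As a cross-check on the expected values, here is a small self-contained sketch, independent of d4 (function and variable names are made up), that computes the per-threshold fractions directly from the depth profile implied by the debug output above (depth 0 on [0, 3), 1 on [3, 4), 2 on [4, 6)). It gives 0.400/0.200 for chr 0 5, 1.000/0.500 for chr 3 5, and 1.000/1.000 for chr 4 6, matching the corrected outputs later in the thread:

```rust
/// Illustration only: fraction of bases in [start, end) with depth >= threshold,
/// given non-overlapping (begin, end, depth) segments.
fn fraction_at_least(segments: &[(u32, u32, u32)], start: u32, end: u32, threshold: u32) -> f64 {
    let covered: u32 = segments
        .iter()
        .filter(|&&(_, _, depth)| depth >= threshold)
        .map(|&(s, e, _)| e.min(end).saturating_sub(s.max(start)))
        .sum();
    covered as f64 / (end - start) as f64
}

fn main() {
    // Depth profile implied by the debug prints above.
    let segs = [(0u32, 3u32, 0u32), (3, 4, 1), (4, 6, 2)];
    for &(s, e) in &[(0u32, 5u32), (3, 5), (4, 6)] {
        println!(
            "chr {} {} {:.3} {:.3}",
            s,
            e,
            fraction_at_least(&segs, s, e, 1),
            fraction_at_least(&segs, s, e, 2)
        );
    }
}
```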
> There is some weird double counting happening in feed_range.
Hmmm. Yes, there indeed seems to be something strange going on.
I tried reducing the ranges further, to get something minimal to debug:
chr 3 5
chr 4 5
This gave me:
$ /home/jakob/src/d4-format-cademirch/target/debug/d4tools stat -H -s perc_cov=1,2 -r ranges_minimal_debug.bed out.d4
feed_rows 3 4
feed_range:(3, 4), value:1
feed_rows 3 4
feed_range:(3, 4), value:1
feed_rows 3 4
feed_range:(3, 4), value:1
feed_rows 4 5
feed_range:(4, 5), value:2
feed_rows 4 5
feed_range:(4, 5), value:2
#Chr Start End 1x 2x
chr 3 5 1.500 0.500
chr 4 5 2.000 1.000
Indeed, something is strange here. I'll see if I can narrow it down.
Hi! I gave this branch a try using chanjo2 and, as you say, it looks like the values are double the expected ones.
These are, for instance, the values from the chanjo2 main branch (completeness computed with an algorithm in chanjo2):
This is the result using this branch of d4tools:
It's not exactly double the values, because the stats are calculated in a different way, but kind of 🤔
I suspect something is going wrong in the scan_partition_impl call in the track.rs file.
scan_partition_impl is called multiple times (once per range), and active_handles.iter_mut() is also processing specific ranges multiple times (up to once per range).
https://github.com/38/d4-format/blob/68d49025c8a3e57e90486e6566e8f24a8e125f06/d4/src/d4file/track.rs#L181-L198
The issue seems to be triggered at the bounds between different coverage levels.
This is fine:
#Chr Start End 1x 2x
chr 2 5 0.667 0.333
chr 3 5 1.000 0.500
But if you add 1 to the 3-5 range:
#Chr Start End 1x 2x
chr 2 5 1.000 0.333
chr 4 5 2.000 1.000
If it is extended to 2 nt on the right side, it is still weird:
#Chr Start End 1x 2x
chr 2 5 1.000 0.333
chr 4 6 2.000 1.500
Edit: Running only the second range looks fine.
#Chr Start End 1x 2x
chr 4 6 1.000 1.000
Can you guys see what might be going wrong here?
I'll circle back to this later this week, to see it again with fresh eyes.
Thanks for taking a look at this @Jakob37 and @northwestwitch. I too ended up at the scan_partition_impl closure in my debugging. I think the issue stems from last_right being set to part_left at the beginning of each partition scan. When this happens with multiple bed ranges that span multiple partitions (I believe partitions are defined where coverage changes), double feeds occur. My solution is to add a global_last_right outside the closure to keep track of the last right position across partitions:
```rust
fn scan_partition<DS: DataScanner<Self::RowType>>(&mut self, handles: &mut [DS]) {
    let default_primary_value = self.primary.default_value();
    let mut decoder = self.primary.make_decoder();
    let mut global_last_right = None; // Track the last right across all partitions
    let mut part_num = 0;
    scan_partition_impl(handles, |part_left, part_right, active_handles| {
        if let Some(default_value) = default_primary_value {
            part_num += 1;
            let iter = self.secondary.seek_iter(part_left);
            let mut last_right = global_last_right.unwrap_or(part_left);
            // let mut last_right = part_left;
            println!("scan_partition {} part left: {} part right: {} last_right: {}, global: {:?}", part_num, part_left, part_right, last_right, global_last_right);
            for (mut left, mut right, value) in iter {
                left = left.max(part_left);
                right = right.min(part_right).max(left);
                println!("scan_partition {} left: {} right: {}, last_right: {}", part_num, left, right, last_right);
                if last_right >= left {
                    left = last_right;
                }
                ...
```
Outputs: (Print debugging saves the day 😄)
❯ RUST_BACKTRACE=1 ./target/release/d4tools stat -H -s perc_cov=1,2 -r ../perc-cov-testing/ranges.bed ../perc-cov-testing/out.d4
scan_partition 1 part left: 0 part right: 3 last_right: 0, global: None
scan_partition 1 left: 3 right: 3, last_right: 0
feed_rows 1 last_right:0, left:3
feedrange bed_region:(0, 5) feed_range:(0, 3), value:0
feed_rows 1 left:3, right:3
feedrange bed_region:(0, 5) feed_range:(3, 3), value:1
scan_partition 2 part left: 0 part right: 4 last_right: 3, global: Some(3)
scan_partition 2 left: 3 right: 4, last_right: 3
feed_rows 2 left:3, right:4
feedrange bed_region:(0, 5) feed_range:(3, 4), value:1
feed_rows 2 left:3, right:4
feedrange bed_region:(3, 5) feed_range:(3, 4), value:1
scan_partition 3 part left: 0 part right: 5 last_right: 4, global: Some(4)
scan_partition 3 left: 3 right: 4, last_right: 4
feed_rows 3 left:4, right:4
feedrange bed_region:(0, 5) feed_range:(4, 4), value:1
feed_rows 3 left:4, right:4
feedrange bed_region:(3, 5) feed_range:(4, 4), value:1
feed_rows 3 left:4, right:4
feedrange bed_region:(4, 6) feed_range:(4, 4), value:1
scan_partition 3 left: 4 right: 5, last_right: 4
feed_rows 3 left:4, right:5
feedrange bed_region:(0, 5) feed_range:(4, 5), value:2
feed_rows 3 left:4, right:5
feedrange bed_region:(3, 5) feed_range:(4, 5), value:2
feed_rows 3 left:4, right:5
feedrange bed_region:(4, 6) feed_range:(4, 5), value:2
scan_partition 4 part left: 4 part right: 6 last_right: 5, global: Some(5)
scan_partition 4 left: 4 right: 6, last_right: 5
feed_rows 4 left:5, right:6
feedrange bed_region:(4, 6) feed_range:(5, 6), value:2
#Chr Start End 1x 2x
chr 0 5 0.400 0.200
chr 3 5 1.000 0.500
chr 4 6 1.000 1.000
Note: for the bed range chr 3 5, I think this is the correct output, since base 3 has a depth of 1 and base 4 has a depth of 2; thus 100% of bases are >= 1x and 50% are >= 2x.
I will set up some more tests for this fix and try to make sure this doesn't break anything else. Should I make a separate PR for this? Maybe @brentp has a preference?
Great job debugging, @cademirch! Home with a sick kid today, but I will test and see if I can "break" your new code later this week 😅
My workaround seems to break the test in d4tools/test/test-results/stat/region-stat. Still trying to wrap my head around it. In the test, none of the bed regions overlap, whereas overlapping regions seem to be the crux of the issue in this PR. It seems like scan_partition_impl was designed with the expectation that it would be serving data to only one bed interval at a time (this is speculation). It may take a more clever solution to handle both cases.
I have done some more digging around. It takes some time to work through this code!
I have another candidate solution.
In this chunk:
https://github.com/38/d4-format/blob/68d49025c8a3e57e90486e6566e8f24a8e125f06/d4/src/d4file/track.rs#L66-L82
The active handles are the overlapping ranges.
func will retrieve values from the d4file within the range and perform the supplied stat calculation for each handle (active_heap). This is what we debugged above. I suspect the errors here lead to follow-up errors in those parts. Maybe you can see if your numbers make sense with this change, @cademirch.
I used a simple example with intervals 0 - 10 and 4 - 5.
Here I could see that something went wrong. Calculations were performed for these ranges:
- For range 0 - 4
- For range 4 - 5
- For range 0 - 10 (this is the error - should be 5 - 10, I think)
Here are my debug prints. The 0 - 10 result is wrong - we expect 6 bases above 1x and 3 above 2x.
func (2) called for range: 0 4
active_heap[0]: 0 10
func (1) called for range: 4 5 (last_end 4)
active_heap[0]: 4 5
active_heap[1]: 0 10
func (1) called for range: 0 10 (last_end 5)
active_heap[0]: 0 10
No more handles left
#Chr Start End 1x 2x
chr 0 10 0.800 0.400
chr 4 5 1.000 1.000
I updated it such that it only performs the calculation for each interval once. I needed to update this both on line 73 and line 86.
```rust
let this_begin;
if top.get_range().0 > last_end {
    this_begin = top.get_range().0;
} else {
    this_begin = last_end;
}
```
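As a side note (purely a style observation, not part of the fix above, and assuming both values are the same unsigned integer type), the if/else is simply a max of two positions. A tiny standalone check of that equivalence, with plain u32 values standing in for top.get_range().0 and last_end:

```rust
fn main() {
    // Stand-ins for (top.get_range().0, last_end) pairs.
    for &(range_start, last_end) in &[(0u32, 4u32), (5, 4), (4, 4)] {
        // The fix above: start at whichever is larger.
        let this_begin = if range_start > last_end { range_start } else { last_end };
        assert_eq!(this_begin, range_start.max(last_end));
    }
    println!("if/else and max() agree");
}
```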
Then it calculates for the expected ranges, and gives the expected results.
func (2) called for range: 0 4
active_heap[0]: 0 10
func (1) called for range: 4 5 (last_end 4)
active_heap[0]: 4 5
active_heap[1]: 0 10
func (1) called for range: 5 10 (last_end 5)
active_heap[0]: 0 10
No more handles left
#Chr Start End 1x 2x
chr 0 10 0.600 0.300
chr 4 5 1.000 1.000
Trying it out on the ranges you used:
func (2) called for range: 0 3
active_heap[0]: 0 5
func (2) called for range: 3 4
active_heap[0]: 0 5
active_heap[1]: 3 5
func (1) called for range: 4 5 (last_end 4)
active_heap[0]: 0 5
active_heap[1]: 3 5
active_heap[2]: 4 6
func (1) called for range: 5 6 (last_end 5)
active_heap[0]: 4 6
#Chr Start End 1x 2x
chr 0 5 0.400 0.200
chr 3 5 1.000 0.500
chr 4 6 1.000 1.000
I will do some further testing here as well, and see if this works in the full test suite.
Looks like the change above passes the tests (guessing the remove-view errors are unrelated).
Wow, awesome @Jakob37! I was planning on picking up where I left off on this tomorrow morning, but it looks like you've figured it out! And I think this is the best solution since this is fixing the issue in the scan_partition_impl definition, rather than the closure we were looking at before.
> And I think this is the best solution since this is fixing the issue in the scan_partition_impl definition, rather than the closure we were looking at before.
Yes, I agree!
But then it seems we can start wrapping this up by adding tests and updating the docs.
I plan to test some more edge cases later today. Will let you know how it goes.
If there is something else I can help out with, let me know!
Great. I'll do some testing as well. Will you make a new PR with your fix? Then if all is good @brentp can merge that and then this one.
> Great. I'll do some testing as well. Will you make a new PR with your fix? Then if all is good @brentp can merge that and then this one.
OK, I'll put up the PR!
PR up: #84
Merged! Let's wrap this one up now. Seems like the remaining steps are:
- Document the changes in the command line interface
- Document the changes in the documentation
- Add one or several tests based on the 10nt d4-file (now present in the test data folder)
Just let me know if you want help with any of this @cademirch
@Jakob37 great! I'll try to get this done asap.
Alright @Jakob37 and @brentp, I think this is ready to merge!
Alright I think this should be good to go. LMK if there is anything else!
Nice! If @brentp doesn't have further comments, then let's get it in 👌
Would you be able to merge and release @brentp ? Many thanks 🙏