d4tools stat gives occasionally very large (incorrect) numbers
We found this when working with long reads data, where coverage bins tend to be longer than for short reads data. We use d4tools to calculate bin-by-bin coverage. Here, frequently, certain mean coverage values became very large.
Example
$ ./d4tools stat --region 100bp_bins_more.bed example.d4
chr1 54000 54100 33.05
chr1 54100 54200 32.51
chr1 54200 54300 3083160.33
chr1 54300 54400 3083129.18
chr1 54400 54500 30.16
chr1 54500 54600 30
chr1 54600 54700 30
chr1 54700 54800 30
chr1 54800 54900 29.06
chr1 54900 55000 29.54
Underlying data
$ ./d4tools view --region-file 100bp_bins_more.bed example.d4
chr1 54000 54005 34
chr1 54005 54100 33
chr1 54100 54151 33
chr1 54151 54200 32
chr1 54200 54215 32
chr1 54215 54300 31
chr1 54300 54400 31
chr1 54400 54416 31
chr1 54416 54500 30
chr1 54500 54600 30
chr1 54600 54700 30
chr1 54700 54800 30
chr1 54800 54806 30
chr1 54806 54900 29
chr1 54900 54946 29
chr1 54946 55000 30
It looks like it comes from here:
https://github.com/38/d4-format/blob/6f4031309958a7a1e471c9081369f2012d82705e/d4/src/d4file/track.rs#L181-L207
Under certain conditions it loops indefinitely, exhausting the iterator of ranges rather than stopping when done with the requested range.
I have tried just breaking early when the left part of the coverage region is outside the part of interest, and it seems to work fine, i.e. by adding inside the loop:
if left >= part_right {
break;
}
Existing tests run fine with this added. I will see if I can trigger this error in a minimal test data case and come back with a PR.
Trying to set up a testing dataset for this using single chromosomes. The input coverage for these ranges is the same, but the stats now looks sane. This makes it hard for me to reproduce the issue. I am holding off on this for now.
I tested calculating coverage bins across a full short read GIAB dataset, and I found no extreme (i.e. >10000) bins there. So it seems to somehow be unique for the structure of long reads.
Have you guys tried using d4 files for coverage calculation for long reads @northwestwitch ? Have you seen anything similar to this?
Have you guys tried using d4 files for coverage calculation for long reads @northwestwitch ? Have you seen anything similar to this?
Hi, we haven't noticed it, but we don't have so many cases analysed with long reads yet, which could explain why