d4-format icon indicating copy to clipboard operation
d4-format copied to clipboard

d4tools stat gives occasionally very large (incorrect) numbers

Open Jakob37 opened this issue 7 months ago • 3 comments

We found this when working with long reads data, where coverage bins tend to be longer than for short reads data. We use d4tools to calculate bin-by-bin coverage. Here, frequently, certain mean coverage values became very large.

Example

$ ./d4tools stat --region 100bp_bins_more.bed example.d4 
chr1    54000   54100   33.05
chr1    54100   54200   32.51
chr1    54200   54300   3083160.33
chr1    54300   54400   3083129.18
chr1    54400   54500   30.16
chr1    54500   54600   30
chr1    54600   54700   30
chr1    54700   54800   30
chr1    54800   54900   29.06
chr1    54900   55000   29.54

Underlying data

$ ./d4tools view --region-file 100bp_bins_more.bed example.d4 
chr1    54000   54005   34
chr1    54005   54100   33
chr1    54100   54151   33
chr1    54151   54200   32
chr1    54200   54215   32
chr1    54215   54300   31
chr1    54300   54400   31
chr1    54400   54416   31
chr1    54416   54500   30
chr1    54500   54600   30
chr1    54600   54700   30
chr1    54700   54800   30
chr1    54800   54806   30
chr1    54806   54900   29
chr1    54900   54946   29
chr1    54946   55000   30

It looks like it comes from here:

https://github.com/38/d4-format/blob/6f4031309958a7a1e471c9081369f2012d82705e/d4/src/d4file/track.rs#L181-L207

Under certain conditions it loops indefinitely, exhausting the iterator of ranges rather than stopping when done with the requested range.

I have tried just breaking early when the left part of the coverage region is outside the part of interest, and it seems to work fine, i.e. by adding inside the loop:

if left >= part_right {
  break;
}

Existing tests run fine with this added. I will see if I can trigger this error in a minimal test data case and come back with a PR.

Jakob37 avatar Jul 09 '25 10:07 Jakob37

Trying to set up a testing dataset for this using single chromosomes. The input coverage for these ranges is the same, but the stats now looks sane. This makes it hard for me to reproduce the issue. I am holding off on this for now.

Jakob37 avatar Aug 11 '25 09:08 Jakob37

I tested calculating coverage bins across a full short read GIAB dataset, and I found no extreme (i.e. >10000) bins there. So it seems to somehow be unique for the structure of long reads.

Have you guys tried using d4 files for coverage calculation for long reads @northwestwitch ? Have you seen anything similar to this?

Jakob37 avatar Aug 12 '25 07:08 Jakob37

Have you guys tried using d4 files for coverage calculation for long reads @northwestwitch ? Have you seen anything similar to this?

Hi, we haven't noticed it, but we don't have so many cases analysed with long reads yet, which could explain why

northwestwitch avatar Aug 12 '25 07:08 northwestwitch