sdust icon indicating copy to clipboard operation
sdust copied to clipboard

Out of range bug (on behalf of Roberto Rossini)

Open lh3 opened this issue 5 years ago • 3 comments

Sdust produces wrong output for the following sequence:

>TEST_SEQ
AAACCCTACGCACACAAACCTCCCTCACACAANCCCTACGCNCACAAACCTNCCTCACACAAACCCTACGCACACAAACCTCCCTCACACAAACCCTACGCACACAAA

The sequence length is 108bp, but the range is 52 116.

lh3 avatar Feb 22 '20 23:02 lh3

@lh3 I saw this too testing my fork of fastp integrating your code. I'm now just trimming the last masked range in the obvious way if it overhangs the sequence end. That seems to be working fine, unless there's any suspicion of some deeper correctness issue with masking near the end of the given sequence -- I don't yet understand the algorithm well enough to guess.

mlin avatar Aug 01 '22 20:08 mlin

Here's a more-troubling case, where an entire range falls beyond the sequence length:

$ echo -e '>x\nATGTTTATGCTCATCGCTAAAAAGATTAAAATCGTTTGAGTTAAAGGATTTCACATGCTCGCTTTAGAAATTTATATTGATATTTGTTTGAAAGACGCTTTAATAGATTANTTCGTTTGAAAAAGGCTTTGATGATTTTTTTTATGTGGA' | tee x.fastq | wc -c
154
$ ./sdust x.fastq 
x       167     175

mlin avatar Aug 02 '22 08:08 mlin

Here's a more-troubling case, where an entire range falls beyond the sequence length:

$ echo -e '>x\nATGTTTATGCTCATCGCTAAAAAGATTAAAATCGTTTGAGTTAAAGGATTTCACATGCTCGCTTTAGAAATTTATATTGATATTTGTTTGAAAGACGCTTTAATAGATTANTTCGTTTGAAAAAGGCTTTGATGATTTTTTTTATGTGGA' | tee x.fastq | wc -c
154
$ ./sdust x.fastq 
x       167     175

The range overflow is caused by the base N and a temporary work around for this is to treat N as A in the seq_nt4_table:

unsigned char seq_nt4_table[256] = {
        0, 1, 2, 3,  4, 4, 4, 4,  4, 4, 4, 4,  4, 4, 4, 4,
        4, 4, 4, 4,  4, 4, 4, 4,  4, 4, 4, 4,  4, 4, 4, 4,
        4, 4, 4, 4,  4, 4, 4, 4,  4, 4, 4, 4,  4, 4, 4, 4,
        4, 4, 4, 4,  4, 4, 4, 4,  4, 4, 4, 4,  4, 4, 4, 4,
        4, 0, 4, 1,  4, 4, 4, 2,  4, 4, 4, 4,  4, 4, 0/*N as A*/, 4,
        4, 4, 4, 4,  3, 4, 4, 4,  4, 4, 4, 4,  4, 4, 4, 4,
        4, 0, 4, 1,  4, 4, 4, 2,  4, 4, 4, 4,  4, 4, 0/*n as a*/, 4,
        4, 4, 4, 4,  3, 4, 4, 4,  4, 4, 4, 4,  4, 4, 4, 4,
        4, 4, 4, 4,  4, 4, 4, 4,  4, 4, 4, 4,  4, 4, 4, 4,
        4, 4, 4, 4,  4, 4, 4, 4,  4, 4, 4, 4,  4, 4, 4, 4,
        4, 4, 4, 4,  4, 4, 4, 4,  4, 4, 4, 4,  4, 4, 4, 4,
        4, 4, 4, 4,  4, 4, 4, 4,  4, 4, 4, 4,  4, 4, 4, 4,
        4, 4, 4, 4,  4, 4, 4, 4,  4, 4, 4, 4,  4, 4, 4, 4,
        4, 4, 4, 4,  4, 4, 4, 4,  4, 4, 4, 4,  4, 4, 4, 4,
        4, 4, 4, 4,  4, 4, 4, 4,  4, 4, 4, 4,  4, 4, 4, 4,
        4, 4, 4, 4,  4, 4, 4, 4,  4, 4, 4, 4,  4, 4, 4, 4
};
  • Test data
cat a.fa
>1
GCCAGGCTGGCCAAGGAGANTCttttttttttttttttttttttttAAGA
>2
GCCAGGCTGGCCAAGGAGATTCttttttttttttttttttttttttAAGA

./sdust -w50 -t20 a.fa
1	22	46
2	22	46

slw287r avatar Aug 20 '22 12:08 slw287r