sdust
sdust copied to clipboard
Out of range bug (on behalf of Roberto Rossini)
Sdust produces wrong output for the following sequence:
>TEST_SEQ
AAACCCTACGCACACAAACCTCCCTCACACAANCCCTACGCNCACAAACCTNCCTCACACAAACCCTACGCACACAAACCTCCCTCACACAAACCCTACGCACACAAA
The sequence length is 108bp, but the range is 52 116
.
@lh3 I saw this too testing my fork of fastp integrating your code. I'm now just trimming the last masked range in the obvious way if it overhangs the sequence end. That seems to be working fine, unless there's any suspicion of some deeper correctness issue with masking near the end of the given sequence -- I don't yet understand the algorithm well enough to guess.
Here's a more-troubling case, where an entire range falls beyond the sequence length:
$ echo -e '>x\nATGTTTATGCTCATCGCTAAAAAGATTAAAATCGTTTGAGTTAAAGGATTTCACATGCTCGCTTTAGAAATTTATATTGATATTTGTTTGAAAGACGCTTTAATAGATTANTTCGTTTGAAAAAGGCTTTGATGATTTTTTTTATGTGGA' | tee x.fastq | wc -c
154
$ ./sdust x.fastq
x 167 175
Here's a more-troubling case, where an entire range falls beyond the sequence length:
$ echo -e '>x\nATGTTTATGCTCATCGCTAAAAAGATTAAAATCGTTTGAGTTAAAGGATTTCACATGCTCGCTTTAGAAATTTATATTGATATTTGTTTGAAAGACGCTTTAATAGATTANTTCGTTTGAAAAAGGCTTTGATGATTTTTTTTATGTGGA' | tee x.fastq | wc -c 154 $ ./sdust x.fastq x 167 175
The range overflow is caused by the base N
and a temporary work around for this is to treat N
as A
in the seq_nt4_table
:
unsigned char seq_nt4_table[256] = {
0, 1, 2, 3, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4,
4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4,
4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4,
4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4,
4, 0, 4, 1, 4, 4, 4, 2, 4, 4, 4, 4, 4, 4, 0/*N as A*/, 4,
4, 4, 4, 4, 3, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4,
4, 0, 4, 1, 4, 4, 4, 2, 4, 4, 4, 4, 4, 4, 0/*n as a*/, 4,
4, 4, 4, 4, 3, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4,
4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4,
4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4,
4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4,
4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4,
4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4,
4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4,
4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4,
4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4
};
- Test data
cat a.fa
>1
GCCAGGCTGGCCAAGGAGANTCttttttttttttttttttttttttAAGA
>2
GCCAGGCTGGCCAAGGAGATTCttttttttttttttttttttttttAAGA
./sdust -w50 -t20 a.fa
1 22 46
2 22 46