seqkit icon indicating copy to clipboard operation
seqkit copied to clipboard

Seqkit sana fails on valid FASTQ

Open mw55309 opened this issue 2 years ago • 5 comments

e.g. this is valid FASTQ from the SRA:

@SRR2175749.1/1
GCCATACGTATCGATGTAAGCTGGATCCTCGGTTGTTGTTCCGCCAAAACCGCTCTTTGTAAACGGACTGTAAACGGTGGCACTTGCCGGCATATAGATAG
+SRR2175749.1/1
BBBFFFFFFFFFFFFFIIFFFFFIFFFFFFBFBFFFFIFFFBFFFF7<BFFIIFFFFFFB<BFBBBBBFB<<BBB<BB7<BBFFBF<7<BBBB<BFBBFFB
@SRR2175749.10/1
GGTTTGGATAGACCGCATCAACTTTGACAAGGTTATCTCCAACCTGCTCTCCAATGCCATGAAGTATACCTTTGATGGCGGCGAGATTACTGTTATCATAG
+SRR2175749.10/1
BBBFFFFFFFFFFIIIIIIIIIIIIFIIIIFIFFFIIIIIIIIIIIIIIIIIIFFIIIFFIIIIBFFFFFFFFFFFFFFFFFBFBFBBFFF<BBB<BBFBB

output:

[INFO] File: test1.fastq        Discarded line: Invalid line states!    1:       @SRR2175749.1/1
[INFO] File: test1.fastq        Discarded line: Invalid line states!    2:       GCCATACGTATCGATGTAAGCTGGATCCTCGGTTGTTGTTCCGCCAAAACCGCTCTTTGTAAACGGACTGTAAACGGTGGCACTTGCCGGCATATAGATAG
[INFO] File: test1.fastq        Discarded line: Invalid line states!    3:       +SRR2175749.1/1
[INFO] File: test1.fastq        Discarded line: Invalid line states!    4:       BBBFFFFFFFFFFFFFIIFFFFFIFFFFFFBFBFFFFIFFFBFFFF7<BFFIIFFFFFFB<BFBBBBBFB<<BBB<BB7<BBFFBF<7<BBBB<BFBBFFB
[INFO] File: test1.fastq        Discarded line: Invalid line states!    5:       @SRR2175749.10/1
[INFO] File: test1.fastq        Discarded line: Invalid line states!    6:       GGTTTGGATAGACCGCATCAACTTTGACAAGGTTATCTCCAACCTGCTCTCCAATGCCATGAAGTATACCTTTGATGGCGGCGAGATTACTGTTATCATAG
[INFO] File: test1.fastq        Discarded line: Invalid line states!    7:       +SRR2175749.10/1
[INFO] File: test1.fastq        Discarded line: Invalid line states!    8:       BBBFFFFFFFFFFIIIIIIIIIIIIFIIIIFIFFFIIIIIIIIIIIIIIIIIIFFIIIFFIIIIBFFFFFFFFFFFFFFFFFBFBFBBFFF<BBB<BBFBB
[INFO] File: test1.fastq        Pass records: 0 Discarded lines: 8

However, remove cany characters after the + sign, and it validates fine:

@SRR2175749.1/1
GCCATACGTATCGATGTAAGCTGGATCCTCGGTTGTTGTTCCGCCAAAACCGCTCTTTGTAAACGGACTGTAAACGGTGGCACTTGCCGGCATATAGATAG
+
BBBFFFFFFFFFFFFFIIFFFFFIFFFFFFBFBFFFFIFFFBFFFF7<BFFIIFFFFFFB<BFBBBBBFB<<BBB<BB7<BBFFBF<7<BBBB<BFBBFFB
@SRR2175749.10/1
GGTTTGGATAGACCGCATCAACTTTGACAAGGTTATCTCCAACCTGCTCTCCAATGCCATGAAGTATACCTTTGATGGCGGCGAGATTACTGTTATCATAG
+
BBBFFFFFFFFFFIIIIIIIIIIIIFIIIIFIFFFIIIIIIIIIIIIIIIIIIFFIIIFFIIIIBFFFFFFFFFFFFFFFFFBFBFBBFFF<BBB<BBFBB

output:

[INFO] File: test1.fastq        Pass records: 2 Discarded lines: 0
@SRR2175749.1/1
GCCATACGTATCGATGTAAGCTGGATCCTCGGTTGTTGTTCCGCCAAAACCGCTCTTTGTAAACGGACTGTAAACGGTGGCACTTGCCGGCATATAGATAG
+
BBBFFFFFFFFFFFFFIIFFFFFIFFFFFFBFBFFFFIFFFBFFFF7<BFFIIFFFFFFB<BFBBBBBFB<<BBB<BB7<BBFFBF<7<BBBB<BFBBFFB
@SRR2175749.10/1
GGTTTGGATAGACCGCATCAACTTTGACAAGGTTATCTCCAACCTGCTCTCCAATGCCATGAAGTATACCTTTGATGGCGGCGAGATTACTGTTATCATAG
+
BBBFFFFFFFFFFIIIIIIIIIIIIFIIIIFIFFFIIIIIIIIIIIIIIIIIIFFIIIFFIIIIBFFFFFFFFFFFFFFFFFBFBFBBFFF<BBB<BBFBB

If you want the actual files, they are here:

https://downloads.hmpdacc.org/dacc/hhs/genome/microbiome/wgs/analysis/hmwgsqc/v2/SRS1041116.tar.bz2

mw55309 avatar Dec 24 '23 13:12 mw55309

This was also reported here: https://github.com/shenwei356/seqkit/issues/408.

I just checked the code (written by @botond-sipos ) and found the method did not consider cases where the + line would be followed by some contents, which is a valid format. While character + would also appear in the quality line, it's hard to distinguish them just according to a single line. So I think it needs some big changes with the code, however, @botond-sipos is on his Christmas vacation.

I might rewrite it If I have time (not now, sorry).

shenwei356 avatar Dec 24 '23 15:12 shenwei356

Thanks!

If you point me to the relevant code, I can have a quick look to see how it might be fixed. I see where the error is thrown, but where is the matching/checking code?

mw55309 avatar Dec 24 '23 16:12 mw55309

Note also that #golang FASTQ parsers exist e.g. https://pkg.go.dev/github.com/TimothyStiles/[email protected]/io/fastq

mw55309 avatar Dec 24 '23 17:12 mw55309

The code is here: https://github.com/shenwei356/seqkit/blob/v2.6.1/seqkit/cmd/sana.go#L274-L279

Other subcommands in seqkit use my parser here: https://github.com/shenwei356/bio/blob/master/seqio/fastx/reader.go . It looks complicated because I want to seamlessly parse both FASTA and FASTQ with single- or multiple-line sequence/quality.

shenwei356 avatar Dec 24 '23 19:12 shenwei356

@mw55309 My apologies for the delayed response! Unfortunately, sana currently doesn't support FASTQ files with identifiers in the separator lines. I'll update the help message in a PR to clarify this. As @shenwei356 mentioned, it's difficult to reliably distinguish separator lines with identifiers from quality lines within sana's parsing system.

While other Golang FASTQ parsers exist, they typically stop at the first error. Sana's parser is designed to be more forgiving, skipping malformed lines and continuing to process the rest of the file.

I hope this explains sana's current limitations!

botond-sipos avatar Feb 23 '24 08:02 botond-sipos

I think the logic that makes sense to me is:

  1. If the line starts with + and it is 1 character long, it is a plus line (by the way this will fail for sequences of length 1 which have a quality of + but this is an extreme edge case)

  2. If the line starts with + and it is more than 1 character long, if the previous line was a sequence line, then it is the plus line, if the previous line was plus line, it is quality line, if the previous line is quality line, it is quality line

So can fix the problem by keeping a record of what the previous line category is?

mw55309 avatar Mar 04 '24 17:03 mw55309

Yes. We have to think about it. It's feasible.

shenwei356 avatar Mar 04 '24 21:03 shenwei356

@mw55309 I started working on parts of the suggested logic. It is a quick hack, but I implemented the lookback to decide the state of the separator lines. You can find it here, I would appreciate it if you tested it and let me know if it meets the expectations. If all goes well I will submit a PR after implementing some tests.

What we definitely cannot support in the current parsing framework is multiline sequences and qualities. The parser architecture is complicated by the fact that is an "online" parser capable of streaming records from files, which are being written. This feature is used in the scat subcommand. That said, a robust parser could be simpler and we discussed the possibility of a rewrite with @shenwei356 .

botond-sipos avatar Mar 05 '24 08:03 botond-sipos

Fixed in v2.8.1

shenwei356 avatar Apr 07 '24 09:04 shenwei356