FormattedFile::readRecord is slow with std::string
stream/tokenization.h:311 Assertion failed : optr < ochunk.end should be true but was 0
Here's a minimal test case:
(env)mcrusoe@athyra:~/khmer/gl-master$ g++ -I/home/mcrusoe/src/seqan/seqan-1.4.2/core/include/ seqan-std-string-test-1.4.cc -o seqan-std-string-test-1.4 -g
(env)mcrusoe@athyra:~/khmer/gl-master$ g++ -I/home/mcrusoe/src/seqan/seqan-seqan-v2.0.0/include/ seqan-std-string-test-2.0.cc -o seqan-std-string-test-2.0 -g
(env)mcrusoe@athyra:~/khmer/gl-master$ ./seqan-std-string-test-1.4 tests/test-data/random-20-a.fa | head
35 CGCAGGCTGGATTCTAGAGGCAGAGGTGAGCTATAAGATATTGCATACGTTGAGCCAGC
16 CGGAAGCCCAATGAGTTGTCAGAGTCACCTCCACCCCGGGCCCTGTTAGCTACGTCCGT
46 GGTCGTGTTGGGTTAACAAAGGATCCCTGACTCGATCCAGCTGGGTAGGGTAACTATGT
40 GGCTGAAGGAGCGGGCGTACGTGTTTACGGCATGATGGCCGGTGATTATGGGGGACGGG
33 GCAGCGGCTTTGAATGCCGAATATATAACAGCGACGGGGTTCAATAAGCTGCACATGCG
98 ACCAGATGCATAGCCCAACAGCTGAGACATTCCCAGCTCGCGAACCAAGACGTGAGAGC
17 CCCTGTTAGCTACGTCCGTCTAAGGATATTAACATAGTTGCGACTGCGTCCTGTGCTCA
89 GCGAGATACTAGCAAAGGTTCATCAACAGCTACACCCGACGAACCCCGAGAAATTGGGA
30 GTTATGGTCCAGGATGAATGCGCGTACCGGGCGCCTATCACTCCTCTTGTCATTCAGAA
82 ATGCACTATATTTAAGAGGTCTAGAGTGTAAAAAGTGTACCCTTCGGGGTGGAGCTGTT
(env)mcrusoe@athyra:~/khmer/gl-master$ ./seqan-std-string-test-2.0 tests/test-data/random-20-a.fa
/home/mcrusoe/src/seqan/seqan-seqan-v2.0.0/include/seqan/stream/tokenization.h:311 Assertion failed : optr < ochunk.end should be true but was 0
Aborted (core dumped)
SeqAn 1.4 version:
#include <seqan/seq_io.h>
#include <iostream>
#include <string>

int main(int argc, char const ** argv) {
    seqan::SequenceStream _stream;
    seqan::open(_stream, argv[1]);
    std::string name;
    std::string sequence;
    std::string quality;
    while (!seqan::atEnd(_stream)) {
        // Reading each record into std::string targets works in SeqAn 1.4.
        seqan::readRecord(name, sequence, quality, _stream);
        std::cout << name << '\t' << sequence << '\t' << quality << '\n';
    }
    return 0;
}
SeqAn 2.0 version:
#include <seqan/seq_io.h>
#include <iostream>
#include <string>

int main(int argc, char const ** argv) {
    seqan::SeqFileIn _file;
    seqan::open(_file, argv[1]);
    std::string name;
    std::string sequence;
    std::string quality;
    // seqan::readRecord(name, sequence, _file);
    while (!seqan::atEnd(_file)) {
        // Reading into std::string targets triggers the assertion shown above.
        seqan::readRecord(name, sequence, quality, _file);
        std::cout << name << '\t' << sequence << '\t' << quality << '\n';
    }
    return 0;
}
(I ran into this while implementing the SeqAn 2.0 changes recommended by @esiragusa in https://github.com/esiragusa/khmer/commit/66f8c688d74acf06362dc91c35953f6c460c5bba)
We fixed the compiler error by disabling chunking for std::string (#870), so using std::string should now work.
However, the non-chunked version is slower than the chunked version, so we are keeping this issue open with the following tasks (a rough timing sketch follows the list):
- Add chunking functionality to std::string.
- Add chunking functionality to std::vector.
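Until then, the gap can be quantified with a rough timing harness along these lines (a sketch, not from the thread; `timeRead` is a hypothetical helper, and this assumes both target types compile and run after #870):

```cpp
#include <seqan/seq_io.h>

#include <chrono>
#include <iostream>
#include <string>

// Read every record of the file at `path` into the given target types and
// return the elapsed wall-clock time in seconds (-1.0 on open failure).
template <typename TName, typename TSeq, typename TQual>
double timeRead(char const * path)
{
    seqan::SeqFileIn file;
    if (!seqan::open(file, path))
        return -1.0;
    TName name;
    TSeq seq;
    TQual qual;
    auto start = std::chrono::steady_clock::now();
    while (!seqan::atEnd(file))
        seqan::readRecord(name, seq, qual, file);
    std::chrono::duration<double> elapsed = std::chrono::steady_clock::now() - start;
    return elapsed.count();
}

int main(int argc, char const ** argv)
{
    if (argc < 2)
        return 1;
    std::cout << "std::string targets:       "
              << timeRead<std::string, std::string, std::string>(argv[1]) << " s\n"
              << "seqan::CharString targets: "
              << timeRead<seqan::CharString, seqan::Dna5String, seqan::CharString>(argv[1])
              << " s\n";
    return 0;
}
```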
Note that the overloaded Iterator metafunction for the std::vector wrapper leads to the use of the default Chunk definition, which defines the type Nothing and hence disables chunking for vectors as well. See #870 for further discussion of this topic.
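For illustration, a self-contained toy of the dispatch described above might look as follows (the names mirror SeqAn's `Chunk`/`Nothing`, but this is a simplified sketch, not the actual library code):

```cpp
#include <iostream>
#include <string>
#include <type_traits>
#include <vector>

struct Nothing {};                          // tag type: "no chunk available"

// Default: no specialization, so chunking is disabled.
template <typename TContainer>
struct Chunk { typedef Nothing Type; };

struct SeqanLikeString {};                  // stand-in for a SeqAn string

// Specialization: expose a raw memory range, enabling the chunked fast path.
template <>
struct Chunk<SeqanLikeString>
{
    struct Type { char * begin; char * end; };
};

// Compile-time check analogous to what the tokenization layer performs.
template <typename TContainer>
void report(char const * label)
{
    bool chunked = !std::is_same<typename Chunk<TContainer>::Type, Nothing>::value;
    std::cout << label << ": "
              << (chunked ? "chunked (fast) path" : "per-character (slow) path")
              << '\n';
}

int main()
{
    report<std::string>("std::string");         // falls back to the default
    report<std::vector<char> >("std::vector");  // falls back to the default
    report<SeqanLikeString>("SeqAn string");    // uses the specialization
    return 0;
}
```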
Thank you @rrahn and company for the quick turnaround!
What sort of workaround can I implement in the meantime to test the speed improvements without having to use Dna5String and CharString in downstream code?
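One possible interim approach (a sketch, not an official recommendation; the `toStdString` helper is hypothetical) is to keep SeqAn types at the I/O boundary, where chunking works, and convert to `std::string` before handing data to downstream code:

```cpp
#include <seqan/seq_io.h>

#include <iostream>
#include <string>

// Hypothetical helper (not part of SeqAn): copy any SeqAn sequence into a
// std::string, converting each alphabet value to its character representation.
template <typename TSeq>
std::string toStdString(TSeq const & seq)
{
    std::string out;
    out.reserve(seqan::length(seq));
    for (unsigned i = 0; i < seqan::length(seq); ++i)
        out += static_cast<char>(seq[i]);
    return out;
}

int main(int argc, char const ** argv)
{
    if (argc < 2)
        return 1;
    seqan::SeqFileIn file;
    if (!seqan::open(file, argv[1]))
        return 1;
    // Read into SeqAn types so the chunked fast path is used ...
    seqan::CharString name, quality;
    seqan::Dna5String sequence;
    while (!seqan::atEnd(file))
    {
        seqan::readRecord(name, sequence, quality, file);
        // ... and convert at the boundary for std::string-based downstream code.
        std::cout << toStdString(name) << '\t'
                  << toStdString(sequence) << '\t'
                  << toStdString(quality) << '\n';
    }
    return 0;
}
```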
Any news on this, @temehi? It would be great to have this fixed for 2.2.0. If there are still issues, we can discuss them sometime this week.
@h-2

> Any news on this, @temehi?

Unfortunately, no!
As we discussed before, we need to come up with a new way of buffering and chunking; the current implementation works optimally only for SeqAn strings. We could discuss this next Monday, as I am not working the rest of the week.
I will modify the milestone to 3.0.0 and close this issue.
Why was this closed?
@mr-c We were unable to fix this in any of the 2.x.x releases due to a complicated design issue. Sorry for the inconvenience.
@temehi That's fine, I understand moving to a new milestone, but shouldn't the issue remain open?
Yes, please keep it open!
This will likely be fixed in SeqAn3.