seqan icon indicating copy to clipboard operation
seqan copied to clipboard

FormattedFile::readRecord is slow with std::string

Open mr-c opened this issue 10 years ago • 10 comments

stream/tokenization.h:311 Assertion failed : optr < ochunk.end should be true but was 0

Here's a minimal test case:

(env)mcrusoe@athyra:~/khmer/gl-master$ g++ -I/home/mcrusoe/src/seqan/seqan-1.4.2/core/include/ seqan-std-string-test-1.4.cc -o seqan-std-string-test-1.4 -g
(env)mcrusoe@athyra:~/khmer/gl-master$ g++ -I/home/mcrusoe/src/seqan/seqan-seqan-v2.0.0/include/ seqan-std-string-test-2.0.cc -o seqan-std-string-test-2.0 -g
(env)mcrusoe@athyra:~/khmer/gl-master$ ./seqan-std-string-test-1.4 tests/test-data/random-20-a.fa | head
35      CGCAGGCTGGATTCTAGAGGCAGAGGTGAGCTATAAGATATTGCATACGTTGAGCCAGC
16      CGGAAGCCCAATGAGTTGTCAGAGTCACCTCCACCCCGGGCCCTGTTAGCTACGTCCGT
46      GGTCGTGTTGGGTTAACAAAGGATCCCTGACTCGATCCAGCTGGGTAGGGTAACTATGT
40      GGCTGAAGGAGCGGGCGTACGTGTTTACGGCATGATGGCCGGTGATTATGGGGGACGGG
33      GCAGCGGCTTTGAATGCCGAATATATAACAGCGACGGGGTTCAATAAGCTGCACATGCG
98      ACCAGATGCATAGCCCAACAGCTGAGACATTCCCAGCTCGCGAACCAAGACGTGAGAGC
17      CCCTGTTAGCTACGTCCGTCTAAGGATATTAACATAGTTGCGACTGCGTCCTGTGCTCA
89      GCGAGATACTAGCAAAGGTTCATCAACAGCTACACCCGACGAACCCCGAGAAATTGGGA
30      GTTATGGTCCAGGATGAATGCGCGTACCGGGCGCCTATCACTCCTCTTGTCATTCAGAA
82      ATGCACTATATTTAAGAGGTCTAGAGTGTAAAAAGTGTACCCTTCGGGGTGGAGCTGTT
(env)mcrusoe@athyra:~/khmer/gl-master$ ./seqan-std-string-test-2.0 tests/test-data/random-20-a.fa
/home/mcrusoe/src/seqan/seqan-seqan-v2.0.0/include/seqan/stream/tokenization.h:311 Assertion failed : optr < ochunk.end should be true but was 0
Aborted (core dumped)

Seqan 1.4 version:

#include <seqan/seq_io.h>

int main(int argc, char const ** argv) {
        seqan::SequenceStream _stream;
        seqan::open(_stream, argv[1]);
        std::string name;
        std::string sequence;
        std::string quality;
        while (!seqan::atEnd(_stream)) {
                seqan::readRecord(name, sequence, quality, _stream);
                std::cout << name << '\t' << sequence << '\t' << quality << '\n';
        } 
}

Seqan 2.0 version

#include <seqan/seq_io.h>

int main(int argc, char const ** argv) {
        seqan::SeqFileIn _file;
        seqan::open(_file, argv[1]);
        std::string name;
        std::string sequence;
        std::string quality;
        //seqan::readRecord(name, sequence, _file);
        while (!seqan::atEnd(_file)) {
                seqan::readRecord(name, sequence, quality, _file);
                std::cout << name << '\t' << sequence << '\t' << quality << '\n';
        } 
}

(I ran into this by implementing the Seqan 2.0 changes as recommended by @esiragusa in https://github.com/esiragusa/khmer/commit/66f8c688d74acf06362dc91c35953f6c460c5bba )

mr-c avatar Feb 24 '15 20:02 mr-c

We fixed the compiler error by disabling the chunking for std::strings (#870). So using std::strings should now work.

However, the non-chunked version is slower than the chunked version. So we keep this issue open with the following tasks:

  1. Add chunking functionality to std::string
  2. Add chunking functionality to std::vector.

Note, that the overloaded iterator mfn for the std::vector wrapper lead to the us of the default Chunk definition which defines the type Nothing and hence disables chunking for vectors as well. See #870 for further discussion on this topic.

rrahn avatar Feb 26 '15 11:02 rrahn

Thank you @rrahn and company for the quick turnaround!

What sort of workaround can I implement in the mean time to test the speed improvements without having to use Dna5String and CharString in downstream code?

mr-c avatar Feb 26 '15 18:02 mr-c

Any news on this @temehi ? Would be great to have this fixed for 2.2.0 If there are still issues we can discuss this sometime this week.

h-2 avatar Jun 13 '16 10:06 h-2

@h-2

Any news on this @temehi ?

Unfortunately no!

As we discussed before we need to come-up with a new way buffering and chunking ... the current implementation works only for seqan strings optimally. We could discuss this next Monday as I am not working the rest of the week

temehi avatar Jun 14 '16 12:06 temehi

I will modify the milestone to 3.0.0 and close this issue,

temehi avatar Sep 28 '17 13:09 temehi

? Why was this closed?

mr-c avatar Sep 28 '17 13:09 mr-c

@mr-c we were unable to fix this due to a complicated design issue for any of 2.x.x releases. Sorry for the inconvenience.

temehi avatar Sep 28 '17 14:09 temehi

@temehi That's fine, I understand moving to a new milestone, but shouldn't the issue remain open?

mr-c avatar Sep 28 '17 14:09 mr-c

Yes, please keep it open!

h-2 avatar Oct 01 '17 21:10 h-2

will likely be fixed in SeqAn3

rrahn avatar Nov 19 '18 16:11 rrahn