Question about SyncmerIterator
Hi,
I'm studying your implementation of SyncmerIterator, and I'm a bit confused. It's possible there's a bug in the code, but I think it's much more likely that I just don't understand the intended semantics correctly.
The code runs accumulators for for the current k-mer and the current s-mer, keeping both the forward and reverse complement *-mers, which makes sense.
It then makes a canonical choice of which s-mer to hash, and adds that to the queue.
What follows is a set of cases that evaluate or re-evaluate the position of the minimising s-mer.
The first question I have is whether it matters that the branch at https://github.com/ksahlin/strobealign/blob/d7e73cc2f4343c8c9b555ca91b8d25b9712eee0b/src/randstrobes.cpp#L92 and the branch at https://github.com/ksahlin/strobealign/blob/d7e73cc2f4343c8c9b555ca91b8d25b9712eee0b/src/randstrobes.cpp#L107 iterate in opposite directions? In the case of equal hashes, they will select a different minimum position, which could result in the selection of different strobemers. It is a very corner case, I concede, since the former branch in the code handles the very first k-mer, but unless I'm wrong, this could result in inconsistent strobmer choice. This is really a very minor detail, but I mention it since I have been examining the area of the code in detail.
The more important question pertains to the canonical choice of k-mers. I note that the code at https://github.com/ksahlin/strobealign/blob/d7e73cc2f4343c8c9b555ca91b8d25b9712eee0b/src/randstrobes.cpp#L119 makes a canonical choice between the forward and reverse complement sequences, and reports that hash as the syncmer.
However, if I present the code with the reverse complement of the original sequence, the same set of s-mers will be derived, but in the opposite order, so that the position of the minimiser will be at the opposite end of the queue, and so the test will not pass.
My understanding of the intended semantics is that the set of syncmers would be the same regardless of which strand the original sequence is on, and certainly by reporting the hash of the min k-mer, the code creates that impression.
I'd love to understand where my reasoning has gone off the rails!
Thanks,
Tom.
Hi and thanks for doing this analysis. I’ve so far only had time to look into the first issue, and I think it’s a bug that the very first k-mer is treated differently. What I’m wondering about is whether the same applies here: https://github.com/ksahlin/strobealign/blob/d7e73cc2f4343c8c9b555ca91b8d25b9712eee0b/src/randstrobes.cpp#L113 This should also only find the first minimum.
Please give me some time for looking into your other question/observation.
I think you're right, that perhaps the test on the line you cite should be <=.
To a certain extent, it doesn't necessarily matter exactly which precise definition you use for syncmers, so long as the implementation expresses that definition on all code paths.
The other issue is probably more important - if I give the sequence to be indexed on the reverse complement strand, will I get the same sequence of syncmers (in reverse order)? Should that be the case? (I think it should!)
What I haven't thought through carefully is how the strandedness issue intersects with the formation of strobemers.
I note that neither the Syncmer nor Strobemer papers mention strandedness.
To a certain extent, it doesn't necessarily matter exactly which precise definition you use for syncmers, so long as the implementation expresses that definition on all code paths.
Yes, and syncmers just need to be generated in the same way for the reference and the query. Even if the comparisons are a bit inconsistent, it doesn’t matter so much as long as it’s inconsistent in the same way.
The other issue is probably more important - if I give the sequence to be indexed on the reverse complement strand, will I get the same sequence of syncmers (in reverse order)? Should that be the case? (I think it should!)
I looked into this now as well and as a first exercise, wrote a test to ensure that syncmers are canonical (see #465/#464). You’re correct that the order of s-mers is reversed in the vector, but this does not matter because the s-mer that determines whether the syncmer is chosen or not is the one in the middle. The t parameter is chosen in that way. See also
https://github.com/ksahlin/strobealign/blob/d7e73cc2f4343c8c9b555ca91b8d25b9712eee0b/src/indexparameters.hpp#L32
So the relevant s-mer is in the same position both in forward and reverse direction.
However, there’s a slight inconsistency once again due to the way how ties are handled: Because the syncmer was chosen if the leftmost minimum was at that middle position, this is different in the reverse complemented version of the k-mer because the leftmost minimum would then be the rightmost one. The changes I made in #464 happen to fix that issue as well.
I note that neither the Syncmer nor Strobemer papers mention strandedness.
Yes, I think this also needs a bit more research. I’d like to switch to non-canonical syncmers and randstrobes because my impression is that it would both simplify the code in many places, speed up mapping, and make it easier to think about what happens, but this loses some accuracy, see the discussion in #405.
Hi,
As mentioned elsewhere I have been 'off' github because an old university email address was discontinued so I haven’t gotten any notifications from GitHub for about three weeks.
I now get the context behind the commit to sample all syncmers in homopolymer and other STR regions: https://github.com/ksahlin/strobealign/commit/d7c5cbe2ab2d44406552b40d1afa79b58d1a02e8#comments
Given this discussion I feel more inclined to accept sampling such regions more frequently if it means that syncmers are identical in forward and reverse directions. Initially I was sceptical because some seeds may become very repetitive, causing lower accuracy and more runtime. But I am inclined to see a working implementation of canonical strobemers, which makes it justified.