Inconsistent canonical hash values between forward and revcomp sequences of the same k-mer when k % 4 == 0 (not in v2.1.0 but) in v2.2.0
First of all, thank you for developing the great software!
I think I found a bug (and also its fix) in the latest version (v2.2.0) of ntHash. Specifically, the canonical hash value computed by the function NTC64(const char * kmerSeq, const unsigned k) can be different between the forward sequence and the reverse complement sequence of the same k-mer when k % 4 == 0.
The minimal code for reproducing the bug, which prints the forward/revcomp/canonical hash values for (A)_k and (T)_k, is as follows:
#include <string>
#include "nthash.hpp"
using namespace std;
void nthash(string kmer, int k) {
uint64_t hVal, fhVal=0, rhVal=0;
hVal = NTC64(kmer.c_str(), k, fhVal, rhVal);
printf("%s: fhVal=%llu, rhVal=%llu, hval=%llu\n", kmer.c_str(), fhVal, rhVal, hVal);
}
int main(int argc, char *argv[]) {
int k = atoi(argv[1]);
nthash(string(k, 'A'), k);
nthash(string(k, 'T'), k);
}
After compiling the code above, you can see the problem in a shell by:
$ ./a.out 1
A: fhVal=4362857412768957556, rhVal=2978368046464386134, hval=2978368046464386134
T: fhVal=2978368046464386134, rhVal=4362857412768957556, hval=2978368046464386134
$ ./a.out 2
AA: fhVal=5015898201438948509, rhVal=8935100007508790523, hval=5015898201438948509
TT: fhVal=8935100007508790523, rhVal=5015898201438948509, hval=5015898201438948509
$ ./a.out 3
AAA: fhVal=13237172352163388750, rhVal=16044915679164358049, hval=13237172352163388750
TTT: fhVal=16044915679164358049, rhVal=13237172352163388750, hval=13237172352163388750
$ ./a.out 4
AAAA: fhVal=10088133878355468230, rhVal=10664720106027613972, hval=10088133878355468230
TTTT: fhVal=5294909351029397815, rhVal=6047278271377325800, hval=5294909351029397815
where hval is inconsistent between (A)_k and (T)_k only when k=4 (which is because fhVal of (A)_k and rhVal of (T)_k are supposed to be the same but actually not, and vice versa).
This happens for large k values as well:
$ ./a.out 30
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAA: fhVal=7041943336510964649, rhVal=16957560051676611659, hval=7041943336510964649
TTTTTTTTTTTTTTTTTTTTTTTTTTTTTT: fhVal=16957560051676611659, rhVal=7041943336510964649, hval=7041943336510964649
$ ./a.out 31
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA: fhVal=18446744068065231655, rhVal=18446744065269976257, hval=18446744065269976257
TTTTTTTTTTTTTTTTTTTTTTTTTTTTTTT: fhVal=18446744065269976257, rhVal=18446744068065231655, hval=18446744065269976257
$ ./a.out 32
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA: fhVal=1477743835528954406, rhVal=15468376030029209044, hval=1477743835528954406
TTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTT: fhVal=2311764568324791045, rhVal=14083886662562284090, hval=2311764568324791045
$ ./a.out 33
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA: fhVal=13430845866873192448, rhVal=9511644074189258751, hval=9511644074189258751
TTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTT: fhVal=9511644074189258751, rhVal=13430845866873192448, hval=9511644074189258751
where the case of k=32 is problematic.
I believe the behavior above is due to the conditional processing based on remainder in NTF64:
inline uint64_t NTF64(const char * kmerSeq, const unsigned k) {
...
unsigned remainder = k % 4;
hVal = rolx(hVal, remainder);
hVal = swapxbits033(hVal, remainder);
if (remainder == 3) {
uint8_t trimerLoc = 16 * convertTab[(unsigned char)kmerSeq[k - 3]] + 4 * convertTab[(unsigned char)kmerSeq[k - 2]] + convertTab[(unsigned char)kmerSeq[k - 1]];
hVal ^= trimerTab[trimerLoc];
} else if (remainder == 2) {
uint8_t dimerLoc = 4 * convertTab[(unsigned char)kmerSeq[k - 2]] + convertTab[(unsigned char)kmerSeq[k - 1]];
hVal ^= dimerTab[dimerLoc];
} else if (remainder == 1) {
hVal ^= seedTab[(unsigned char)kmerSeq[k - 1]];
}
return hVal;
}
and I don't think we should update hVal when k % 4 == 0. After that fix, the values of hVal of (A)_k and (T)_k become identical for every case above. This part of the code was added in v2.2.0, and this bug does not appear in v2.1.0.
I will soon make a pull request (which is a very tiny change, though). I'm happy if you can check if the fix is correct.