ntHash icon indicating copy to clipboard operation
ntHash copied to clipboard

Inconsistent canonical hash values between forward and revcomp sequences of the same k-mer when k % 4 == 0 (not in v2.1.0 but) in v2.2.0

Open yoshihikosuzuki opened this issue 3 years ago • 0 comments

First of all, thank you for developing the great software!

I think I found a bug (and also its fix) in the latest version (v2.2.0) of ntHash. Specifically, the canonical hash value computed by the function NTC64(const char * kmerSeq, const unsigned k) can be different between the forward sequence and the reverse complement sequence of the same k-mer when k % 4 == 0.

The minimal code for reproducing the bug, which prints the forward/revcomp/canonical hash values for (A)_k and (T)_k, is as follows:

#include <string>
#include "nthash.hpp"
using namespace std;

void nthash(string kmer, int k) {
    uint64_t hVal, fhVal=0, rhVal=0;
    hVal = NTC64(kmer.c_str(), k, fhVal, rhVal);
    printf("%s: fhVal=%llu, rhVal=%llu, hval=%llu\n", kmer.c_str(), fhVal, rhVal, hVal);
}

int main(int argc, char *argv[]) {
    int k = atoi(argv[1]);
    nthash(string(k, 'A'), k);
    nthash(string(k, 'T'), k);
}

After compiling the code above, you can see the problem in a shell by:

$ ./a.out 1
A: fhVal=4362857412768957556, rhVal=2978368046464386134, hval=2978368046464386134
T: fhVal=2978368046464386134, rhVal=4362857412768957556, hval=2978368046464386134

$ ./a.out 2
AA: fhVal=5015898201438948509, rhVal=8935100007508790523, hval=5015898201438948509
TT: fhVal=8935100007508790523, rhVal=5015898201438948509, hval=5015898201438948509

$ ./a.out 3
AAA: fhVal=13237172352163388750, rhVal=16044915679164358049, hval=13237172352163388750
TTT: fhVal=16044915679164358049, rhVal=13237172352163388750, hval=13237172352163388750

$ ./a.out 4
AAAA: fhVal=10088133878355468230, rhVal=10664720106027613972, hval=10088133878355468230
TTTT: fhVal=5294909351029397815, rhVal=6047278271377325800, hval=5294909351029397815

where hval is inconsistent between (A)_k and (T)_k only when k=4 (which is because fhVal of (A)_k and rhVal of (T)_k are supposed to be the same but actually not, and vice versa).

This happens for large k values as well:

$ ./a.out 30
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAA: fhVal=7041943336510964649, rhVal=16957560051676611659, hval=7041943336510964649
TTTTTTTTTTTTTTTTTTTTTTTTTTTTTT: fhVal=16957560051676611659, rhVal=7041943336510964649, hval=7041943336510964649

$ ./a.out 31
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA: fhVal=18446744068065231655, rhVal=18446744065269976257, hval=18446744065269976257
TTTTTTTTTTTTTTTTTTTTTTTTTTTTTTT: fhVal=18446744065269976257, rhVal=18446744068065231655, hval=18446744065269976257

$ ./a.out 32
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA: fhVal=1477743835528954406, rhVal=15468376030029209044, hval=1477743835528954406
TTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTT: fhVal=2311764568324791045, rhVal=14083886662562284090, hval=2311764568324791045

$ ./a.out 33
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA: fhVal=13430845866873192448, rhVal=9511644074189258751, hval=9511644074189258751
TTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTT: fhVal=9511644074189258751, rhVal=13430845866873192448, hval=9511644074189258751

where the case of k=32 is problematic.

I believe the behavior above is due to the conditional processing based on remainder in NTF64:

inline uint64_t NTF64(const char * kmerSeq, const unsigned k) {

...

    unsigned remainder = k % 4;
    hVal = rolx(hVal, remainder);
    hVal = swapxbits033(hVal, remainder);
    if (remainder == 3) {
        uint8_t trimerLoc = 16 * convertTab[(unsigned char)kmerSeq[k - 3]] + 4 * convertTab[(unsigned char)kmerSeq[k - 2]] +  convertTab[(unsigned char)kmerSeq[k - 1]];
        hVal ^= trimerTab[trimerLoc];
    } else if (remainder == 2) {
        uint8_t dimerLoc = 4 * convertTab[(unsigned char)kmerSeq[k - 2]] +  convertTab[(unsigned char)kmerSeq[k - 1]];
        hVal ^= dimerTab[dimerLoc];
    }  else if (remainder == 1) {
        hVal ^= seedTab[(unsigned char)kmerSeq[k - 1]];
    }
    return hVal;
}

and I don't think we should update hVal when k % 4 == 0. After that fix, the values of hVal of (A)_k and (T)_k become identical for every case above. This part of the code was added in v2.2.0, and this bug does not appear in v2.1.0.

I will soon make a pull request (which is a very tiny change, though). I'm happy if you can check if the fix is correct.

yoshihikosuzuki avatar Jul 13 '22 07:07 yoshihikosuzuki