Unstable hash values for k>64
Hi, thanks for inventing this great algorithm! And glad to see ntHash2 is also published.
I've used ntHash in several projects of mine, specifically, using a Golang implementation (by @will-rowe) of ntHash v1.0.4. It's really fast!
I just find that the hash values are unstable for k > 64. I guess it's related to the algorithm itself, cause it happens to be 64, not another number. I am not familiar with C/C++, so I can't check the original implementation. Here are steps with tools using the Golang implementation to reproduce the issue.
$ echo -ne ">s\nACGAAGAATACACAACTATGTACCGGGGGGCTTTGGGGAGAAAAAGGAAAAAATAAAATCTTTAA\n" | seqkit stats
file format type num_seqs sum_len min_len avg_len max_len
- FASTA DNA 1 65 65 65 65
# compute canonical ntHash with https://github.com/shenwei356/unikmer
$ echo -ne ">s\nACGAAGAATACACAACTATGTACCGGGGGGCTTTGGGGAGAAAAAGGAAAAAATAAAATCTTTAA\n" \
| unikmer count -H -K -k 65 -l | unikmer view
14872199115326727818 ****
# now we add one base in the start position.
$ echo -ne ">s\naACGAAGAATACACAACTATGTACCGGGGGGCTTTGGGGAGAAAAAGGAAAAAATAAAATCTTTAA\n" \
| unikmer count -H -K -k 65 -l | unikmer view
5042997269030491403
13219011773654478434 ****
$ echo -ne ">s\ncACGAAGAATACACAACTATGTACCGGGGGGCTTTGGGGAGAAAAAGGAAAAAATAAAATCTTTAA\n" \
| unikmer count -H -K -k 65 -l | unikmer view
5252439041897790003
1248486797404628174 ****
$ echo -ne ">s\ngACGAAGAATACACAACTATGTACCGGGGGGCTTTGGGGAGAAAAAGGAAAAAATAAAATCTTTAA\n" \
| unikmer count -H -K -k 65 -l | unikmer view
6432712771380638299
10232415768797241538 ****
$ echo -ne ">s\ntACGAAGAATACACAACTATGTACCGGGGGGCTTTGGGGAGAAAAAGGAAAAAATAAAATCTTTAA\n" \
| unikmer count -H -K -k 65 -l | unikmer view
5774425108696765737
11299031900349869606 ****
While for k=64, it's stable (7718595180140858881):
$ echo -ne ">s\nCGAAGAATACACAACTATGTACCGGGGGGCTTTGGGGAGAAAAAGGAAAAAATAAAATCTTTAA\n" \
| unikmer count -H -K -k 64 -l | unikmer view
7718595180140858881
$ echo -ne ">s\naCGAAGAATACACAACTATGTACCGGGGGGCTTTGGGGAGAAAAAGGAAAAAATAAAATCTTTAA\n" \
| unikmer count -H -K -k 64 -l | unikmer view
8752650216170443135
7718595180140858881
$ echo -ne ">s\ncCGAAGAATACACAACTATGTACCGGGGGGCTTTGGGGAGAAAAAGGAAAAAATAAAATCTTTAA\n" \
| unikmer count -H -K -k 64 -l | unikmer view
9222164657016850147
7718595180140858881
$ echo -ne ">s\ntCGAAGAATACACAACTATGTACCGGGGGGCTTTGGGGAGAAAAAGGAAAAAATAAAATCTTTAA\n" \
| unikmer count -H -K -k 64 -l | unikmer view
8329673441993552238
7718595180140858881
echo -ne ">s\ntCGAAGAATACACAACTATGTACCGGGGGGCTTTGGGGAGAAAAAGGAAAAAATAAAATCTTTAA\n" \
| unikmer count -H -K -k 64 -l | unikmer view
8329673441993552238
7718595180140858881
Hi @shenwei356 and thanks for your comments! This was a known issue with the early versions of ntHash (see this issue and also ntHash2's paper). We've fixed the periodicity problem in ntHash2. The Golang implementation probably contains the old bug that causes your issue.
Unfortunately, we don't have a Golang implementation for ntHash2. That said, ntHash2 can be used in Python with btllib.
Thanks @shenwei356 for reporting this! We'll try to sync up with @will-rowe and @luizirber to have new releases for Golang and Rust implementation.
I see, thank you all! Currently, I'll limit the k-mer size below 64 for my projects.
Apparently, SWIG (the tool used in btllib to generate Python wrappers) also supports Go. I'll try it out on ntHash and post any updates here.