Inconsistent results compared with Jellyfish and kmer_counter
Correction: Actually, all three produced different results.
Hello, I compared the output of DSK with the output of Jellyfish and KMER_COUNTER. I used the same .fa file for all three and generated the 7-mers with each package. While all three produced the same number of 7-mers (8192), only Jellyfish and KMER_COUNTER produced identical k-mer profiles (i.e. the same k-mers with the same frequencies). DSK differs from both of them by 1344 k-mers. All k-mers were sorted lexicographically, and I used a set difference to compute the results. Since two out of three tools agree, I was wondering whether DSK does anything differently. I know how DSK counts canonical k-mers, and I tried searching for the reversed k-mer string, but the k-mers still aren't in the output. Could you please let me know if there is something I am missing, perhaps in the flag settings? Thank you.
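For reference, the comparison described above can be sketched roughly as follows. This is only a minimal illustration, assuming each tool's output has been dumped to text with one `KMER COUNT` pair per line; that format, the toy data, and the function names `parse_profile` and `diff_profiles` are hypothetical and not taken from any of the three tools.

```python
from typing import Dict, Iterable, List, Tuple


def parse_profile(lines: Iterable[str]) -> Dict[str, int]:
    """Parse 'KMER COUNT' lines (an assumed text format) into a dict."""
    profile: Dict[str, int] = {}
    for line in lines:
        fields = line.split()
        if len(fields) != 2:
            continue  # skip blank or malformed lines
        kmer, count = fields
        profile[kmer.upper()] = int(count)
    return profile


def diff_profiles(
    a: Dict[str, int], b: Dict[str, int]
) -> Tuple[List[str], List[str], List[str]]:
    """Return k-mers only in a, only in b, and shared k-mers whose counts differ."""
    only_a = sorted(set(a) - set(b))
    only_b = sorted(set(b) - set(a))
    mismatched = sorted(k for k in set(a) & set(b) if a[k] != b[k])
    return only_a, only_b, mismatched


# Toy data standing in for two tools' dumps (not real output).
tool_a = parse_profile(["AAAAAAA 5", "GAAAAAA 2", "CCCCCCC 1"])
tool_b = parse_profile(["AAAAAAA 5", "TTTTTTC 2"])

only_a, only_b, mismatched = diff_profiles(tool_a, tool_b)
print(only_a)      # ['CCCCCCC', 'GAAAAAA']
print(only_b)      # ['TTTTTTC']
print(mismatched)  # []
```

Note that a plain string-based set difference like this reports `GAAAAAA` and `TTTTTTC` as different k-mers, even though they are reverse complements of each other.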
Hi, thanks for bringing this up. Does it only occur with 7-mers, or did you also see it with larger k, e.g. 21-mers? I must say I almost never test with such small k-mer sizes.
Hi, no, I only checked 7-mers because that is the value I need, but as I understand it, the result should not depend on the k-mer length. When the same sequence is analyzed by different software, the k-mer frequencies for a given value of k should be consistent. Isn't that true? Thank you.
Hi, keep in mind that DSK and Jellyfish do not normalize k-mers the same way. See: https://github.com/GATB/dsk/#kmers-and-their-reverse-complements
Also, by default DSK discards any k-mer seen only once; you can change that behavior by passing the parameter: -abundance-min 1.
If the issue remains, I'd appreciate a small test file to debug it further.
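To illustrate how the two points above could produce differing profiles, here is a small sketch in plain Python. It does not use DSK or Jellyfish code. It assumes that one tool picks the canonical strand by alphabetical order (A<C<G<T) and the other by a different nucleotide order; the A<C<T<G order used here is what I believe the linked README section describes, but please treat it as an assumption and verify it there. The function names and the `abundance_min` argument are illustrative only, loosely mirroring the `-abundance-min` option.

```python
from collections import Counter
from typing import Dict

COMPLEMENT = str.maketrans("ACGT", "TGCA")

# ASSUMPTION: these two orderings stand in for the tools' conventions.
LEXICOGRAPHIC = "ACGT"  # A < C < G < T
ALTERNATIVE = "ACTG"    # A < C < T < G (assumed; check the linked README)


def reverse_complement(kmer: str) -> str:
    return kmer.translate(COMPLEMENT)[::-1]


def canonical(kmer: str, order: str = LEXICOGRAPHIC) -> str:
    """Return whichever of kmer / reverse complement is smaller under `order`."""
    rank = {base: i for i, base in enumerate(order)}
    return min(kmer, reverse_complement(kmer), key=lambda s: [rank[b] for b in s])


def count_canonical(
    seq: str, k: int, order: str = LEXICOGRAPHIC, abundance_min: int = 1
) -> Dict[str, int]:
    """Count canonical k-mers, dropping those seen fewer than abundance_min times."""
    seq = seq.upper()
    counts: Counter = Counter()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if set(kmer) <= set("ACGT"):  # skip windows containing N etc.
            counts[canonical(kmer, order)] += 1
    return {km: c for km, c in counts.items() if c >= abundance_min}


def renormalize(profile: Dict[str, int], order: str = LEXICOGRAPHIC) -> Dict[str, int]:
    """Re-key an existing profile so its k-mers are canonical under `order`."""
    out: Counter = Counter()
    for kmer, count in profile.items():
        out[canonical(kmer, order)] += count
    return dict(out)


seq = "GAAAAAAC"  # toy sequence containing two 7-mers
lex = count_canonical(seq, 7, LEXICOGRAPHIC)
alt = count_canonical(seq, 7, ALTERNATIVE)
print(lex)  # {'GAAAAAA': 1, 'AAAAAAC': 1}
print(alt)  # {'TTTTTTC': 1, 'AAAAAAC': 1}
print(renormalize(alt, LEXICOGRAPHIC) == lex)   # True
print(count_canonical(seq, 7, abundance_min=2))  # {}
```

In this toy case the two profiles contain the same number of k-mers but different strings, and they only match once both are re-keyed to the same convention. With a threshold of 2, every singleton disappears, so both the normalization and the abundance cutoff need to be aligned before taking a set difference.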