Extra large kmer
Dear cuttlefish authors,
Thank you for this useful tool. I have a large database of genomes and I want to reduce its redundancy in order to cut computational time, improve speed, and lower RAM usage when mapping against such a big database. I tried cuttlefish and found it useful, but I would like a larger k-mer size, say 1000. Why? Long-read technology requires long sequences for correct mapping, but with low k-mer lengths such as 127, most output sequences stay around that size, which is clearly not enough for long-read mapping. Do you have a suggestion for that?
Thanks,
Hi @apaytuvi,
Thanks for using cuttlefish! I'll incorporate support for extra-large k-mers into cuttlefish, but that might take a little time. In the meantime, I can try posting a hack in a separate branch for you to try out. Would that work for you?
Regards.
That would be great. Thank you so much!
Hi @apaytuvi: we've found some bug(s) in the initial k-mer enumeration phase of cuttlefish, occurring only with huge k-values (e.g. with k >= 1000), hence the delay! I'll get back to this once we've addressed the issue.
Thanks Jamshed for these efforts! No problem, I'll wait. Thanks again.
It seems the bug has been solved and this feature should be available. Could you please confirm, @jamshed? Thanks a lot!
Hi @apaytuvi: sorry for the delay in response!
I've pushed a new branch, extra-large-k, with the required support. It needs to be compiled from source, as instructed here, but the cmake line needs to be replaced with the following:
cmake -DINSTANCE_COUNT=256 -DCMAKE_INSTALL_PREFIX=../ ..
Currently this supports k up to 1023. Let me know if you want to try with even larger k; we can extend the range for that.
But note that the installation takes quite some time with large k (you can use make -j install to speed it up with more threads). Execution is also quite time- and disk-heavy, specifically in the initial (k+1)-mer and k-mer enumeration stages.
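For reference, the full build sequence might look roughly like this. This is a sketch assuming the standard out-of-source CMake layout described in the cuttlefish README; the repository URL and branch name are taken from this thread, and the thread count passed to make is just an example:

```shell
# Clone the repository and switch to the branch with extra-large-k support
git clone https://github.com/COMBINE-lab/cuttlefish.git
cd cuttlefish
git checkout extra-large-k

# Out-of-source build; the INSTANCE_COUNT=256 setting (from this thread)
# enables support for k up to 1023
mkdir build && cd build
cmake -DINSTANCE_COUNT=256 -DCMAKE_INSTALL_PREFIX=../ ..

# Parallel build: compilation is slow with large k, so more jobs help
make -j 8 install
```

The cmake invocation itself is exactly the one given above; only the surrounding clone/checkout/build steps are filled in as an assumption.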
Let me know if you could test it successfully!