Extra large kmer

Open apaytuvi opened this issue 3 years ago • 6 comments

Dear cuttlefish authors,

Thank you for this useful tool. I have a large database of genomes and want to reduce its redundancy in order to cut the computation time and RAM usage of mapping against it. I tried cuttlefish and it is useful, but I would like a larger k-mer size, e.g. 1000. Why? Long-read technology requires long sequences for correct mapping, but with k-mer lengths as low as 127, most sequences remain about that size, which is clearly not enough for long-read mapping. Do you have a suggestion for that?

Thanks,

apaytuvi avatar Oct 21 '22 08:10 apaytuvi

Hi @apaytuvi,

Thanks for using cuttlefish! I'll incorporate the capability of using extra-large k-mers into cuttlefish, but that might take a little time. In the meantime, I can try posting a hack in a separate branch for you to try out. Would that work for you?

Regards.

jamshed avatar Oct 24 '22 18:10 jamshed

That would be great. Thank you so much!

apaytuvi avatar Oct 25 '22 06:10 apaytuvi

Hi @apaytuvi: we've found some bug(s) in the initial k-mer enumeration phase of cuttlefish that only occur with huge k-values (e.g. with k >= 1000), hence the delay! I'll get back to this once we've addressed the issue.

jamshed avatar Oct 30 '22 20:10 jamshed

Thanks Jamshed for these efforts! No problem, I'll wait. Thanks again.



apaytuvi avatar Oct 31 '22 05:10 apaytuvi

It seems the bug has been solved and this feature should be available. Could you please confirm that @jamshed? Thanks a lot!

apaytuvi avatar Nov 28 '22 10:11 apaytuvi

Hi @apaytuvi: sorry for the delay in response!

I've pushed a new branch, extra-large-k, with the required support. It needs to be compiled from source, as instructed here, but with the cmake line replaced by the following:

cmake -DINSTANCE_COUNT=256 -DCMAKE_INSTALL_PREFIX=../ ..

Currently this supports k up to 1023. Let me know if you want to try even larger k; we can extend the range for that. Note that the installation takes quite some time with large k (you may use make -j install to speed it up with more threads). The execution is also quite time- and disk-heavy, specifically in the initial (k+1)-mer and k-mer enumeration stages.
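
For reference, the steps described above can be put together as a rough sketch. The branch name, the cmake flags, and make -j install come from this comment; the clone URL is the repository this issue belongs to. The build/ directory name and the out-of-source layout are assumptions based on a typical CMake workflow, and the function name is made up for illustration. Defer to the repository's own build instructions if they differ.

```shell
# Sketch only: build cuttlefish from the extra-large-k branch.
# Assumptions: a conventional out-of-source CMake build in ./build,
# and that the extra-large-k branch is still available upstream.

# Flags quoted from the comment above.
CMAKE_FLAGS="-DINSTANCE_COUNT=256 -DCMAKE_INSTALL_PREFIX=../"

# Hypothetical helper; runs in a subshell so the caller's working
# directory is left unchanged.
build_cuttlefish_extra_large_k() {
    (
        git clone https://github.com/COMBINE-lab/cuttlefish.git &&
        cd cuttlefish &&
        git checkout extra-large-k &&
        mkdir -p build &&
        cd build &&
        # CMAKE_FLAGS is intentionally unquoted so it splits into two arguments.
        cmake $CMAKE_FLAGS .. &&
        # Parallel build; expect this to take a while with large k.
        make -j install
    )
}
```

Calling build_cuttlefish_extra_large_k in an empty working directory would then run the whole sequence and stop at the first failing step.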

Let me know if you could test it successfully!

jamshed avatar Dec 05 '22 16:12 jamshed