Estimated running time for createdb

Open yonghanyu opened this issue 4 years ago • 2 comments

Hi there,

I am currently using MMseqs2 to cluster more than 20 billion protein sequences. I intend to complete the task by running the createdb, clusthash and linclust modules. However, the createdb module alone (one-line .faa sequences, index only) has already taken more than 700 CPU hours and has not finished yet. In the paper, MMseqs2 clusters 1.6 billion sequences in around 10 hours. Does that figure include the time for the createdb and clusthash steps?

Also, are there any suggestions for speeding up the createdb module?

yonghanyu avatar Oct 08 '21 19:10 yonghanyu

MMseqs2 is limited to databases of at most ~4 billion sequences (UINT_MAX), so you have to cluster in multiple splits. @martin-steinegger should be able to help with an example.
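
To make the limit concrete, here is a rough back-of-the-envelope sketch in Python. It assumes UINT_MAX is the 32-bit value 2^32 - 1 and that the only constraint on a split is the sequence count; `min_splits` is a hypothetical helper for illustration, not part of MMseqs2.

```python
# Assumption: UINT_MAX is the 32-bit unsigned maximum, 2**32 - 1.
UINT_MAX = 2**32 - 1  # 4,294,967,295


def min_splits(total_sequences: int, max_per_split: int = UINT_MAX) -> int:
    """Smallest number of splits such that no split exceeds max_per_split.

    Hypothetical helper; uses integer ceiling division to avoid float rounding.
    """
    if total_sequences < 0 or max_per_split <= 0:
        raise ValueError("counts must be non-negative and the limit positive")
    return -(-total_sequences // max_per_split)


if __name__ == "__main__":
    # 20 billion sequences at the hard limit, and at a more conservative
    # 2 billion sequences per split.
    print(min_splits(20_000_000_000))
    print(min_splits(20_000_000_000, 2_000_000_000))
```

In practice you would probably stay well below the hard limit per split to leave headroom.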

milot-mirdita avatar Oct 08 '21 23:10 milot-mirdita

Hi, sorry to bother you, but is there any update on this?

For now I am splitting the proteins into multiple FASTA files, each containing at most 2 billion sequences. I will then run clusthash and linclust on each split. Finally, I will merge the results with a tool like mergedb and run a further round of clusthash/linclust. Is this the correct way to do it? I cannot find any related information in the documentation.
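
For reference, the splitting step could look roughly like the Python sketch below. It only covers cutting one FASTA file into chunks with a bounded number of records; `split_fasta`, the output naming scheme and the `max_records` parameter are my own hypothetical choices, and it does not call MMseqs2 at all.

```python
from pathlib import Path
from typing import List


def split_fasta(src: Path, out_dir: Path, max_records: int,
                prefix: str = "split") -> List[Path]:
    """Write src into consecutive FASTA files of at most max_records records.

    A record starts at a line beginning with '>'. Lines before the first
    header are skipped. Returns the paths of the files written.
    """
    if max_records <= 0:
        raise ValueError("max_records must be positive")
    out_dir.mkdir(parents=True, exist_ok=True)
    outputs: List[Path] = []
    out = None
    count = 0
    try:
        with src.open("r") as fh:
            for line in fh:
                if line.startswith(">"):
                    # Open a new output file at the first record or when
                    # the current file has reached its record limit.
                    if out is None or count == max_records:
                        if out is not None:
                            out.close()
                        path = out_dir / f"{prefix}_{len(outputs):04d}.fasta"
                        out = path.open("w")
                        outputs.append(path)
                        count = 0
                    count += 1
                if out is not None:
                    out.write(line)
    finally:
        if out is not None:
            out.close()
    return outputs
```

Because it streams line by line, memory use stays flat regardless of input size, and it works for both one-line and wrapped sequences. For 20 billion sequences a pure Python loop will be slow, so treat this as an illustration of the logic rather than a production tool.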

yonghanyu avatar Oct 10 '21 19:10 yonghanyu

Is there any update on this? Any suggestions for creating an MMseqs2 database of this size?

snayfach avatar Oct 11 '23 17:10 snayfach

Please create a new issue describing your use case.

If you want to search more than ~4 billion sequences at once, I'd recommend first clustering the database (in multiple stages, then merging the clusterings) to dereplicate it, and then searching against this smaller database.

Alternatively, I'd recommend creating multiple databases and searching each one individually. We have had multiple requests to implement a parameter that would set the real DB size for E-value calculation externally. Maybe something like that would help for your use case?
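
Until such a parameter exists, the idea can be approximated after the fact. The Python sketch below assumes the reported E-value scales linearly with database size, as in standard Karlin-Altschul statistics; `rescale_evalue` is a hypothetical helper, not an MMseqs2 feature, and both sizes must be given in the same unit the search tool uses for its E-value (for example residues).

```python
def rescale_evalue(evalue: float, searched_size: float,
                   combined_size: float) -> float:
    """Rescale an E-value from one split to the size of the full database.

    Assumption: E-values are proportional to database size, so a hit found
    in a split of searched_size gets its E-value multiplied by
    combined_size / searched_size.
    """
    if searched_size <= 0 or combined_size <= 0:
        raise ValueError("database sizes must be positive")
    return evalue * (combined_size / searched_size)


if __name__ == "__main__":
    # A hit with E = 1e-5 in one of four equally sized splits.
    print(rescale_evalue(1e-5, 1.0e9, 4.0e9))
```

Hits from all splits can then be pooled and filtered on the rescaled value, which is stricter than filtering each split on its own.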

milot-mirdita avatar Oct 15 '23 08:10 milot-mirdita