Seeking input on clustering multiple samples
Hi,
I have ~150 protein FASTA files (one per sample) from which I'd like to generate a non-redundant set of proteins. I tried concatenating the files and running the following:
mmseqs easy-linclust --cov-mode 0 -c 0.8 --min-seq-id 0.3 all_nomis_proteins.faa mmseqs2_output tmp
However, the concatenated file has ~400 million sequences, making this computationally infeasible. What would be the best approach here?
- Cluster smaller sets of samples and then re-cluster the results? (A sketch of what I mean is below.)
- For point 1 above, can one provide multiple input files, or does one have to concatenate the smaller sets?
- For re-clustering after an initial round, is it possible to output the non-redundant FASTA file?
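For point 1, here is a rough sketch of what I have in mind (batch01.faa etc. are hypothetical concatenations of subsets of the 150 samples; parameters copied from my command above):

# cluster each batch of samples separately
for batch in batch01 batch02 batch03; do
    mmseqs easy-linclust --cov-mode 0 -c 0.8 --min-seq-id 0.3 ${batch}.faa ${batch}_clu tmp_${batch}
done
# easy-linclust writes the cluster representatives to <prefix>_rep_seq.fasta;
# concatenate those and re-cluster the combined set
cat batch*_clu_rep_seq.fasta > all_reps.faa
mmseqs easy-linclust --cov-mode 0 -c 0.8 --min-seq-id 0.3 all_reps.faa final_clu tmp_final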
Thank you for your input! -Susheel
400 million sequences is not a big problem for linclust; it should run on any reasonably sized server in a day or so. We have clustered billions of sequences with linclust before.
However, it won't reach 30% sequence identity. For that you will need the normal clustering workflow, which runs linclust first and then uses the MMseqs2 search algorithms to cluster further. That might run for a few days to weeks.
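With the same parameters as your linclust call, that would look something like:

mmseqs easy-cluster --cov-mode 0 -c 0.8 --min-seq-id 0.3 all_nomis_proteins.faa mmseqs2_output tmp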
@milot-mirdita I'm running into segmentation fault issues when running with 2 nodes of 128 CPUs each (448 GB total memory). Further review revealed that I'm running out of memory.
Any recommendations on what kind of reasonably sized server you are referring to? And by 'normal clustering', do you mean mmseqs cluster?
I'm trying to reproduce the cascaded clustering approach described here: https://elifesciences.org/articles/67667#bib118. Is that the workflow you mean?
Thank you!
Can you try to run it on a single node (without MPI, etc.)? Issues in the MPI support might have gone unnoticed since we switched to 128-core machines.
Yes, I mean mmseqs (easy-)cluster with the normal clustering workflow. That one should also eventually finish successfully on a single one of these compute nodes.
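If memory becomes the bottleneck, the --split-memory-limit parameter can cap how much memory the prefilter stage uses, at some cost in runtime; e.g. something like the following (the 200G value is just an example, adjust it to your node):

mmseqs easy-cluster --cov-mode 0 -c 0.8 --min-seq-id 0.3 --threads 128 --split-memory-limit 200G all_nomis_proteins.faa mmseqs2_output tmp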
Can you please post the full log output? Maybe something else went wrong.
Okay, will give this a go and report back. Thank you!!
@milot-mirdita Below is the log file from one of the runs. It looks like it runs out of memory before the job dies.

And here is the job efficiency report from SLURM:
Job ID: 359779
Cluster: aion
User/Group: sbusi/clusterusers
State: OUT_OF_MEMORY (exit code 0)
Nodes: 1
Cores per node: 128
CPU Utilized: 10:13:07
CPU Efficiency: 1.00% of 42-16:44:48 core-walltime
Job Wall-clock time: 08:00:21
Memory Utilized: 206.26 GB
Memory Efficiency: 92.08% of 224.00 GB
Do you think merely providing more cores will do the trick, or is there something else that I'm missing?
Thank you!
UPDATE: Tried the run with more cores across 6 nodes. I didn't really expect it to work given your last comment, but it was worth a shot.
Job ID: 360184
Cluster: aion
User/Group: sbusi/clusterusers
State: OUT_OF_MEMORY (exit code 0)
Nodes: 6
Cores per node: 128
CPU Utilized: 09:26:34
CPU Efficiency: 0.23% of 172-18:08:00 core-walltime
Job Wall-clock time: 05:23:55
Memory Utilized: 1.21 TB (estimated maximum)
Memory Efficiency: 92.41% of 1.31 TB (1.75 GB/core)
Update: I managed to run the clustering successfully on a full 3 TB node with 112 threads. The SLURM efficiency report is below:
Job ID: 2976046
Cluster: iris
User/Group: sbusi/clusterusers
State: COMPLETED (exit code 0)
Nodes: 1
Cores per node: 112
CPU Utilized: 73-15:37:58
CPU Efficiency: 21.99% of 334-22:48:00 core-walltime
Job Wall-clock time: 2-23:46:30
Memory Utilized: 197.78 GB
Memory Efficiency: 6.70% of 2.88 TB
I'm running into a similar issue, but with contigs. Samples with even only a handful of contigs larger than ~200,000 bp seem to crash mmseqs easy-cluster because of memory (segmentation fault). I'm similarly getting poor memory efficiency:
Job ID: 1002827
Cluster: tinkercliffs
User/Group: clb21565/clb21565
State: FAILED (exit code 1)
Nodes: 1
Cores per node: 128
CPU Utilized: 03:00:21
CPU Efficiency: 25.62% of 11:44:00 core-walltime
Job Wall-clock time: 00:05:30
Memory Utilized: 13.43 GB
Memory Efficiency: 5.59% of 240.00 GB