Standard taxonomy taking a long time for a single bin
Expected Behavior
I expected mmseqs to annotate a single MAG quickly.
Current Behavior
I am using the SemiBin GTDB database for mmseqs (https://zenodo.org/record/4751564/files/GTDB_v95.tar.gz) and the standard taxonomy command line with 24 threads and >100 GB RAM. mmseqs2 is progressing very slowly for this single bin (only 330 kb, with an ETA of several hours). The SemiBin authors quote the step that includes mmseqs taxonomic assignment as taking 90-120 minutes on similarly sized servers for the contigs of whole datasets.
Does runtime scale with the search database rather than the query size? Have I done something wrong? All advice gratefully received.
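For reference, the workflow I mean is sketched below. File and database names are placeholders, not my exact paths, and the guard just skips the run if mmseqs or the input is missing:

```shell
# Sketch of the standard MMseqs2 taxonomy workflow referred to above.
# QUERY_FASTA and GTDB_DB are placeholder paths, not the exact ones used.
set -eu
QUERY_FASTA=bin.1.fa           # the single ~330 kb MAG
GTDB_DB=GTDB/mmseqs_gtdb/GTDB  # SemiBin's GTDB_v95 MMseqs2 database

if command -v mmseqs >/dev/null 2>&1 && [ -f "$QUERY_FASTA" ]; then
    mmseqs createdb "$QUERY_FASTA" queryDB              # build the query database
    mmseqs taxonomy queryDB "$GTDB_DB" taxResult tmp \
        --threads 24                                    # the step that is slow here
    mmseqs createtsv queryDB taxResult taxonomy.tsv     # human-readable report
else
    echo "mmseqs or input not available; workflow sketch only"
fi
```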
MMseqs Output (for bugs)
MMseqs Version: 13.45111
ORF filter 1
ORF filter e-value 100
ORF filter sensitivity 2
LCA mode 3
Taxonomy output mode 0
Majority threshold 0.5
Vote mode 1
LCA ranks
Column with taxonomic lineage 0
Compressed 0
Threads 24
Verbosity 3
Taxon blacklist 12908:unclassified sequences,28384:other sequences
Substitution matrix nucl:nucleotide.out,aa:blosum62.out
Add backtrace false
Alignment mode 1
Alignment mode 0
Allow wrapped scoring false
E-value threshold 1
Seq. id. threshold 0
Min alignment length 0
Seq. id. mode 0
Alternative alignments 0
Coverage threshold 0
Coverage mode 0
Max sequence length 65535
Compositional bias 1
Max reject 5
Max accept 30
Include identical seq. id. false
Preload mode 0
Pseudo count a 1
Pseudo count b 1.5
Score bias 0
Realign hits false
Realign score bias -0.2
Realign max seqs 2147483647
Gap open cost nucl:5,aa:11
Gap extension cost nucl:2,aa:1
Zdrop 40
Seed substitution matrix nucl:nucleotide.out,aa:VTML80.out
Sensitivity 2
k-mer length 0
k-score 2147483647
Alphabet size nucl:5,aa:21
Max results per query 300
Split database 0
Split mode 2
Split memory limit 0
Diagonal scoring true
Exact k-mer matching 0
Mask residues 1
Mask lower case residues 0
Minimum diagonal score 15
Spaced k-mers 1
Spaced k-mer pattern
Local temporary path
Rescore mode 0
Remove hits by seq. id. and coverage false
Sort results 0
Mask profile 1
Profile E-value threshold 0.001
Global sequence weighting false
Allow deletions false
Filter MSA 1
Maximum seq. id. threshold 0.9
Minimum seq. id. 0
Minimum score per column -20
Minimum coverage 0
Select N most diverse seqs 1000
Min codons in orf 30
Max codons in length 32734
Max orf gaps 2147483647
Contig start mode 2
Contig end mode 2
Orf start mode 1
Forward frames 1,2,3
Reverse frames 1,2,3
Translation table 1
Translate orf 0
Use all table starts false
Offset of numeric ids 0
Create lookup 0
Add orf stop false
Overlap between sequences 0
Sequence split mode 1
Header split mode 0
Chain overlapping alignments 0
Merge query 1
Search type 0
Search iterations 1
Start sensitivity 4
Search steps 1
Exhaustive search mode false
Filter results during exhaustive search 0
Strand selection 1
LCA search mode false
Disk space limit 0
MPI runner
Force restart with latest tmp false
Remove temporary files false
extractorfs bin.1.mmseqs.db /rds/general/ephemeral/user/ephemeral//9711778946736545179/orfs_aa --min-length 30 --max-length 32734 --max-gaps 2147483647 --contig-start-mode 2 --contig-end-mode 2 --orf-start-mode 1 --forward-frames 1,2,3 --reverse-frames 1,2,3 --translation-table 1 --translate 1 --use-all-table-starts 0 --id-offset 0 --create-lookup 0 --threads 24 --compressed 0 -v 3
[=================================================================] 100.00% 6 0s 21ms
Time for merging to orfs_aa_h: 0h 0m 0s 276ms
Time for merging to orfs_aa: 0h 0m 0s 415ms
Time for processing: 0h 0m 1s 438ms
prefilter /rds/general/ephemeral/user/ephemeral//9711778946736545179/orfs_aa ../../../../resources/GTDB/mmseqs_gtdb/GTDB /rds/general/ephemeral/user/ephemeral//9711778946736545179/orfs_pref --sub-mat nucl:nucleotide.out,aa:blosum62.out --seed-sub-mat nucl:nucleotide.out,aa:VTML80.out -s 2 -k 0 --k-score 2147483647 --alph-size nucl:5,aa:21 --max-seq-len 65535 --max-seqs 1 --split 0 --split-mode 2 --split-memory-limit 0 -c 0 --cov-mode 0 --comp-bias-corr 1 --diag-score 0 --exact-kmer-matching 0 --mask 1 --mask-lower-case 0 --min-ungapped-score 3 --add-self-matches 0 --spaced-kmer-mode 1 --db-load-mode 0 --pca 1 --pcb 1.5 --threads 24 --compressed 0 -v 3
Query database size: 5696 type: Aminoacid
Target split mode. Searching through 3 splits
Estimated memory consumption: 124G
Target database size: 106052079 type: Aminoacid
Process prefiltering step 1 of 3
Index table k-mer threshold: 163 at k-mer size 7
Index table: counting k-mers
[=================================================================] 100.00% 35.35M 7m 55s 640ms
Index table: Masked residues: 89908004
Index table: fill
[===> ] 5.00% 1.77M eta 4h 48m 2s
Context
Your Environment
- Git commit used (The string after "MMseqs Version:" when you execute MMseqs without any parameters): 13.45111
- Which MMseqs version was used (Statically-compiled, self-compiled, Homebrew, etc.): conda
- Server specifications (especially CPU support for AVX2/SSE and amount of system memory): ~190 Gb, HPC
- Operating system and version: CentOS
Can you check how much swap space is being used (free -h)? I guess the automatic memory limit detection is going wrong somehow. Do you set memory limits in the cluster environment? Can you restart without limits, or with higher ones? I am not sure if SemiBin exposes MMseqs2 options to users, but you could set --split-memory-limit so it processes the GTDB in smaller chunks. You would set this parameter to about 70-80% of the total allowed RAM.
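For example, roughly like this (a sketch: the taxonomy command and database names are placeholders for whatever SemiBin actually runs, and the limit is derived from /proc/meminfo on Linux):

```shell
# Sketch: derive ~75% of total RAM and pass it as --split-memory-limit.
set -eu
MEM_KB=$(awk '/^MemTotal:/ {print $2}' /proc/meminfo)  # total RAM in kB
LIMIT_G=$(( MEM_KB * 3 / 4 / 1024 / 1024 ))            # 75%, rounded down to GiB
echo "--split-memory-limit ${LIMIT_G}G"                # e.g. ~140G on a ~190 GB node

# Placeholder invocation; only runs if mmseqs and the query DB exist.
if command -v mmseqs >/dev/null 2>&1 && [ -f queryDB ]; then
    mmseqs taxonomy queryDB GTDB taxResult tmp \
        --threads 24 --split-memory-limit "${LIMIT_G}G"
fi
```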
Thanks. 180 GB is available. I was cheating here and just testing on a shared node. I was also just running mmseqs myself from the command line (but SemiBin does the same). Do you mean larger chunks, so it runs quicker, rather than smaller chunks?
No, the issue is (probably) that it is using too much memory, so runtime is degrading heavily. Larger chunks would mean a faster runtime if enough RAM were available. But with limited RAM, smaller chunks require less memory and thus process more quickly.
OK, I'll try that thanks!
Hi, I am also finding mmseqs taxonomy runs much slower than expected. I have run a metagenome-assembled genome (MAG) as a query (after turning it into an mmseqs database) against nr (created using mmseqs databases). The query is 4.3 MB and it took about 4 hours to complete. @milot-mirdita could you explain which number you took 70-80% of? When I run free -h, I get:
              total        used        free      shared  buff/cache   available
Mem:          188Gi       1.1Gi       939Mi       2.0Mi       186Gi       186Gi
Swap:         8.0Gi        85Mi       7.9Gi
I have some memory intensive programs running right now (bwa mem and metaSPAdes), so maybe this is slowing things down?
Thank you!