MMseqs2

Standard taxonomy taking a long time for a single bin

Open andrewjmc opened this issue 4 years ago • 5 comments

Expected Behavior

I expected mmseqs to annotate a single MAG fairly quickly.

Current Behavior

I am using the SemiBin GTDB database for mmseqs (https://zenodo.org/record/4751564/files/GTDB_v95.tar.gz) and the standard taxonomy command line with 24 threads and >100 GB RAM. mmseqs2 is progressing very slowly for this single bin (only 330 kb; the progress bar estimates hours to run). The authors of SemiBin quote the step that includes mmseqs taxonomic assignment as taking 90-120 minutes on similarly sized servers for contigs from whole datasets.

Does runtime scale with the search database rather than the query size? Have I done something wrong? All advice gratefully received.

MMseqs Output (for bugs)

MMseqs Version:                         13.45111
ORF filter                              1
ORF filter e-value                      100
ORF filter sensitivity                  2
LCA mode                                3
Taxonomy output mode                    0
Majority threshold                      0.5
Vote mode                               1
LCA ranks
Column with taxonomic lineage           0
Compressed                              0
Threads                                 24
Verbosity                               3
Taxon blacklist                         12908:unclassified sequences,28384:other sequences
Substitution matrix                     nucl:nucleotide.out,aa:blosum62.out
Add backtrace                           false
Alignment mode                          1
Alignment mode                          0
Allow wrapped scoring                   false
E-value threshold                       1
Seq. id. threshold                      0
Min alignment length                    0
Seq. id. mode                           0
Alternative alignments                  0
Coverage threshold                      0
Coverage mode                           0
Max sequence length                     65535
Compositional bias                      1
Max reject                              5
Max accept                              30
Include identical seq. id.              false
Preload mode                            0
Pseudo count a                          1
Pseudo count b                          1.5
Score bias                              0
Realign hits                            false
Realign score bias                      -0.2
Realign max seqs                        2147483647
Gap open cost                           nucl:5,aa:11
Gap extension cost                      nucl:2,aa:1
Zdrop                                   40
Seed substitution matrix                nucl:nucleotide.out,aa:VTML80.out
Sensitivity                             2
k-mer length                            0
k-score                                 2147483647
Alphabet size                           nucl:5,aa:21
Max results per query                   300
Split database                          0
Split mode                              2
Split memory limit                      0
Diagonal scoring                        true
Exact k-mer matching                    0
Mask residues                           1
Mask lower case residues                0
Minimum diagonal score                  15
Spaced k-mers                           1
Spaced k-mer pattern
Local temporary path
Rescore mode                            0
Remove hits by seq. id. and coverage    false
Sort results                            0
Mask profile                            1
Profile E-value threshold               0.001
Global sequence weighting               false
Allow deletions                         false
Filter MSA                              1
Maximum seq. id. threshold              0.9
Minimum seq. id.                        0
Minimum score per column                -20
Minimum coverage                        0
Select N most diverse seqs              1000
Min codons in orf                       30
Max codons in length                    32734
Max orf gaps                            2147483647
Contig start mode                       2
Contig end mode                         2
Orf start mode                          1
Forward frames                          1,2,3
Reverse frames                          1,2,3
Translation table                       1
Translate orf                           0
Use all table starts                    false
Offset of numeric ids                   0
Create lookup                           0
Add orf stop                            false
Overlap between sequences               0
Sequence split mode                     1
Header split mode                       0
Chain overlapping alignments            0
Merge query                             1
Search type                             0
Search iterations                       1
Start sensitivity                       4
Search steps                            1
Exhaustive search mode                  false
Filter results during exhaustive search 0
Strand selection                        1
LCA search mode                         false
Disk space limit                        0
MPI runner
Force restart with latest tmp           false
Remove temporary files                  false

extractorfs bin.1.mmseqs.db /rds/general/ephemeral/user/ephemeral//9711778946736545179/orfs_aa --min-length 30 --max-length 32734 --max-gaps 2147483647 --contig-start-mode 2 --contig-end-mode 2 --orf-start-mode 1 --forward-frames 1,2,3 --reverse-frames 1,2,3 --translation-table 1 --translate 1 --use-all-table-starts 0 --id-offset 0 --create-lookup 0 --threads 24 --compressed 0 -v 3

[=================================================================] 100.00% 6 0s 21ms
Time for merging to orfs_aa_h: 0h 0m 0s 276ms
Time for merging to orfs_aa: 0h 0m 0s 415ms
Time for processing: 0h 0m 1s 438ms
prefilter /rds/general/ephemeral/user/ephemeral//9711778946736545179/orfs_aa ../../../../resources/GTDB/mmseqs_gtdb/GTDB /rds/general/ephemeral/user/ephemeral//9711778946736545179/orfs_pref --sub-mat nucl:nucleotide.out,aa:blosum62.out --seed-sub-mat nucl:nucleotide.out,aa:VTML80.out -s 2 -k 0 --k-score 2147483647 --alph-size nucl:5,aa:21 --max-seq-len 65535 --max-seqs 1 --split 0 --split-mode 2 --split-memory-limit 0 -c 0 --cov-mode 0 --comp-bias-corr 1 --diag-score 0 --exact-kmer-matching 0 --mask 1 --mask-lower-case 0 --min-ungapped-score 3 --add-self-matches 0 --spaced-kmer-mode 1 --db-load-mode 0 --pca 1 --pcb 1.5 --threads 24 --compressed 0 -v 3

Query database size: 5696 type: Aminoacid
Target split mode. Searching through 3 splits
Estimated memory consumption: 124G
Target database size: 106052079 type: Aminoacid
Process prefiltering step 1 of 3

Index table k-mer threshold: 163 at k-mer size 7
Index table: counting k-mers
[=================================================================] 100.00% 35.35M 7m 55s 640ms
Index table: Masked residues: 89908004
Index table: fill
[===>                                                             ] 5.00% 1.77M eta 4h 48m 2s

Context

Providing context helps us come up with a solution and improve our documentation for the future.

Your Environment

  • Git commit used (The string after "MMseqs Version:" when you execute MMseqs without any parameters): 13.45111
  • Which MMseqs version was used (Statically-compiled, self-compiled, Homebrew, etc.): conda
  • Server specifications (especially CPU support for AVX2/SSE and amount of system memory): ~190 GB, HPC
  • Operating system and version: CentOS

andrewjmc avatar Aug 06 '21 10:08 andrewjmc

Can you check how much swap space is being used (free -h)? I guess the automatic memory limit detection is going wrong somehow. Do you set memory limits in the cluster environment? Can you restart without memory limits, or with higher ones? I am not sure if SemiBin exposes MMseqs2 options to users, but you could set --split-memory-limit so it processes the GTDB in smaller chunks. You would set this parameter to about 70-80% of the total allowed RAM.
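
A minimal sketch of the 70-80% rule described above, assuming the allowance is known in GiB. The helper name split_limit and the database names queryDB, GTDB, result and tmp are placeholders for illustration, not part of MMseqs2 or of this issue. The command is only printed, not executed.

```shell
# Hypothetical helper (not part of MMseqs2): derive a --split-memory-limit value.
# $1 = RAM you are allowed to use, in GiB
# $2 = percentage of that allowance to give to MMseqs2 (default 75)
split_limit() {
    allowed_gib=$1
    percent=${2:-75}
    # Integer arithmetic; the trailing G is the size suffix for the option value.
    echo "$(( allowed_gib * percent / 100 ))G"
}

# Example: a 100 GiB allowance.
limit=$(split_limit 100)

# Print (do not run) the resulting command. All database names are placeholders.
echo "mmseqs taxonomy queryDB GTDB result tmp --threads 24 --split-memory-limit $limit"
```

On a scheduler-managed node the allowance would be the memory requested for the job rather than the physical RAM of the machine.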

milot-mirdita avatar Aug 06 '21 11:08 milot-mirdita

Thanks. 180 G available. I was cheating here and just testing on a shared node. I was also just running mmseqs myself from the command line (but SemiBin does the same). Do you mean larger chunks, so it runs quicker, rather than smaller chunks?

andrewjmc avatar Aug 06 '21 11:08 andrewjmc

No, the issue is (probably) that it’s using too much memory, so the runtime degrades heavily. Larger chunks would mean a faster runtime if enough RAM were available. But with limited RAM, smaller chunks require less RAM and thus process more quickly.

milot-mirdita avatar Aug 06 '21 11:08 milot-mirdita

OK, I'll try that, thanks!

andrewjmc avatar Aug 06 '21 11:08 andrewjmc

Hi, I am also finding that mmseqs taxonomy runs much slower than expected. I ran a metagenome-assembled genome (MAG) as a query (after turning it into an mmseqs database) against nr (created using mmseqs databases). The query is 4.3M and it took about 4 hours to complete. @milot-mirdita, could you explain which number the 70-80% should be taken of? When I run free -h, I get:

               total        used        free      shared  buff/cache   available
Mem:           188Gi       1.1Gi       939Mi       2.0Mi       186Gi       186Gi
Swap:          8.0Gi        85Mi       7.9Gi

I have some memory-intensive programs running right now (bwa mem and metaSPAdes), so maybe this is slowing things down?

Thank you!

liamfriar avatar May 13 '23 17:05 liamfriar
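
The earlier reply in this thread ties the 70-80% figure to the "total allowed RAM" and asks about swap use. As a hedged sketch, the helper below (mem_summary, a hypothetical name) pulls the total, available and swap-used figures out of `free -m` output so they can be compared directly. It assumes the procps-ng column layout shown above (total, used, free, shared, buff/cache, available); older `free` versions print different columns. Whether the total column is the right allowance depends on any scheduler limits, which this sketch does not check.

```shell
# Hypothetical helper (not part of MMseqs2): summarise `free -m` output read from stdin.
# Prints total RAM, available RAM and used swap, all in MiB.
mem_summary() {
    awk '
        /^Mem:/  { total = $2; avail = $7 }   # columns: total used free shared buff/cache available
        /^Swap:/ { swap_used = $3 }           # columns: total used free
        END { printf "total_mib=%d available_mib=%d swap_used_mib=%d\n", total, avail, swap_used }
    '
}

# Usage on a Linux host (commented out so the sketch has no side effects):
# free -m | mem_summary
```

Running this while other memory-heavy jobs are active shows how much of the total is actually available to a new MMseqs2 run.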