easy-search (speed-up)

Open mmpust opened this issue 3 years ago • 0 comments

Expected Behavior

I want to functionally annotate a set of representative sequences (a 75 GB FASTA file) with eggNOG and PFAM.

Current Behavior

I started with the eggNOG annotation, which has now been running for more than 120 hours. Is there a way to speed up the process?

MMseqs Output (for bugs)

repSEQS.fna
Create directory repSEQS_eggnog.tmp
easy-search repSEQS.fna databases/eggnog repSEQS_eggnog.csv repSEQS_eggnog.tmp  \
 --dbtype 2 \
 --split-memory-limit 300G \
 --threads 56 \
 --remove-tmp-files false \
 --greedy-best-hits 1 

MMseqs Version:                        	8ff26f23a6b880df36cadb707890084503ceaffb
Substitution matrix                    	aa:blosum62.out,nucl:nucleotide.out
Add backtrace                          	false
Alignment mode                         	3
Alignment mode                         	0
Allow wrapped scoring                  	false
E-value threshold                      	0.001
Seq. id. threshold                     	0
Min alignment length                   	0
Seq. id. mode                          	0
Alternative alignments                 	0
Coverage threshold                     	0
Coverage mode                          	0
Max sequence length                    	65535
Compositional bias                     	1
Compositional bias                     	1
Max reject                             	2147483647
Max accept                             	2147483647
Include identical seq. id.             	false
Preload mode                           	0
Pseudo count a                         	substitution:1.100,context:1.400
Pseudo count b                         	substitution:4.100,context:5.800
Score bias                             	0
Realign hits                           	false
Realign score bias                     	-0.2
Realign max seqs                       	2147483647
Correlation score weight               	0
Gap open cost                          	aa:11,nucl:5
Gap extension cost                     	aa:1,nucl:2
Zdrop                                  	40
Threads                                	56
Compressed                             	0
Verbosity                              	3
Seed substitution matrix               	aa:VTML80.out,nucl:nucleotide.out
Sensitivity                            	5.7
k-mer length                           	0
k-score                                	seq:2147483647,prof:2147483647
Alphabet size                          	aa:21,nucl:5
Max results per query                  	300
Split database                         	0
Split mode                             	2
Split memory limit                     	300G
Diagonal scoring                       	true
Exact k-mer matching                   	0
Mask residues                          	1
Mask residues probability              	0.9
Mask lower case residues               	0
Minimum diagonal score                 	15
Selected taxa                          	
Spaced k-mers                          	1
Spaced k-mer pattern                   	
Local temporary path                   	
Rescore mode                           	0
Remove hits by seq. id. and coverage   	false
Sort results                           	0
Mask profile                           	1
Profile E-value threshold              	0.001
Global sequence weighting              	false
Allow deletions                        	false
Filter MSA                             	1
Use filter only at N seqs              	0
Maximum seq. id. threshold             	0.9
Minimum seq. id.                       	0.0
Minimum score per column               	-20
Minimum coverage                       	0
Select N most diverse seqs             	1000
Pseudo count mode                      	0
Gap pseudo count                       	10
Min codons in orf                      	30
Max codons in length                   	32734
Max orf gaps                           	2147483647
Contig start mode                      	2
Contig end mode                        	2
Orf start mode                         	1
Forward frames                         	1,2,3
Reverse frames                         	1,2,3
Translation table                      	1
Translate orf                          	0
Use all table starts                   	false
Offset of numeric ids                  	0
Create lookup                          	0
Add orf stop                           	false
Overlap between sequences              	0
Sequence split mode                    	1
Header split mode                      	0
Chain overlapping alignments           	0
Merge query                            	1
Search type                            	0
Search iterations                      	1
Start sensitivity                      	4
Search steps                           	1
Exhaustive search mode                 	false
Filter results during exhaustive search	0
Strand selection                       	1
LCA search mode                        	false
Disk space limit                       	0
MPI runner                             	
Force restart with latest tmp          	false
Remove temporary files                 	false
Alignment format                       	0
Format alignment output                	query,target,fident,alnlen,mismatch,gapopen,qstart,qend,tstart,tend,evalue,bits
Database output                        	false
Overlap threshold                      	0
Database type                          	2
Shuffle input database                 	true
Createdb mode                          	0
Write lookup file                      	0
Greedy best hits                       	true

Alignment backtraces will be computed, since they were requested by output format.
createdb repSEQS.fna repSEQS_eggnog.tmp/16640501639052377423/query --dbtype 2 --shuffle 1 --createdb-mode 0 --write-lookup 0 --id-offset 0 --compressed 0 -v 3 

Converting sequences
[===================================================================================================	1 Mio. sequences processed
===================================================================================================	2 Mio. sequences processed
===================================================================================================	3 Mio. sequences processed
===================================================================================================	4 Mio. sequences processed
===================================================================================================	5 Mio. sequences processed
===================================================================================================	6 Mio. sequences processed
===================================================================================================	7 Mio. sequences processed
===================================================================================================	8 Mio. sequences processed
===================================================================================================	9 Mio. sequences processed
===================================================================================================	10 Mio. sequences processed
===================================================================================================	11 Mio. sequences processed
===================================================================================================	12 Mio. sequences processed
===================================================================================================	13 Mio. sequences processed
===================================================================================================	14 Mio. sequences processed
===================================================================================================	15 Mio. sequences processed
===================================================================================================	16 Mio. sequences processed
===================================================================================================	17 Mio. sequences processed
===================================================================================================	18 Mio. sequences processed
===================================================================================================	19 Mio. sequences processed
===================================================================================================	20 Mio. sequences processed
===================================================================================================	21 Mio. sequences processed
===================================================================================================	22 Mio. sequences processed
===================================================================================================	23 Mio. sequences processed
===================================================================================================	24 Mio. sequences processed
===================================================================================================	25 Mio. sequences processed
===================================================================================================	26 Mio. sequences processed
===================================================================================================	27 Mio. sequences processed
===================================================================================================	28 Mio. sequences processed
===================================================================================================	29 Mio. sequences processed
===================================================================================================	30 Mio. sequences processed
===================================================================================================	31 Mio. sequences processed
===================================================================================================	32 Mio. sequences processed
===================================================================================================	33 Mio. sequences processed
===================================================================================================	34 Mio. sequences processed
===================================================================================================	35 Mio. sequences processed
===================================================================================================	36 Mio. sequences processed
===================================================================================================	37 Mio. sequences processed
===================================================================================================	38 Mio. sequences processed
===================================================================================================	39 Mio. sequences processed
===================================================================================================	40 Mio. sequences processed
===================================================================================================	41 Mio. sequences processed
===================================================================================================	42 Mio. sequences processed
===================================================================================================	43 Mio. sequences processed
===================================================================================================	44 Mio. sequences processed
===================================================================================================	45 Mio. sequences processed
===================================================================================================	46 Mio. sequences processed
===================================================================================================	47 Mio. sequences processed
===================================================================================================	48 Mio. sequences processed
===================================================================================================	49 Mio. sequences processed
===================================================================================================	50 Mio. sequences processed
===================================================================================================	51 Mio. sequences processed
===================================================================================================	52 Mio. sequences processed
===================================================================================================	53 Mio. sequences processed
===================================================================================================	54 Mio. sequences processed
===================================================================================================	55 Mio. sequences processed
===================================================================================================	56 Mio. sequences processed
===================================================================================================	57 Mio. sequences processed
===================================================================================================	58 Mio. sequences processed
===================================================================================================	59 Mio. sequences processed
===================================================================================================	60 Mio. sequences processed
===================================================================================================	61 Mio. sequences processed
===================================================================================================	62 Mio. sequences processed
===================================================================================================	63 Mio. sequences processed
===================================================================================================	64 Mio. sequences processed
===================================================================================================	65 Mio. sequences processed
===================================================================================================	66 Mio. sequences processed
===================================================================================================	67 Mio. sequences processed
===================================================================================================	68 Mio. sequences processed
===================================================================================================	69 Mio. sequences processed
===================================================================================================	70 Mio. sequences processed
===================================================================================================	71 Mio. sequences processed
===================================================================================================	72 Mio. sequences processed
===================================================================================================	73 Mio. sequences processed
===================================================================================================	74 Mio. sequences processed
===================================================================================================	75 Mio. sequences processed
===================================================================================================	76 Mio. sequences processed
===================================================================================================	77 Mio. sequences processed
===================================================================================================	78 Mio. sequences processed
===================================================================================================	79 Mio. sequences processed
===================================================================================================	80 Mio. sequences processed
===================================================================================================	81 Mio. sequences processed
===================================================================================================	82 Mio. sequences processed
===================================================================================================	83 Mio. sequences processed
===================================================================================================	84 Mio. sequences processed
===================================================================================================	85 Mio. sequences processed
===================================================================================================	86 Mio. sequences processed
===================================================================================================	87 Mio. sequences processed
===================================================================================================	88 Mio. sequences processed
===================================================================================================	89 Mio. sequences processed
===================================================================================================	90 Mio. sequences processed
===================================================================================================	91 Mio. sequences processed
===================================================================================================	92 Mio. sequences processed
===================================================================================================	93 Mio. sequences processed
===================================================================================================	94 Mio. sequences processed
===================================================================================================	95 Mio. sequences processed
=============================
Time for merging to query_h: 0h 0m 32s 329ms
Time for merging to query: 0h 3m 16s 622ms
Database type: Nucleotide
Time for processing: 0h 27m 53s 813ms
Create directory repSEQS_eggnog.tmp/16640501639052377423/search_tmp
search repSEQS_eggnog.tmp/16640501639052377423/query databases/eggnog repSEQS_eggnog.tmp/16640501639052377423/result repSEQS_eggnog.tmp/16640501639052377423/search_tmp -a 1 --alignment-mode 3 --threads 56 -s 5.7 --split-memory-limit 300G --remove-tmp-files 0 

extractorfs repSEQS_eggnog.tmp/16640501639052377423/query repSEQS_eggnog.tmp/16640501639052377423/search_tmp/1950629703809443685/q_orfs_aa --min-length 30 --max-length 32734 --max-gaps 2147483647 --contig-start-mode 2 --contig-end-mode 2 --orf-start-mode 1 --forward-frames 1,2,3 --reverse-frames 1,2,3 --translation-table 1 --translate 1 --use-all-table-starts 0 --id-offset 0 --create-lookup 0 --threads 56 --compressed 0 -v 3 

[=================================================================] 95.29M 10m 53s 267ms
Time for merging to q_orfs_aa_h: 0h 14m 59s 800ms
Time for merging to q_orfs_aa: 0h 33m 4s 490ms
Time for processing: 1h 14m 4s 658ms
prefilter repSEQS_eggnog.tmp/16640501639052377423/search_tmp/1950629703809443685/q_orfs_aa databases/eggnog repSEQS_eggnog.tmp/16640501639052377423/search_tmp/1950629703809443685/search/pref --sub-mat 'aa:blosum62.out,nucl:nucleotide.out' --seed-sub-mat 'aa:VTML80.out,nucl:nucleotide.out' -s 5.7 -k 5 --k-score seq:2147483647,prof:2147483647 --alph-size aa:21,nucl:5 --max-seq-len 65535 --max-seqs 300 --split 0 --split-mode 2 --split-memory-limit 300G -c 0 --cov-mode 0 --comp-bias-corr 1 --comp-bias-corr-scale 1 --diag-score 1 --exact-kmer-matching 0 --mask 1 --mask-prob 0.9 --mask-lower-case 0 --min-ungapped-score 15 --add-self-matches 0 --spaced-kmer-mode 1 --db-load-mode 0 --pca substitution:1.100,context:1.400 --pcb substitution:4.100,context:5.800 --threads 56 --compressed 0 -v 3 

Query database size: 1303062545 type: Aminoacid
Estimated memory consumption: 2G
Target database size: 349750 type: Profile
Index table k-mer threshold: 82 at k-mer size 5 
Index table: counting k-mers
[=================================================================] 349.75K 1m 42s 520ms
Index table: Masked residues: 0
Index table: fill
[=================================================================] 349.75K 5m 18s 145ms
Index statistics
Entries:          14682023111
DB size:          84042 MB
Avg k-mer size:   3594.921651
Top 10 k-mers
    PPPPW	38077
    PPPWW	37617
    PPWPP	34827
    PPPGW	33942
    WWWPP	33931
    PPPDW	33516
    PPWPW	33505
    PPWRW	32205
    PWPPW	31944
    PPPQW	31811
Time for index table init: 0h 9m 20s 184ms
Process prefiltering step 1 of 1

k-mer similarity threshold: 82
Starting prefiltering scores calculation (step 1 of 1)
Query db start 1 to 1303062545
Target db start 1 to 349750
[=================================================================] 1.30B 86h 42m 2s 376ms

0.785483 k-mers per position
240012 DB matches per sequence
5731753 overflows
0 queries produce too many hits (truncated result)
269 sequences passed prefiltering per query sequence
300 median result list length
134238 sequences with 0 size result lists
Time for merging to pref: 0h 30m 15s 580ms
Time for processing: 88h 9m 11s 291ms
swapresults repSEQS_eggnog.tmp/16640501639052377423/search_tmp/1950629703809443685/q_orfs_aa databases/eggnog repSEQS_eggnog.tmp/16640501639052377423/search_tmp/1950629703809443685/search/pref repSEQS_eggnog.tmp/16640501639052377423/search_tmp/1950629703809443685/search/pref_swapped --sub-mat 'aa:blosum62.out,nucl:nucleotide.out' -e 0.001 --split-memory-limit 300G --gap-open aa:11,nucl:5 --gap-extend aa:1,nucl:2 --threads 56 --compressed 0 --db-load-mode 0 -v 3 

Computing offsets.
[=================================================================] 1.30B 2h 8m 45s 98ms

Reading results.
[=================================================================] 1.30B 5h 47m 7s 401ms

Output database: repSEQS_eggnog.tmp/16640501639052377423/search_tmp/1950629703809443685/search/pref_swapped
[=================================================================] 26.35K 11m 16s 126ms

Time for merging to pref_swapped_0: 0h 40m 12s 625ms

Reading results.
[=================================================================] 1.30B 5h 40m 43s 346ms

Output database: repSEQS_eggnog.tmp/16640501639052377423/search_tmp/1950629703809443685/search/pref_swapped
[=================================================================] 32.57K 11m 9s 696ms

Time for merging to pref_swapped_1: 0h 38m 42s 418ms

Reading results.
[=================================================================] 1.30B 5h 39m 21s 0ms

Output database: repSEQS_eggnog.tmp/16640501639052377423/search_tmp/1950629703809443685/search/pref_swapped
[=================================================================] 27.87K 11m 16s 144ms

Time for merging to pref_swapped_2: 0h 39m 55s 667ms

Reading results.
[=================================================================] 1.30B 5h 36m 38s 949ms

Output database: repSEQS_eggnog.tmp/16640501639052377423/search_tmp/1950629703809443685/search/pref_swapped
[=================================================================] 25.02K 11m 10s 765ms

Time for merging to pref_swapped_3: 0h 38m 48s 751ms

Reading results.
[=================================================================] 1.30B 5h 35m 5s 521ms

Output database: repSEQS_eggnog.tmp/16640501639052377423/search_tmp/1950629703809443685/search/pref_swapped
[=================================================================] 28.27K 11m 14s 658ms

Time for merging to pref_swapped_4: 0h 40m 16s 359ms

Reading results.
[=================================================================] 1.30B 6h 4m 24s 557ms

Output database: repSEQS_eggnog.tmp/16640501639052377423/search_tmp/1950629703809443685/search/pref_swapped
[=================================================================] 32.79K 11m 19s 893ms

Time for merging to pref_swapped_5: 0h 40m 17s 973ms

Reading results.
[=================================================================] 1.30B 6h 3m 45s 577ms

Output database: repSEQS_eggnog.tmp/16640501639052377423/search_tmp/1950629703809443685/search/pref_swapped
[=================================================================] 22.66K 11m 12s 347ms

Time for merging to pref_swapped_6: 0h 40m 8s 817ms

Reading results.
[============================
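
To see where the time went so far, the durations printed in the log above can be added up per stage. The following is a minimal sketch rather than an MMseqs2 tool: `parse_duration` and `summarize` are hypothetical helper names, the stage labels are my own, and the figures are copied by hand from the log. It leaves out the shorter "Output database" and "Time for merging" sub-steps and the last "Reading results" pass, which had not finished, so the total is only a rough lower bound.

```python
import re

# Seconds per unit for durations printed like "86h 42m 2s 376ms".
_UNIT_SECONDS = {"h": 3600.0, "m": 60.0, "s": 1.0, "ms": 0.001}
# "ms" is listed first so that "376ms" is not read as minutes.
_TOKEN = re.compile(r"(\d+)(ms|h|m|s)\b")


def parse_duration(text):
    """Convert a duration string from the log into seconds."""
    return sum(int(n) * _UNIT_SECONDS[u] for n, u in _TOKEN.findall(text))


# Durations copied from the log above; the labels are illustrative.
STAGE_TIMES = {
    "createdb": ["0h 27m 53s 813ms"],
    "extractorfs": ["1h 14m 4s 658ms"],
    "prefilter": ["88h 9m 11s 291ms"],
    "swapresults: computing offsets": ["2h 8m 45s 98ms"],
    "swapresults: reading results (7 finished passes)": [
        "5h 47m 7s 401ms",
        "5h 40m 43s 346ms",
        "5h 39m 21s 0ms",
        "5h 36m 38s 949ms",
        "5h 35m 5s 521ms",
        "6h 4m 24s 557ms",
        "6h 3m 45s 577ms",
    ],
}


def summarize(stages):
    """Return (stage, hours, share of total) tuples, slowest stage first."""
    seconds = {
        name: sum(parse_duration(d) for d in durations)
        for name, durations in stages.items()
    }
    total = sum(seconds.values())
    rows = [(name, s / 3600.0, s / total) for name, s in seconds.items()]
    return sorted(rows, key=lambda row: row[1], reverse=True)


for name, hours, share in summarize(STAGE_TIMES):
    print(f"{name:50s} {hours:7.2f} h  {share:6.1%}")
```

Sorting by duration puts the prefilter first, followed by the repeated "Reading results" passes of swapresults, so those are the two stages where a speed-up would matter most.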

Your Environment

  • Ubuntu 18.04
  • CPU platform: Intel Haswell x86/64
  • Boot disk size: 18 TB
  • 64 vCPU and 425984 MiB

mmpust, Aug 18 '22 11:08