nucleotide easy-search, query bp length issues
First of all hi, and thanks for the great tool!
I am working on blastn -task blastn-like search for barcoding data, and noticed that MMseqs2 changes behavior depending on the length of sequences in the queries. The top hits for the longer sequences have low similarity even when exact matches exist in the target DB. Meanwhile, short sequences match the right targets. But when I use --exhaustive-search, the pattern is reversed: longer sequences perform normally, while short ones get very few hits, often not the best ones.
I run MMseqs2 17-b804f on cluster. The basic command (log in the end):
mmseqs easy-search --search-type 3 -k 7 -e 1E-50 --format-mode 4 --format-output query,target,pident,alnlen,evalue,bits,qlen,tlen,qcov,mismatch,gapopen,qstart,qend,tstart,tend "${MMS_QUERY_DB}" "${MMS_TARGET_DB}" "${MMS_OUT}_k7.tsv" tmp --threads 8
The queue is 3 sets of ca. 150 sequences each, covering the same samples (fungi):
- full region (ITS), ca. 500-650 bp
- 2 different parts of ITS, ca. 200-350 bp each
Example (Expected hits are Allopsalliota geesterani for sample1 and Limonomyces roseipellis for sample2):
>ITS-full_sample1
CGGAAGGATCATTATTGAATGTTGCCTTGATAGGTTGTCGCTGGTCCTGTGAAAACAATGCAGGGCATGTGCACACCTGTCTTGACTTCAAATTCATCCACCTGTGCACCTTTTGTAGTCTTTTTTTCAGGGGGCTGGGGGAAAAACGGCAGCTTCTCAGGCTGTCGCTTGAAACCCTCCCACTGGAAGGCTATGTTTTTATCTACACACATACCATTTAGAATGTTACAGAATGTTGTCAATGGGCACTTGTGGCCTATATAAAAATCTATACAACTTTCAGCAACGGATCTCTTGGCTCTCGCATCGATGAAGAACGCAGCGAAATGCGATAAGTAATGTGAATTGCAGAATTCAGTGAATCATCGAATCTTTGAACGCACCTTGCGCTCCTTGGCCATTCCGAGGAGCATGCCTGTTTGAGTGTCATTATATTCTCAACCCCCCACCAGCTTTTGTGGCTGGCGCTTGGGGGCTTGGATGTGGAGGTTTTTGCTGGCCAAGTAAAGTCAGCTCCTCTGAAATGCATCAGCGGAACCGTTTGCGATCCGTCACAGGTGTGATAATTATCTACGCCAGTGGGTTGCTCTCTGTATGTTCAGCTTATAACTGTCCTGTTGTAAACGGACAAACAACTTTTGAACGTTTGACCTCAAATCAGGTAGGACTACCCGCTGAACTTAA
>ITS-full_sample2
CTTAAGTTCAGCGGGTAGTCCTACCTGATTTGAGGTCAGATGTCAAAAGAGTTGTCCATTACAGACGATTAGAAGCTTTGTATTCACTCATCACACGGTACACATTAAAGTATACACTGCCGAAGCGGTGATTATAGCTTTCGATCCATTGGTAAGACATGTATCACACCGTAGATCTAATAGCTGCGCGATCCATAGCTAATGCATTTAAGAGGAGCTGACTAGGCCAGCAAGCCTCCAAATCCAACACTACTTCGATTTTGCAATCTGGTAGGTTGAGAGATTCATGACACTCAAACAGGCATGCTCCCCGGAATACCAGGGAGCGCAAGGTGCGTTCAAAGATTCGATGATTCACTGAATTCTGCAATTCACATTACTTATCGCATTTCGCTGCGTTCTTCATCGATGCGAGAGCCAAGAGATCCGTTGTTGAAAGTTGTATTTATTGCGTTAAGCAAAGTAGACATTCTAAGACATACAAGAGTTTATAAAAAATCGAGCCGACAGAGAACGCTAGTCGAGACCTTTCGGCCCTTTCGCTAATCCCCCATTAAACTCGCATAGGTTCACAGGTGTATAGATGAGTGAAAGAGCGTGCACATCCCCCGAGAGGGCAGCGACAGCTCAGTAAACTCGTTAATGATCCTTCC
>ITS1_sample1
CGGAAGGATCATTATTGAATGTTGCCTTGATAGGTTGTCGCTGGTCCTGTGAAAACAATGCAGGGCATGTGCACACCTGTCTTGACTTCAAATTCATCCACCTGTGCACCTTTTGTAGTCTTTTTTTCAGGGGGCTGGGGGAAAAACGGCAGCTTCTCAGGCTGTCGCTTGAAACCCTCCCACTGGAAGGCTATGTTTTTATCTACACACATACCATTTAGAATGTTACAGAATGTTGTCAATGGGCACTTGTGGCCTATATAAAAATCTATACAACTTTCAGCAACGGATCTCTTGGCTCTCGCAT
>ITS1_sample2
CGGAAGGATCATTAACGAGTTTACTGAGCTGTCGCTGCCCTCTCGGGGGATGTGCACGCTCTTTCACTCATCTATACACCTGTGAACCTATGCGAGTTTAATGGGGGATTAGCGAAAGGGCCGAAAGGTCTCGACTAGCGTTCTCTGTCGGCTCGATTTTTTATAAACTCTTGTATGTCTTAGAATGTCTACTTTGCTTAACGCAATAAATACAACTTTCAACAACGGATCTCTTGGCTCTCGCAT
>ITS2_sample1
GAACGCACCTTGCGCTCCTTGGCCATTCCGAGGAGCATGCCTGTTTGAGTGTCATTATATTCTCAACCCCCCACCAGCTTTTGTGGCTGGCGCTTGGGGGCTTGGATGTGGAGGTTTTTGCTGGCCAAGTAAAGTCAGCTCCTCTGAAATGCATCAGCGGAACCGTTTGCGATCCGTCACAGGTGTGATAATTATCTACGCCAGTGGGTTGCTCTCTGTATGTTCAGCTTATAACTGTCCTGTTGTAAACGGACAAACAACTTTTGAACGTTTGACCTCAAATCAGGTAGGACTACCCGCTGAACTTAA
>ITS2_sample2
CTTAAGTTCAGCGGGTAGTCCTACCTGATTTGAGGTCAGATGTCAAAAGAGTTGTCCATTACAGACGATTAGAAGCTTTGTATTCACTCATCACACGGTACACATTAAAGTATACACTGCCGAAGCGGTGATTATAGCTTTCGATCCATTGGTAAGACATGTATCACACCGTAGATCTAATAGCTGCGCGATCCATAGCTAATGCATTTAAGAGGAGCTGACTAGGCCAGCAAGCCTCCAAATCCAACACTACTTCGATTTTGCAATCTGGTAGGTTGAGAGATTCATGACACTCAAACAGGCATGCTCCCCGGAATACCAGGGAGCGCAAGGTGCGTT
Target DB: EUKARYOME 1.9.4 General_EUK_ITS https://eukaryome.org/generalfasta/ BTW, DB contains 1570910 sequences, while MMseqs reports 1575463.
Top hit correctness for different parameters:
| analysis | ITS-full sample1 | ITS-full sample2 | ITS1 sample1 | ITS1 sample2 | ITS2 sample1 | ITS2 sample2 |
|---|---|---|---|---|---|---|
-k 15(auto-chosen) |
miss | miss | ok | ok | ok | ok |
-k 7 |
ok | miss | ok | ok | ok | ok |
--exhaustive-search |
ok | ok | ok | ok (fery few hits) | miss (fery few hits) | ok (fery few hits) |
Changing --max-seqs to 1000 didn't help.
In --exhaustive-search there was no difference between -k 7 or 15 (auto-chosen)
I also tried --num-iterations 3 (now I know that it is not for nucleotides) and the search died at ITS1_sample2 with "Query sequence 4 has a result with no diagonal information" - similar to #747
BTW, it is often not clear what actually has effect in nucleotide search, e.g., no warnings are thrown at -s - would be nice to improve that somehow.
Also, maybe someone could help to understand the options below?
--exhaustive-search-filter - not sure what it does. Search dies if it's on.
--alph-size - should it be changed if sequences include ambiguities like YRSWKM?
--zdrop - cannot wrap my head around how exactly it influences the analysis.
--shuffle - why this is useful?
--sort-results - is 0 by default, but the results seem to be sorted anyway?
--strand - is it for target DB? Query seems to be reverse-complemented by default.
Log
<PATH_OUT>/test_input_k7.tsv exists and will be overwritten easy-search --search-type 3 -k 7 -e 1E-50 --format-mode 4 --format-output query,target,pident,alnlen,evalue,bits,qlen,tlen,qcov,mismatch,gapopen,qstart,qend,tstart,tend <PATH_QUERY>/test_input.fas <PATH_DB>/EUK_ITS <PATH_OUT>/test_input_k7.tsv tmp --threads 8
MMseqs Version: GITDIR-NOTFOUND
Substitution matrix aa:blosum62.out,nucl:nucleotide.out
Add backtrace false
Alignment mode 3
Alignment mode 0
Allow wrapped scoring false
E-value threshold 1e-50
Seq. id. threshold 0
Min alignment length 0
Seq. id. mode 0
Alternative alignments 0
Coverage threshold 0
Coverage mode 0
Max sequence length 65535
Compositional bias 1
Compositional bias 1
Max reject 2147483647
Max accept 2147483647
Include identical seq. id. false
Preload mode 0
Pseudo count a substitution:1.100,context:1.400
Pseudo count b substitution:4.100,context:5.800
Score bias 0
Realign hits false
Realign score bias -0.2
Realign max seqs 2147483647
Correlation score weight 0
Gap open cost aa:11,nucl:5
Gap extension cost aa:1,nucl:2
Zdrop 40
Threads 8
Compressed 0
Verbosity 3
Seed substitution matrix aa:VTML80.out,nucl:nucleotide.out
Sensitivity 5.7
k-mer length 7
Target search mode 0
k-score seq:2147483647,prof:2147483647
Alphabet size aa:21,nucl:5
Max results per query 300
Split database 0
Split mode 2
Split memory limit 0
Diagonal scoring true
Exact k-mer matching 0
Mask residues 1
Mask residues probability 0.9
Mask lower case residues 0
Mask lower letter repeating N times 0
Minimum diagonal score 15
Selected taxa
Spaced k-mers 1
Spaced k-mer pattern
Local temporary path
Use GPU 0
Use GPU server 0
Wait for GPU server 600
Prefilter mode 0
Rescore mode 0
Remove hits by seq. id. and coverage false
Sort results 0
Mask profile 1
Profile E-value threshold 0.001
Global sequence weighting false
Allow deletions false
Filter MSA 1
Use filter only at N seqs 0
Maximum seq. id. threshold 0.9
Minimum seq. id. 0.0
Minimum score per column -20
Minimum coverage 0
Select N most diverse seqs 1000
Pseudo count mode 0
Profile output mode 0
Min codons in orf 30
Max codons in length 32734
Max orf gaps 2147483647
Contig start mode 2
Contig end mode 2
Orf start mode 1
Forward frames 1,2,3
Reverse frames 1,2,3
Translation table 1
Translate orf 0
Use all table starts false
Offset of numeric ids 0
Create lookup 0
Overlap between sequences 0
Sequence split mode 1
Header split mode 0
Chain overlapping alignments 0
Merge query 1
Search type 3
Search iterations 1
Start sensitivity 4
Search steps 1
Exhaustive search mode false
Filter results during exhaustive search 0
Strand selection 1
LCA search mode false
Disk space limit 0
MPI runner
Force restart with latest tmp false
Remove temporary files true
Translation mode 0
Alignment format 4
Format alignment output query,target,pident,alnlen,evalue,bits,qlen,tlen,qcov,mismatch,gapopen,qstart,qend,tstart,tend
Database output false
Overlap threshold 0
Database type 0
Shuffle input database true
Createdb mode 0
Write lookup file 0
Greedy best hits false
createdb <PATH_QUERY>/test_input.fas tmp/13753214276772974164/query --dbtype 0 --shuffle 0 --createdb-mode 0 --write-lookup 0 --id-offset 0 --compressed 0 -v 3
Converting sequences
Time for merging to query_h: 0h 0m 0s 1ms Time for merging to query: 0h 0m 0s 1ms Database type: Nucleotide Time for processing: 0h 0m 0s 57ms Create directory tmp/13753214276772974164/search_tmp search tmp/13753214276772974164/query <PATH_DB>/EUK_ITS tmp/13753214276772974164/result tmp/13753214276772974164/search_tmp --alignment-mode 3 -e 1e-50 --threads 8 -s 5.7 -k 7 --search-type 3 --remove-tmp-files 1
splitsequence <PATH_DB>/EUK_ITS tmp/13753214276772974164/search_tmp/12199762753845478940/target_seqs_split --max-seq-len 10000 --sequence-overlap 0 --sequence-split-mode 1 --headers-split-mode 0 --create-lookup 0 --threads 8 --compressed 0 -v 3
[=================================================================] 100.00% 1.57M 5s 517ms
Time for merging to target_seqs_split_h: 0h 0m 0s 396ms
Time for merging to target_seqs_split: 0h 0m 0s 409ms
Time for processing: 0h 0m 7s 304ms
extractframes tmp/13753214276772974164/query tmp/13753214276772974164/search_tmp/12199762753845478940/query_seqs --forward-frames 1 --reverse-frames 1 --translation-table 1 --translate 0 --create-lookup 0 --threads 8 --compressed 0 -v 3
[=================================================================] 100.00% 6 0s 1ms
Time for merging to query_seqs_h: 0h 0m 0s 12ms ] 40.00% 3 eta 0s
Time for merging to query_seqs: 0h 0m 0s 12ms
Time for processing: 0h 0m 0s 52ms
splitsequence tmp/13753214276772974164/search_tmp/12199762753845478940/query_seqs tmp/13753214276772974164/search_tmp/12199762753845478940/query_seqs_split --max-seq-len 10000 --sequence-overlap 0 --sequence-split-mode 1 --headers-split-mode 0 --create-lookup 0 --threads 8 --compressed 0 -v 3
Time for processing: 0h 0m 0s 4ms prefilter tmp/13753214276772974164/search_tmp/12199762753845478940/query_seqs_split tmp/13753214276772974164/search_tmp/12199762753845478940/target_seqs_split tmp/13753214276772974164/search_tmp/12199762753845478940/search/pref_0 --sub-mat 'aa:blosum62.out,nucl:nucleotide.out' --seed-sub-mat 'aa:VTML80.out,nucl:nucleotide.out' -k 7 --target-search-mode 0 --k-score seq:2147483647,prof:2147483647 --alph-size aa:21,nucl:5 --max-seq-len 10000 --max-seqs 300 --split 0 --split-mode 2 --split-memory-limit 0 -c 0 --cov-mode 0 --comp-bias-corr 1 --comp-bias-corr-scale 1 --diag-score 1 --exact-kmer-matching 1 --mask 1 --mask-prob 0.9 --mask-lower-case 0 --mask-n-repeat 0 --min-ungapped-score 15 --add-self-matches 0 --spaced-kmer-mode 1 --db-load-mode 0 --pca substitution:1.100,context:1.400 --pcb substitution:4.100,context:5.800 --threads 8 --compressed 0 -v 3 -s 5.7
Query database size: 12 type: Nucleotide
Estimated memory consumption: 11G
Target database size: 1575463 type: Nucleotide
Index table k-mer threshold: 0 at k-mer size 7
Index table: counting k-mers
[=================================================================] 100.00% 1.58M 48s 855ms
Index table: Masked residues: 9718991
Index table: fill
[=================================================================] 100.00% 1.58M 57s 194ms
Index statistics
Entries: 1549630265
DB size: 8867 MB
Avg k-mer size: 94581.925354
Top 10 k-mers
GAGAGCG 1451935
AAACGGC 1382555
GGTTCGG 1370954
CACATAG 1230604
GTATTAG 1198666
GAAGTTG 1168961
TCATCTG 1101689
GCGGATG 1094306
TTACAGG 1089495
CCGGGGG 1075647
Time for index table init: 0h 1m 54s 160ms
Process prefiltering step 1 of 1
k-mer similarity threshold: 0 Starting prefiltering scores calculation (step 1 of 1) Query db start 1 to 12 Target db start 1 to 1575463 [=================================================================] 100.00% 12 0s 892ms
0.972497 k-mers per position 69250224 DB matches per sequence 12 overflows 300 sequences passed prefiltering per query sequence 300 median result list length 0 sequences with 0 size result lists Time for merging to pref_0: 0h 0m 0s 10ms Time for processing: 0h 1m 58s 563ms align tmp/13753214276772974164/search_tmp/12199762753845478940/query_seqs_split tmp/13753214276772974164/search_tmp/12199762753845478940/target_seqs_split tmp/13753214276772974164/search_tmp/12199762753845478940/search/pref_0 tmp/13753214276772974164/search_tmp/12199762753845478940/aln --sub-mat 'aa:blosum62.out,nucl:nucleotide.out' -a 0 --alignment-mode 3 --alignment-output-mode 0 --wrapped-scoring 0 -e 1e-50 --min-seq-id 0 --min-aln-len 0 --seq-id-mode 0 --alt-ali 0 -c 0 --cov-mode 0 --max-seq-len 10000 --comp-bias-corr 1 --comp-bias-corr-scale 1 --max-rejected 2147483647 --max-accept 2147483647 --add-self-matches 0 --db-load-mode 0 --pca substitution:1.100,context:1.400 --pcb substitution:4.100,context:5.800 --score-bias 0 --realign 0 --realign-score-bias -0.2 --realign-max-seqs 2147483647 --corr-score-weight 0 --gap-open aa:11,nucl:5 --gap-extend aa:1,nucl:2 --zdrop 40 --threads 8 --compressed 0 -v 3
Compute score, coverage and sequence identity
Query database size: 12 type: Nucleotide
Target database size: 1575463 type: Nucleotide
Calculation of alignments
[=================================================================] 100.00% 12 0s 82ms
Time for merging to aln: 0h 0m 0s 11ms
3600 alignments calculated
1535 sequence pairs passed the thresholds (0.426389 of overall calculated)
127.916664 hits per query sequence
Time for processing: 0h 0m 0s 543ms
rmdb tmp/13753214276772974164/search_tmp/12199762753845478940/search/pref_0 -v 3
Time for processing: 0h 0m 0s 2ms rmdb tmp/13753214276772974164/search_tmp/12199762753845478940/search/aln_0 -v 3
Time for processing: 0h 0m 0s 0ms rmdb tmp/13753214276772974164/search_tmp/12199762753845478940/search/input_0 -v 3
Time for processing: 0h 0m 0s 0ms rmdb tmp/13753214276772974164/search_tmp/12199762753845478940/search/aln_merge -v 3
Time for processing: 0h 0m 0s 0ms offsetalignment tmp/13753214276772974164/query tmp/13753214276772974164/search_tmp/12199762753845478940/query_seqs_split <PATH_DB>/EUK_ITS tmp/13753214276772974164/search_tmp/12199762753845478940/target_seqs_split tmp/13753214276772974164/search_tmp/12199762753845478940/aln tmp/13753214276772974164/result --chain-alignments 0 --merge-query 1 --search-type 3 --threads 8 --compressed 0 --db-load-mode 0 -v 3
Computing ORF lookup Computing contig offsets Computing contig lookup Time for contig lookup: 0h 0m 0s 0ms Writing results to: tmp/13753214276772974164/result [=================================================================] 100.00% 6 0s 2ms
Time for merging to result: 0h 0m 0s 7ms Time for processing: 0h 0m 0s 144ms rmdb tmp/13753214276772974164/search_tmp/12199762753845478940/q_orfs -v 3
Time for processing: 0h 0m 0s 0ms rmdb tmp/13753214276772974164/search_tmp/12199762753845478940/q_orfs_aa -v 3
Time for processing: 0h 0m 0s 0ms rmdb tmp/13753214276772974164/search_tmp/12199762753845478940/t_orfs -v 3
Time for processing: 0h 0m 0s 0ms rmdb tmp/13753214276772974164/search_tmp/12199762753845478940/t_orfs_aa -v 3
Time for processing: 0h 0m 0s 0ms <PATH_OUT>/test_input_k7.tsv exists and will be overwritten convertalis tmp/13753214276772974164/query <PATH_DB>/EUK_ITS tmp/13753214276772974164/result <PATH_OUT>/test_input_k7.tsv --sub-mat 'aa:blosum62.out,nucl:nucleotide.out' --format-mode 4 --format-output query,target,pident,alnlen,evalue,bits,qlen,tlen,qcov,mismatch,gapopen,qstart,qend,tstart,tend --translation-table 1 --gap-open aa:11,nucl:5 --gap-extend aa:1,nucl:2 --db-output 0 --db-load-mode 0 --search-type 3 --threads 8 --compressed 0 -v 3
[=================================================================] 100.00% 6 0s 4ms
Time for merging to test_input_k7.tsv: 0h 0m 0s 10ms
Time for processing: 0h 0m 0s 194ms
rmdb tmp/13753214276772974164/result -v 3
Time for processing: 0h 0m 0s 2ms rmdb tmp/13753214276772974164/query -v 3
Time for processing: 0h 0m 0s 1ms rmdb tmp/13753214276772974164/query_h -v 3
Time for processing: 0h 0m 0s 0ms
I don't think --exhaustive-search is supposed to work for nucleotides at all. It's likely doing a blastx-like search internally and then doing a profile search on the translated proteins.
I think the most likely issue is --max-seqs, the maximum number that the prefilter passes on to the alignment module. Especially in highly redundant databases (such as a blastx-like translated protein database) this will cause missing many hits.
Now I see why --exhaustive-search works better - it swaps query and target. And you are right, high --max-seqs somewhere between 50 000 and 100 000 started to pick up good matches. Is it possible ballpark optimal --max-seqs value from the data? Finding it by iterations is costly.
I followed #901 and --spaced-kmer-mode 0 did the trick for nucleotides; the rest of params were default (-k 15 --max-seqs 300).