MMseqs2 icon indicating copy to clipboard operation
MMseqs2 copied to clipboard

nucleotide easy-search, query bp length issues

Open sav-che opened this issue 6 months ago • 3 comments

First of all hi, and thanks for the great tool!

I am working on blastn -task blastn-like search for barcoding data, and noticed that MMseqs2 changes behavior depending on the length of sequences in the queries. The top hits for the longer sequences have low similarity even when exact matches exist in the target DB. Meanwhile, short sequences match the right targets. But when I use --exhaustive-search, the pattern is reversed: longer sequences perform normally, while short ones get very few hits, often not the best ones.


I run MMseqs2 17-b804f on cluster. The basic command (log in the end):

mmseqs easy-search --search-type 3 -k 7 -e 1E-50 --format-mode 4 --format-output query,target,pident,alnlen,evalue,bits,qlen,tlen,qcov,mismatch,gapopen,qstart,qend,tstart,tend "${MMS_QUERY_DB}" "${MMS_TARGET_DB}" "${MMS_OUT}_k7.tsv" tmp --threads 8

The queue is 3 sets of ca. 150 sequences each, covering the same samples (fungi):

  • full region (ITS), ca. 500-650 bp
  • 2 different parts of ITS, ca. 200-350 bp each

Example (Expected hits are Allopsalliota geesterani for sample1 and Limonomyces roseipellis for sample2):

>ITS-full_sample1
CGGAAGGATCATTATTGAATGTTGCCTTGATAGGTTGTCGCTGGTCCTGTGAAAACAATGCAGGGCATGTGCACACCTGTCTTGACTTCAAATTCATCCACCTGTGCACCTTTTGTAGTCTTTTTTTCAGGGGGCTGGGGGAAAAACGGCAGCTTCTCAGGCTGTCGCTTGAAACCCTCCCACTGGAAGGCTATGTTTTTATCTACACACATACCATTTAGAATGTTACAGAATGTTGTCAATGGGCACTTGTGGCCTATATAAAAATCTATACAACTTTCAGCAACGGATCTCTTGGCTCTCGCATCGATGAAGAACGCAGCGAAATGCGATAAGTAATGTGAATTGCAGAATTCAGTGAATCATCGAATCTTTGAACGCACCTTGCGCTCCTTGGCCATTCCGAGGAGCATGCCTGTTTGAGTGTCATTATATTCTCAACCCCCCACCAGCTTTTGTGGCTGGCGCTTGGGGGCTTGGATGTGGAGGTTTTTGCTGGCCAAGTAAAGTCAGCTCCTCTGAAATGCATCAGCGGAACCGTTTGCGATCCGTCACAGGTGTGATAATTATCTACGCCAGTGGGTTGCTCTCTGTATGTTCAGCTTATAACTGTCCTGTTGTAAACGGACAAACAACTTTTGAACGTTTGACCTCAAATCAGGTAGGACTACCCGCTGAACTTAA
>ITS-full_sample2
CTTAAGTTCAGCGGGTAGTCCTACCTGATTTGAGGTCAGATGTCAAAAGAGTTGTCCATTACAGACGATTAGAAGCTTTGTATTCACTCATCACACGGTACACATTAAAGTATACACTGCCGAAGCGGTGATTATAGCTTTCGATCCATTGGTAAGACATGTATCACACCGTAGATCTAATAGCTGCGCGATCCATAGCTAATGCATTTAAGAGGAGCTGACTAGGCCAGCAAGCCTCCAAATCCAACACTACTTCGATTTTGCAATCTGGTAGGTTGAGAGATTCATGACACTCAAACAGGCATGCTCCCCGGAATACCAGGGAGCGCAAGGTGCGTTCAAAGATTCGATGATTCACTGAATTCTGCAATTCACATTACTTATCGCATTTCGCTGCGTTCTTCATCGATGCGAGAGCCAAGAGATCCGTTGTTGAAAGTTGTATTTATTGCGTTAAGCAAAGTAGACATTCTAAGACATACAAGAGTTTATAAAAAATCGAGCCGACAGAGAACGCTAGTCGAGACCTTTCGGCCCTTTCGCTAATCCCCCATTAAACTCGCATAGGTTCACAGGTGTATAGATGAGTGAAAGAGCGTGCACATCCCCCGAGAGGGCAGCGACAGCTCAGTAAACTCGTTAATGATCCTTCC
>ITS1_sample1
CGGAAGGATCATTATTGAATGTTGCCTTGATAGGTTGTCGCTGGTCCTGTGAAAACAATGCAGGGCATGTGCACACCTGTCTTGACTTCAAATTCATCCACCTGTGCACCTTTTGTAGTCTTTTTTTCAGGGGGCTGGGGGAAAAACGGCAGCTTCTCAGGCTGTCGCTTGAAACCCTCCCACTGGAAGGCTATGTTTTTATCTACACACATACCATTTAGAATGTTACAGAATGTTGTCAATGGGCACTTGTGGCCTATATAAAAATCTATACAACTTTCAGCAACGGATCTCTTGGCTCTCGCAT
>ITS1_sample2
CGGAAGGATCATTAACGAGTTTACTGAGCTGTCGCTGCCCTCTCGGGGGATGTGCACGCTCTTTCACTCATCTATACACCTGTGAACCTATGCGAGTTTAATGGGGGATTAGCGAAAGGGCCGAAAGGTCTCGACTAGCGTTCTCTGTCGGCTCGATTTTTTATAAACTCTTGTATGTCTTAGAATGTCTACTTTGCTTAACGCAATAAATACAACTTTCAACAACGGATCTCTTGGCTCTCGCAT
>ITS2_sample1
GAACGCACCTTGCGCTCCTTGGCCATTCCGAGGAGCATGCCTGTTTGAGTGTCATTATATTCTCAACCCCCCACCAGCTTTTGTGGCTGGCGCTTGGGGGCTTGGATGTGGAGGTTTTTGCTGGCCAAGTAAAGTCAGCTCCTCTGAAATGCATCAGCGGAACCGTTTGCGATCCGTCACAGGTGTGATAATTATCTACGCCAGTGGGTTGCTCTCTGTATGTTCAGCTTATAACTGTCCTGTTGTAAACGGACAAACAACTTTTGAACGTTTGACCTCAAATCAGGTAGGACTACCCGCTGAACTTAA
>ITS2_sample2
CTTAAGTTCAGCGGGTAGTCCTACCTGATTTGAGGTCAGATGTCAAAAGAGTTGTCCATTACAGACGATTAGAAGCTTTGTATTCACTCATCACACGGTACACATTAAAGTATACACTGCCGAAGCGGTGATTATAGCTTTCGATCCATTGGTAAGACATGTATCACACCGTAGATCTAATAGCTGCGCGATCCATAGCTAATGCATTTAAGAGGAGCTGACTAGGCCAGCAAGCCTCCAAATCCAACACTACTTCGATTTTGCAATCTGGTAGGTTGAGAGATTCATGACACTCAAACAGGCATGCTCCCCGGAATACCAGGGAGCGCAAGGTGCGTT

Target DB: EUKARYOME 1.9.4 General_EUK_ITS https://eukaryome.org/generalfasta/ BTW, DB contains 1570910 sequences, while MMseqs reports 1575463.


Top hit correctness for different parameters:

analysis ITS-full sample1 ITS-full sample2 ITS1 sample1 ITS1 sample2 ITS2 sample1 ITS2 sample2
-k 15(auto-chosen) miss miss ok ok ok ok
-k 7 ok miss ok ok ok ok
--exhaustive-search ok ok ok ok (fery few hits) miss (fery few hits) ok (fery few hits)

Changing --max-seqs to 1000 didn't help. In --exhaustive-search there was no difference between -k 7 or 15 (auto-chosen)

I also tried --num-iterations 3 (now I know that it is not for nucleotides) and the search died at ITS1_sample2 with "Query sequence 4 has a result with no diagonal information" - similar to #747 BTW, it is often not clear what actually has effect in nucleotide search, e.g., no warnings are thrown at -s - would be nice to improve that somehow.

Also, maybe someone could help to understand the options below? --exhaustive-search-filter - not sure what it does. Search dies if it's on. --alph-size - should it be changed if sequences include ambiguities like YRSWKM? --zdrop - cannot wrap my head around how exactly it influences the analysis. --shuffle - why this is useful? --sort-results - is 0 by default, but the results seem to be sorted anyway? --strand - is it for target DB? Query seems to be reverse-complemented by default.


Log

<PATH_OUT>/test_input_k7.tsv exists and will be overwritten easy-search --search-type 3 -k 7 -e 1E-50 --format-mode 4 --format-output query,target,pident,alnlen,evalue,bits,qlen,tlen,qcov,mismatch,gapopen,qstart,qend,tstart,tend <PATH_QUERY>/test_input.fas <PATH_DB>/EUK_ITS <PATH_OUT>/test_input_k7.tsv tmp --threads 8

MMseqs Version: GITDIR-NOTFOUND Substitution matrix aa:blosum62.out,nucl:nucleotide.out Add backtrace false Alignment mode 3 Alignment mode 0 Allow wrapped scoring false E-value threshold 1e-50 Seq. id. threshold 0 Min alignment length 0 Seq. id. mode 0 Alternative alignments 0 Coverage threshold 0 Coverage mode 0 Max sequence length 65535 Compositional bias 1 Compositional bias 1 Max reject 2147483647 Max accept 2147483647 Include identical seq. id. false Preload mode 0 Pseudo count a substitution:1.100,context:1.400 Pseudo count b substitution:4.100,context:5.800 Score bias 0 Realign hits false Realign score bias -0.2 Realign max seqs 2147483647 Correlation score weight 0 Gap open cost aa:11,nucl:5 Gap extension cost aa:1,nucl:2 Zdrop 40 Threads 8 Compressed 0 Verbosity 3 Seed substitution matrix aa:VTML80.out,nucl:nucleotide.out Sensitivity 5.7 k-mer length 7 Target search mode 0 k-score seq:2147483647,prof:2147483647 Alphabet size aa:21,nucl:5 Max results per query 300 Split database 0 Split mode 2 Split memory limit 0 Diagonal scoring true Exact k-mer matching 0 Mask residues 1 Mask residues probability 0.9 Mask lower case residues 0 Mask lower letter repeating N times 0 Minimum diagonal score 15 Selected taxa
Spaced k-mers 1 Spaced k-mer pattern
Local temporary path
Use GPU 0 Use GPU server 0 Wait for GPU server 600 Prefilter mode 0 Rescore mode 0 Remove hits by seq. id. and coverage false Sort results 0 Mask profile 1 Profile E-value threshold 0.001 Global sequence weighting false Allow deletions false Filter MSA 1 Use filter only at N seqs 0 Maximum seq. id. threshold 0.9 Minimum seq. id. 0.0 Minimum score per column -20 Minimum coverage 0 Select N most diverse seqs 1000 Pseudo count mode 0 Profile output mode 0 Min codons in orf 30 Max codons in length 32734 Max orf gaps 2147483647 Contig start mode 2 Contig end mode 2 Orf start mode 1 Forward frames 1,2,3 Reverse frames 1,2,3 Translation table 1 Translate orf 0 Use all table starts false Offset of numeric ids 0 Create lookup 0 Overlap between sequences 0 Sequence split mode 1 Header split mode 0 Chain overlapping alignments 0 Merge query 1 Search type 3 Search iterations 1 Start sensitivity 4 Search steps 1 Exhaustive search mode false Filter results during exhaustive search 0 Strand selection 1 LCA search mode false Disk space limit 0 MPI runner
Force restart with latest tmp false Remove temporary files true Translation mode 0 Alignment format 4 Format alignment output query,target,pident,alnlen,evalue,bits,qlen,tlen,qcov,mismatch,gapopen,qstart,qend,tstart,tend Database output false Overlap threshold 0 Database type 0 Shuffle input database true Createdb mode 0 Write lookup file 0 Greedy best hits false

createdb <PATH_QUERY>/test_input.fas tmp/13753214276772974164/query --dbtype 0 --shuffle 0 --createdb-mode 0 --write-lookup 0 --id-offset 0 --compressed 0 -v 3

Converting sequences

Time for merging to query_h: 0h 0m 0s 1ms Time for merging to query: 0h 0m 0s 1ms Database type: Nucleotide Time for processing: 0h 0m 0s 57ms Create directory tmp/13753214276772974164/search_tmp search tmp/13753214276772974164/query <PATH_DB>/EUK_ITS tmp/13753214276772974164/result tmp/13753214276772974164/search_tmp --alignment-mode 3 -e 1e-50 --threads 8 -s 5.7 -k 7 --search-type 3 --remove-tmp-files 1

splitsequence <PATH_DB>/EUK_ITS tmp/13753214276772974164/search_tmp/12199762753845478940/target_seqs_split --max-seq-len 10000 --sequence-overlap 0 --sequence-split-mode 1 --headers-split-mode 0 --create-lookup 0 --threads 8 --compressed 0 -v 3

[=================================================================] 100.00% 1.57M 5s 517ms
Time for merging to target_seqs_split_h: 0h 0m 0s 396ms Time for merging to target_seqs_split: 0h 0m 0s 409ms Time for processing: 0h 0m 7s 304ms extractframes tmp/13753214276772974164/query tmp/13753214276772974164/search_tmp/12199762753845478940/query_seqs --forward-frames 1 --reverse-frames 1 --translation-table 1 --translate 0 --create-lookup 0 --threads 8 --compressed 0 -v 3

[=================================================================] 100.00% 6 0s 1ms
Time for merging to query_seqs_h: 0h 0m 0s 12ms ] 40.00% 3 eta 0s
Time for merging to query_seqs: 0h 0m 0s 12ms Time for processing: 0h 0m 0s 52ms splitsequence tmp/13753214276772974164/search_tmp/12199762753845478940/query_seqs tmp/13753214276772974164/search_tmp/12199762753845478940/query_seqs_split --max-seq-len 10000 --sequence-overlap 0 --sequence-split-mode 1 --headers-split-mode 0 --create-lookup 0 --threads 8 --compressed 0 -v 3

Time for processing: 0h 0m 0s 4ms prefilter tmp/13753214276772974164/search_tmp/12199762753845478940/query_seqs_split tmp/13753214276772974164/search_tmp/12199762753845478940/target_seqs_split tmp/13753214276772974164/search_tmp/12199762753845478940/search/pref_0 --sub-mat 'aa:blosum62.out,nucl:nucleotide.out' --seed-sub-mat 'aa:VTML80.out,nucl:nucleotide.out' -k 7 --target-search-mode 0 --k-score seq:2147483647,prof:2147483647 --alph-size aa:21,nucl:5 --max-seq-len 10000 --max-seqs 300 --split 0 --split-mode 2 --split-memory-limit 0 -c 0 --cov-mode 0 --comp-bias-corr 1 --comp-bias-corr-scale 1 --diag-score 1 --exact-kmer-matching 1 --mask 1 --mask-prob 0.9 --mask-lower-case 0 --mask-n-repeat 0 --min-ungapped-score 15 --add-self-matches 0 --spaced-kmer-mode 1 --db-load-mode 0 --pca substitution:1.100,context:1.400 --pcb substitution:4.100,context:5.800 --threads 8 --compressed 0 -v 3 -s 5.7

Query database size: 12 type: Nucleotide Estimated memory consumption: 11G Target database size: 1575463 type: Nucleotide Index table k-mer threshold: 0 at k-mer size 7 Index table: counting k-mers [=================================================================] 100.00% 1.58M 48s 855ms
Index table: Masked residues: 9718991 Index table: fill [=================================================================] 100.00% 1.58M 57s 194ms
Index statistics Entries: 1549630265 DB size: 8867 MB Avg k-mer size: 94581.925354 Top 10 k-mers GAGAGCG 1451935 AAACGGC 1382555 GGTTCGG 1370954 CACATAG 1230604 GTATTAG 1198666 GAAGTTG 1168961 TCATCTG 1101689 GCGGATG 1094306 TTACAGG 1089495 CCGGGGG 1075647 Time for index table init: 0h 1m 54s 160ms Process prefiltering step 1 of 1

k-mer similarity threshold: 0 Starting prefiltering scores calculation (step 1 of 1) Query db start 1 to 12 Target db start 1 to 1575463 [=================================================================] 100.00% 12 0s 892ms

0.972497 k-mers per position 69250224 DB matches per sequence 12 overflows 300 sequences passed prefiltering per query sequence 300 median result list length 0 sequences with 0 size result lists Time for merging to pref_0: 0h 0m 0s 10ms Time for processing: 0h 1m 58s 563ms align tmp/13753214276772974164/search_tmp/12199762753845478940/query_seqs_split tmp/13753214276772974164/search_tmp/12199762753845478940/target_seqs_split tmp/13753214276772974164/search_tmp/12199762753845478940/search/pref_0 tmp/13753214276772974164/search_tmp/12199762753845478940/aln --sub-mat 'aa:blosum62.out,nucl:nucleotide.out' -a 0 --alignment-mode 3 --alignment-output-mode 0 --wrapped-scoring 0 -e 1e-50 --min-seq-id 0 --min-aln-len 0 --seq-id-mode 0 --alt-ali 0 -c 0 --cov-mode 0 --max-seq-len 10000 --comp-bias-corr 1 --comp-bias-corr-scale 1 --max-rejected 2147483647 --max-accept 2147483647 --add-self-matches 0 --db-load-mode 0 --pca substitution:1.100,context:1.400 --pcb substitution:4.100,context:5.800 --score-bias 0 --realign 0 --realign-score-bias -0.2 --realign-max-seqs 2147483647 --corr-score-weight 0 --gap-open aa:11,nucl:5 --gap-extend aa:1,nucl:2 --zdrop 40 --threads 8 --compressed 0 -v 3

Compute score, coverage and sequence identity Query database size: 12 type: Nucleotide Target database size: 1575463 type: Nucleotide Calculation of alignments [=================================================================] 100.00% 12 0s 82ms
Time for merging to aln: 0h 0m 0s 11ms 3600 alignments calculated 1535 sequence pairs passed the thresholds (0.426389 of overall calculated) 127.916664 hits per query sequence Time for processing: 0h 0m 0s 543ms rmdb tmp/13753214276772974164/search_tmp/12199762753845478940/search/pref_0 -v 3

Time for processing: 0h 0m 0s 2ms rmdb tmp/13753214276772974164/search_tmp/12199762753845478940/search/aln_0 -v 3

Time for processing: 0h 0m 0s 0ms rmdb tmp/13753214276772974164/search_tmp/12199762753845478940/search/input_0 -v 3

Time for processing: 0h 0m 0s 0ms rmdb tmp/13753214276772974164/search_tmp/12199762753845478940/search/aln_merge -v 3

Time for processing: 0h 0m 0s 0ms offsetalignment tmp/13753214276772974164/query tmp/13753214276772974164/search_tmp/12199762753845478940/query_seqs_split <PATH_DB>/EUK_ITS tmp/13753214276772974164/search_tmp/12199762753845478940/target_seqs_split tmp/13753214276772974164/search_tmp/12199762753845478940/aln tmp/13753214276772974164/result --chain-alignments 0 --merge-query 1 --search-type 3 --threads 8 --compressed 0 --db-load-mode 0 -v 3

Computing ORF lookup Computing contig offsets Computing contig lookup Time for contig lookup: 0h 0m 0s 0ms Writing results to: tmp/13753214276772974164/result [=================================================================] 100.00% 6 0s 2ms

Time for merging to result: 0h 0m 0s 7ms Time for processing: 0h 0m 0s 144ms rmdb tmp/13753214276772974164/search_tmp/12199762753845478940/q_orfs -v 3

Time for processing: 0h 0m 0s 0ms rmdb tmp/13753214276772974164/search_tmp/12199762753845478940/q_orfs_aa -v 3

Time for processing: 0h 0m 0s 0ms rmdb tmp/13753214276772974164/search_tmp/12199762753845478940/t_orfs -v 3

Time for processing: 0h 0m 0s 0ms rmdb tmp/13753214276772974164/search_tmp/12199762753845478940/t_orfs_aa -v 3

Time for processing: 0h 0m 0s 0ms <PATH_OUT>/test_input_k7.tsv exists and will be overwritten convertalis tmp/13753214276772974164/query <PATH_DB>/EUK_ITS tmp/13753214276772974164/result <PATH_OUT>/test_input_k7.tsv --sub-mat 'aa:blosum62.out,nucl:nucleotide.out' --format-mode 4 --format-output query,target,pident,alnlen,evalue,bits,qlen,tlen,qcov,mismatch,gapopen,qstart,qend,tstart,tend --translation-table 1 --gap-open aa:11,nucl:5 --gap-extend aa:1,nucl:2 --db-output 0 --db-load-mode 0 --search-type 3 --threads 8 --compressed 0 -v 3

[=================================================================] 100.00% 6 0s 4ms
Time for merging to test_input_k7.tsv: 0h 0m 0s 10ms Time for processing: 0h 0m 0s 194ms rmdb tmp/13753214276772974164/result -v 3

Time for processing: 0h 0m 0s 2ms rmdb tmp/13753214276772974164/query -v 3

Time for processing: 0h 0m 0s 1ms rmdb tmp/13753214276772974164/query_h -v 3

Time for processing: 0h 0m 0s 0ms

sav-che avatar Jul 21 '25 20:07 sav-che

I don't think --exhaustive-search is supposed to work for nucleotides at all. It's likely doing a blastx-like search internally and then doing a profile search on the translated proteins.

I think the most likely issue is --max-seqs, the maximum number that the prefilter passes on to the alignment module. Especially in highly redundant databases (such as a blastx-like translated protein database) this will cause missing many hits.

milot-mirdita avatar Jul 29 '25 18:07 milot-mirdita

Now I see why --exhaustive-search works better - it swaps query and target. And you are right, high --max-seqs somewhere between 50 000 and 100 000 started to pick up good matches. Is it possible ballpark optimal --max-seqs value from the data? Finding it by iterations is costly.

sav-che avatar Jul 30 '25 20:07 sav-che

I followed #901 and --spaced-kmer-mode 0 did the trick for nucleotides; the rest of params were default (-k 15 --max-seqs 300).

sav-che avatar Aug 01 '25 10:08 sav-che