All-vs-All alignments with fake prefilter give unexpected sequence identities
Hi,
I'm using MMseqs2 for all-vs-all alignments. As indicated in the user guide pdf, I'm using the bash script to perform a fake prefiltering step before aligning. That seems to work, as all alignments are computed, but the results are a bit unexpected:
>Q8CD91
EERVVHWYFKLLDKNSSGDIGKKEIKPFKRFLRKKSKPKKCVKKFVEYCDMNNDKSITVQELMGCLGVTREEGKANTRKRHTPRGNAESSSSNRQPRKQG
>Q95JC9
GPPPPGPAPPGARPPPGPPPPGPPPPGPAPPGARPPPGPPPPGPPPPGPAPPGARPPPGPPPPGPPPPGPAPPGARPPPGPPPPPPGPSPPRPPPGPPPQ
I didn't change much of the defaults
mmseqs createdb --dbtype 1 dummyfasta.fasta seq_db
mmseqs_fake_prefilter.sh seq_db seq_db pref
mmseqs align seq_db seq_db pref align_db --alignment-mode 3 -e inf --min-seq-id 0.3
mmseqs convertalis seq_db seq_db align_db check_alignments.tab
query target fident alnlen mismatch gapopen qstart qend tstart tend evalue bits
Q8CD91 Q8CD91 1.000 100 0 0 1 100 1 100 1.707E-71 202
Q8CD91 Q95JC9 1.000 2 0 0 83 84 91 92 4.610E+01 9
Q95JC9 Q95JC9 1.000 100 0 0 1 100 1 100 7.088E-35 108
Q95JC9 Q8CD91 0.333 3 1 0 11 13 21 23 1.508E+02 7
The second alignment's fident makes no sense to me. I understand that the e-value is high too, but I really want all pairwise % identities returned so I don't use it for filtering. Do I need to use a different configuration to make this work, or is mmseqs2 just not suitable for this all-vs-all identity task?
That seems about right? It aligned two residues successfully (from 83 to 84). You might want to demand some coverage thresholds (-c/-cov-mode) or a minimum aln length threshold (--min-aln-len).
Thanks for pointing out this was correct! I did not realize that this was just a denominator issue. --seq-id-mode was what I needed to change, as the default does not make much sense for this application.