MMSeqs taxonomy coverage value
I would like to assign a taxonomic label to my protein sequences using the blast NR database and the mmseqs taxonomy command available in the docker image (quay.io/microbiome-informatics/mmseqs:2.13). I noticed that the default coverage value is zero according to the help page.
-c FLOAT List matches above this fraction of aligned (covered) residues (see --cov-mode) [0.000]
However, if I set the -c parameter to 0.8, which sounds reasonable to me, then all of my sequences are labelled as no rank unclassified.
Full command:
mmseqs taxonomy queryDB ${MMSEQS2_DATABASE_DIR} taxresults.database tmp --lca-ranks superkingdom,phylum,class,order,family,genus,species,subspecies --threads 28
My questions are:
- Doesn't it always make sense to increase the -c parameter to reduce spurious hits?
- How can I inspect the alignment of the best hit?
- Does easy-taxonomy also use the same default lca-mode or a different one?
- I am looking through the code and seeing some bugs in how coverage works within the alignment for taxonomy. Regardless of whether this makes sense or not, it is definitely broken code-wise.
It also would not be super well defined which coverage to compute, since we do multiple alignments with the 2bLCA procedure. What is currently implemented (however broken) is that it would try to compute the coverage between the extracted subfragment of the database against the other database hits.
In the figure here (https://github.com/soedinglab/MMseqs2/wiki#the-concept-of-lca) this would be the coverage of the pink Hit 1 fragment versus Hits 2, 3 and 4. I am not sure which coverage would make the most sense to compute, and in any case it would require us to run new benchmarks.
- You need to pass --tax-output-mode 2 to also compute and store the alignments. They will be placed at taxresults.database_aln in your case.
- easy-taxonomy and taxonomy behave the same; the only difference is that the former takes FASTA input while the latter only takes MMseqs2 databases.
The main algorithmic difference depends on the input type though. With nucleotide input it will use the contig taxonomy procedure described in the MMseqs2 taxonomy paper, which includes the fast ORF-prefiltering and the taxonomy majority voting.
The ORF-prefiltering can be overaggressive for short reads. Our previous recommendation was to disable the ORF-prefiltering with --orf-filter 0 if you give it short-read input. We are currently developing a better fix in #832 that should not require messing with this parameter.
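For nucleotide short reads, the earlier recommendation could look roughly like the sketch below. The file name reads.fasta, the output prefix reads_tax and the targetDB fallback are hypothetical placeholders, not names from this thread. The helper run_or_show is only there so the script prints the command instead of failing when the tool is not installed.

```shell
#!/bin/sh
# Sketch only: reads.fasta, reads_tax and the targetDB fallback are placeholders.

# Run a command if its binary is on PATH, otherwise just print it (dry run).
run_or_show() {
    if command -v "$1" >/dev/null 2>&1; then
        "$@"
    else
        echo "would run: $*"
    fi
}

TARGET_DB="${MMSEQS2_DATABASE_DIR:-targetDB}"

# Nucleotide short reads as FASTA input, with the ORF-prefiltering disabled.
run_or_show mmseqs easy-taxonomy reads.fasta "$TARGET_DB" reads_tax tmp \
    --orf-filter 0
```

With protein input this flag should not matter, since the ORF filter is not applied there.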
For protein input, the ORF-filtering and majority voting do not happen.
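As a rough sketch of the second point (inspecting the alignments), one could rerun the command from the question with --tax-output-mode 2, convert the stored alignments to a table, and then check coverage after the fact instead of relying on -c. This assumes that taxresults.database_aln is in the usual alignment result format that convertalis reads. The output names aln.tsv and aln.qcov80.tsv, the targetDB fallback, the 0.8 threshold and the helpers run_or_show and filter_by_qcov are all made up for illustration.

```shell
#!/bin/sh
# Sketch only: database names follow the command in the question;
# aln.tsv, aln.qcov80.tsv and the targetDB fallback are placeholders.

# Run a command if its binary is on PATH, otherwise just print it (dry run).
run_or_show() {
    if command -v "$1" >/dev/null 2>&1; then
        "$@"
    else
        echo "would run: $*"
    fi
}

TARGET_DB="${MMSEQS2_DATABASE_DIR:-targetDB}"

# 1) Taxonomy assignment that also stores the alignments
#    (expected at taxresults.database_aln).
run_or_show mmseqs taxonomy queryDB "$TARGET_DB" taxresults.database tmp \
    --tax-output-mode 2 --threads 28

# 2) Convert the stored alignments to a tab-separated table that
#    includes query and target coverage columns.
run_or_show mmseqs convertalis queryDB "$TARGET_DB" taxresults.database_aln aln.tsv \
    --format-output "query,target,evalue,bits,qcov,tcov"

# 3) Keep only rows whose query coverage (column 5) reaches a threshold.
filter_by_qcov() {
    awk -F '\t' -v min="$1" '$5 + 0 >= min + 0'
}

if [ -f aln.tsv ]; then
    filter_by_qcov 0.8 < aln.tsv > aln.qcov80.tsv
fi
```

Filtering the table afterwards leaves the taxonomy run itself untouched, so the coverage issue described above does not affect the assignments; it only helps to judge how much of each query the reported hit actually covers.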