MMSeqs taxonomy coverage value

Open pbelmann opened this issue 2 years ago • 1 comments

I would like to assign taxonomic labels to my protein sequences using the BLAST NR database and the mmseqs taxonomy command available in the Docker image (quay.io/microbiome-informatics/mmseqs:2.13). According to the help page, the default coverage value is zero:

-c FLOAT List matches above this fraction of aligned (covered) residues (see --cov-mode) [0.000]

However, if I set the -c parameter to 0.8, which sounds reasonable to me, then all of my sequences are labelled as no rank unclassified.

Full command:

mmseqs taxonomy queryDB ${MMSEQS2_DATABASE_DIR} taxresults.database tmp  --lca-ranks superkingdom,phylum,class,order,family,genus,species,subspecies   --threads 28

My questions are:

Doesn't it always make sense to increase the -c parameter to reduce spurious hits?
How can I inspect the alignment of the best hit?
Does easy-taxonomy also use the same default lca-mode or a different one?

Sep 13 '23 14:09 pbelmann

I am looking through the code and seeing some bugs in how coverage works within the alignment for taxonomy. Regardless of whether the setting makes sense, it's definitely broken code-wise.

It is also not well defined which coverage should be computed, since the 2bLCA procedure performs multiple alignments. What is currently implemented (however broken) tries to compute the coverage between the extracted subfragment of the database hit and the other database hits.

https://github.com/soedinglab/MMseqs2/wiki#the-concept-of-lca

In the figure there, this would be the coverage of the pink hit 1 fragment versus hits 2, 3 and 4. I am not sure which coverage would make the most sense to compute, and any choice would require us to run new benchmarks.
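To make the arithmetic behind -c concrete, here is a minimal sketch of a coverage check on one alignment. The helper names and the coordinates are hypothetical, and the check assumes --cov-mode 0 (both sequences must reach the threshold). It only illustrates the fraction of aligned residues; it does not reproduce what the taxonomy workflow computes internally, which is the open question above.

```shell
#!/bin/sh
# Hypothetical helper: fraction of a sequence covered by an alignment.
# Arguments: 1-based start, 1-based end (inclusive), sequence length.
coverage() {
    awk -v s="$1" -v e="$2" -v l="$3" 'BEGIN { printf "%.3f\n", (e - s + 1) / l }'
}

# Hypothetical helper: succeeds if coverage $1 is at least threshold $2.
passes_cov() {
    awk -v c="$1" -v t="$2" 'BEGIN { exit !(c >= t) }'
}

# Illustrative numbers only: 80 of 100 query residues aligned,
# but only 80 of 400 target residues.
query_cov=$(coverage 1 80 100)
target_cov=$(coverage 11 90 400)
echo "query coverage:  $query_cov"
echo "target coverage: $target_cov"

# Assuming --cov-mode 0, both values must reach the -c threshold.
if passes_cov "$query_cov" 0.8 && passes_cov "$target_cov" 0.8; then
    echo "hit kept at -c 0.8"
else
    echo "hit discarded at -c 0.8"
fi
```

With these numbers the hit is discarded even though the query is well covered, which is one way a strict -c can remove hits that still look reasonable from the query side.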

You need to pass --tax-output-mode 2 to also compute and store the alignments. They will be placed at taxresults.database_aln in your case.
easy-taxonomy and taxonomy behave the same. The only difference is that the former takes FASTA input while the latter only takes MMseqs2 databases.
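
A sketch of how the alignment inspection could look, reusing the database names from the original command. The fallback target path and the output file name taxresults_aln.m8 are placeholders, and the second step assumes that the _aln database can be read by convertalis like any other alignment database, which I have not verified for every version. The script only prints the commands unless DRY_RUN=0 is set.

```shell
#!/bin/sh
# Sketch only: the fallback target path and the .m8 file name are placeholders.
QUERY_DB="queryDB"
TARGET_DB="${MMSEQS2_DATABASE_DIR:-/path/to/targetDB}"
RESULT_DB="taxresults.database"
TMP_DIR="tmp"

# Print the command by default; set DRY_RUN=0 to actually execute it.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# 1) Taxonomy search that also stores the alignments in ${RESULT_DB}_aln.
run mmseqs taxonomy "$QUERY_DB" "$TARGET_DB" "$RESULT_DB" "$TMP_DIR" \
    --tax-output-mode 2 --threads 28

# 2) Assumption: convert the stored alignments to a BLAST-style table
#    so that the best hit per query can be inspected.
run mmseqs convertalis "$QUERY_DB" "$TARGET_DB" "${RESULT_DB}_aln" taxresults_aln.m8
```

Check mmseqs taxonomy -h and mmseqs convertalis -h in your image before running, since option names can differ between releases.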

The main algorithmic difference depends on the input type, though. With nucleotide input, it uses the contig taxonomy procedure described in the MMseqs2 taxonomy paper, which includes the fast ORF prefiltering and the taxonomy majority voting.

The ORF prefiltering can be overaggressive for short reads. Our previous recommendation was to disable it with --orf-filter 0 if you give it short-read input. We are currently developing a better fix in #832 that should not require changing this parameter.
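
For short-read nucleotide input, the earlier workaround could look roughly like this. The file names reads.fasta and taxreport and the fallback target path are hypothetical; only --orf-filter 0 comes from the recommendation above. As a precaution the script prints the command unless DRY_RUN=0 is set.

```shell
#!/bin/sh
# Sketch only: input, output and fallback target names are placeholders.
READS="reads.fasta"
TARGET_DB="${MMSEQS2_DATABASE_DIR:-/path/to/targetDB}"
REPORT_PREFIX="taxreport"
TMP_DIR="tmp"

# Print the command by default; set DRY_RUN=0 to actually execute it.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# easy-taxonomy takes FASTA input directly; --orf-filter 0 disables
# the ORF prefiltering for the short reads.
run mmseqs easy-taxonomy "$READS" "$TARGET_DB" "$REPORT_PREFIX" "$TMP_DIR" \
    --orf-filter 0
```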

For protein input, neither the ORF filtering nor the majority voting takes place.

Apr 18 '24 10:04 milot-mirdita