Autometa icon indicating copy to clipboard operation
Autometa copied to clipboard

Look into using taxon result limited blastp instead of top-n limited

Open chasemc opened this issue 2 years ago • 1 comments

Taxon assignment is currently performed by running DIAMOND blastp on contig proteins, finding the LCA and majority vote.

Currently, the blast result simply returns the top 200 database hits which is not ideal if there are many high-scoring matches

That is currently decided here in python: https://github.com/KwanLab/Autometa/blob/b1ec370c44e7accfc917aaad1a685133537fd915/autometa/common/external/diamond.py#L57-L150 And, once merged, here in Nextflow: https://github.com/KwanLab/Autometa/blob/7618ee18bb31c4de9af81cf2fc093ff2d1ff0ca2/conf/modules.config#L42

To investigate:

It looks like DIAMOND can build the database with taxon info and limit to 1 hit per species, which is more ideal.

e.g. build... (note this has been done and the taxon_mapped_nr_diamond.dmnd file exists on the server)

 diamond makedb \
    --in $DIR_PREFIX/Databases/autometa_databases/nr \
    --db $DIR_PREFIX/Databases/autometa_databases/taxon_mapped_nr_diamond \
    --taxonmap $DIR_PREFIX/autometa_databases/prot.accession2taxid.FULL.gz \
    --taxonnodes $DIR_PREFIX/autometa_databases/nodes.dmp \
    --threads 40

then run diamonnd blastp --taxon-k 1 ... ?

--taxon-k maximum number of targets to report per species

Additionally, this could potentially remove need for a taxid lookup step in Autometa by changing DIAMOND blastp's output to --outfmt 6 staxids


Sorta related to https://github.com/KwanLab/Autometa/issues/11

chasemc avatar Apr 05 '23 16:04 chasemc

Also to double check--What is the minimum diamond version corresponding to the availability of the --taxonmap and --taxonnodes parameters? The diamond blastp parameters?

I think it also relevant to mention, there is another diamond blastp parameter --unal 1 that may simplify issue #305

diamond --version --> diamond version 2.0.14

Under aligner options:

diamond help
# Aligner options:
--unal                   report unaligned queries (0=no, 1=yes)

This will write * under the subject sequence ID column

evanroyrees avatar Apr 05 '23 18:04 evanroyrees