Look into using taxon result limited blastp instead of top-n limited
Taxon assignment is currently performed by running DIAMOND blastp on contig proteins, finding the LCA and majority vote.
Currently, the blast result simply returns the top 200 database hits which is not ideal if there are many high-scoring matches
That is currently decided here in python: https://github.com/KwanLab/Autometa/blob/b1ec370c44e7accfc917aaad1a685133537fd915/autometa/common/external/diamond.py#L57-L150 And, once merged, here in Nextflow: https://github.com/KwanLab/Autometa/blob/7618ee18bb31c4de9af81cf2fc093ff2d1ff0ca2/conf/modules.config#L42
To investigate:
It looks like DIAMOND can build the database with taxon info and limit to 1 hit per species, which is more ideal.
e.g. build... (note this has been done and the taxon_mapped_nr_diamond.dmnd file exists on the server)
diamond makedb \
--in $DIR_PREFIX/Databases/autometa_databases/nr \
--db $DIR_PREFIX/Databases/autometa_databases/taxon_mapped_nr_diamond \
--taxonmap $DIR_PREFIX/autometa_databases/prot.accession2taxid.FULL.gz \
--taxonnodes $DIR_PREFIX/autometa_databases/nodes.dmp \
--threads 40
then run diamonnd blastp --taxon-k 1 ... ?
--taxon-k maximum number of targets to report per species
Additionally, this could potentially remove need for a taxid lookup step in Autometa by changing DIAMOND blastp's output to --outfmt 6 staxids
Sorta related to https://github.com/KwanLab/Autometa/issues/11
Also to double check--What is the minimum diamond version corresponding to the availability of the --taxonmap and --taxonnodes parameters? The diamond blastp parameters?
I think it also relevant to mention, there is another diamond blastp parameter --unal 1 that may simplify issue #305
diamond --version --> diamond version 2.0.14
Under aligner options:
diamond help
# Aligner options:
--unal report unaligned queries (0=no, 1=yes)
This will write * under the subject sequence ID column