funannotate

Issues with multiple transcripts: MEROPS, BUSCO, PFAM and dbCAN/CAZyme annotations missing for T2 transcripts from final annotation file

Open calizilla opened this issue 1 year ago • 1 comments

All transcripts in the final annotations.txt file produced by funannotate 'annotate' have been converted to T1 transcripts, even though the input files contain both T1 and T2 transcripts. In the other final output files (e.g. proteins.fa, gff3, gbk) the T2 transcript IDs are retained. The ID conversion on its own is not a huge problem and is easily fixed if desired; the major problem is that all T2 transcripts are missing their annotations from BUSCO, dbCAN/CAZyme, PFAM and MEROPS.

The unifying feature of these 4 database annotations is that they were executed by the funannotate 'annotate' step itself, and not executed manually (see below).

I ran the funannotate workflow on a fungal genome on an HPC cluster. Due to issues at various steps, I ran some of the annotations manually (i.e. with a separate custom script, not executed as part of funannotate), being careful to follow the same parameters as applied in the funannotate Python code. Manual annotations were run for:

  • CodingQuarry
  • Phobius
  • antiSMASH
  • InterProScan
  • eggNOG-mapper

At the antiSMASH step, I encountered a previously described error caused by multiple transcripts. I followed the suggestion by @sunnycqcn in this antiSMASH issue to use AGAT to keep only the longest transcript, and then ran 'funannotate fix' to update the gbk and tbl files. After AGAT, there were 579 T2 transcripts and the remainder were T1.

After completing the funannotate workflow, I happened to notice that some genes were missing annotations that were present in the manual annotation output files, and that in all cases these were genes with the 'T2' designation in the proteins.fa file. I wrote a script to check the annotations in the 'annotate_misc' directory against those in the final 'annotate_results/annotations.txt' file for the 579 T2 genes. Every gene that had an annotation in 'annotate_misc' from any of BUSCO, PFAM, MEROPS or dbCAN was missing that annotation in 'annotate_results/annotations.txt'.
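For anyone wanting to run a similar check, here is a minimal sketch in Python. It is not the script I used, and it rests on assumptions that should be verified against your own output: that each 'annotate_misc' file is a three-column, tab-separated table (ID, field, value), that the final table has a 'GeneID' column plus one column per database, and that transcript IDs end in '-T1', '-T2' and so on. The function names are made up for illustration.

```python
import csv
import re

# Assumed ID convention: transcript IDs look like "<gene>-T<n>".
TRANSCRIPT_SUFFIX = re.compile(r"-T\d+$")


def gene_id(transcript_id):
    """Strip the assumed '-T<n>' suffix to recover the gene ID."""
    return TRANSCRIPT_SUFFIX.sub("", transcript_id)


def annotated_t2_ids(misc_lines):
    """Return the T2 IDs that carry a non-empty value in a misc file.

    Assumes a three-column TSV: ID, field, value.
    """
    found = set()
    for row in csv.reader(misc_lines, delimiter="\t"):
        if len(row) >= 3 and row[0].endswith("-T2") and row[2].strip():
            found.add(row[0])
    return found


def missing_in_final(misc_lines, final_lines, column):
    """List T2 IDs annotated in the misc file but empty in the final table.

    `column` is the header of the database column to inspect
    (for example "PFAM"); the header names are an assumption.
    """
    final = {
        row["GeneID"]: row
        for row in csv.DictReader(final_lines, delimiter="\t")
    }
    missing = []
    for tid in sorted(annotated_t2_ids(misc_lines)):
        row = final.get(gene_id(tid))
        if row is None or not (row.get(column) or "").strip():
            missing.append(tid)
    return missing
```

Both functions accept any iterable of lines, so an open file handle works as well as a list of strings. Matching is done on the gene ID because the transcript IDs in the final table no longer carry the T2 suffix.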

Providing pre-computed annotation files to the funannotate 'annotate' step is a valid approach, using the parameters:

  --eggnog             Eggnog-mapper annotations file (if NOT installed)
  --antismash          antiSMASH secondary metabolism results (GBK file from output)
  --iprscan            InterProScan5 XML file
  --phobius            Phobius pre-computed results (if phobius NOT installed)

So while I can't be sure this bug would occur for users relying solely on the funannotate codebase (i.e. not executing some of the annotations manually), it seems likely that it affects others who have had to perform steps manually, as I did.

One proposed solution would be to adjust the way the 'annotate' step treats the transcript IDs, and not perform a conversion of everything to T1 when compiling all the annotations.
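To illustrate why the conversion could lose annotations, here is a small Python sketch. It is a hypothetical model of the behaviour I suspect, not funannotate's actual code: `to_t1` and `compile_annotations` are invented names, and I have not confirmed that this is the real mechanism.

```python
import re


def to_t1(transcript_id):
    """Hypothetical model of the reported conversion: force '-T<n>' to '-T1'."""
    return re.sub(r"-T\d+$", "-T1", transcript_id)


def compile_annotations(transcripts, annotations, normalise=None):
    """Collect the annotations for each transcript.

    `annotations` maps the ID used by the annotation tool to a list of hits.
    If `normalise` is given, it is applied to each transcript ID before the
    lookup, which models a conversion step; otherwise the ID is used as-is.
    """
    table = {}
    for tid in transcripts:
        key = normalise(tid) if normalise else tid
        table[tid] = annotations.get(key, [])
    return table


# Annotation tools were run on proteins.fa, so hits are keyed by the T2 ID.
hits = {"FUN_000002-T2": ["PFAM:PF00002"]}
transcripts = ["FUN_000002-T2"]

converted = compile_annotations(transcripts, hits, normalise=to_t1)
preserved = compile_annotations(transcripts, hits)
```

In this model `converted` ends up with an empty list for the T2 transcript, because the lookup key no longer matches the ID the hits were recorded under, while `preserved` keeps the hit. Keying on the unmodified transcript ID throughout would avoid the mismatch.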

calizilla avatar Jul 12 '24 03:07 calizilla

I also encountered a similar problem with the annotation step.

I ran Eggnog-mapper and InterProScan locally. In the resulting GFF file, the EC_number, EggNog, COG, CAZyme and other annotations are completely missing. However, all of those annotations are present in the annotations.txt file.

(edit) Problem solved for me: the Eggnog-mapper output file was badly formatted :/

VDaric avatar Apr 08 '25 10:04 VDaric