funannotate

Orthofiller output incorporation?

Open · Dikaryotic opened this issue 3 years ago · 1 comment

Hi John,

This isn't an issue so much as a question. I've got a few genomes from the same genus that I'm predicting and annotating. I've used Orthofiller (https://github.com/mpdunne/orthofiller) to help find missing gene predictions after my initial individual prediction runs with Funannotate. Do you have any thoughts on how I could incorporate that output into my subsequent annotation runs? The output of Orthofiller is in GTF format.

Here is a snippet of one of the output files:

    scaffold_1 AUGUSTUS CDS 4294697 4294940 0.53 + 0 transcript_id "orthofiller_g1.t1"; gene_id "orthofiller_g1";
    scaffold_1 AUGUSTUS CDS 4295052 4295071 0.29 + 2 transcript_id "orthofiller_g1.t1"; gene_id "orthofiller_g1";
    scaffold_1 AUGUSTUS intron 4294941 4295051 0.28 + . transcript_id "orthofiller_g1.t1"; gene_id "orthofiller_g1";
    scaffold_1 AUGUSTUS start_codon 4294697 4294699 . + 0 transcript_id "orthofiller_g1.t1"; gene_id "orthofiller_g1";
    scaffold_1 AUGUSTUS stop_codon 4295069 4295071 . + 0 transcript_id "orthofiller_g1.t1"; gene_id "orthofiller_g1";
    scaffold_1 AUGUSTUS CDS 4321123 4321587 0.97 - 0 transcript_id "orthofiller_g2.t1"; gene_id "orthofiller_g2";
    scaffold_1 AUGUSTUS start_codon 4321585 4321587 . - 0 transcript_id "orthofiller_g2.t1"; gene_id "orthofiller_g2";
    scaffold_1 AUGUSTUS stop_codon 4321123 4321125 . - 0 transcript_id "orthofiller_g2.t1"; gene_id "orthofiller_g2";
    scaffold_1 AUGUSTUS CDS 4380025 4380213 1 - 0 transcript_id "orthofiller_g3.t1"; gene_id "orthofiller_g3";
    scaffold_1 AUGUSTUS CDS 4380262 4380278 0.65 - 2 transcript_id "orthofiller_g3.t1"; gene_id "orthofiller_g3";

My motivation for this is that I have a novel native species with a fairly good genome assembly, but no RNA-seq data. However, there are two other decent genomes that do have public RNA-seq data available, and the predictions for those genomes look far more robust and correct. I was hoping to use that information to help with my own novel genome.

Cheers, Chris

Dikaryotic, Sep 11 '22 22:09

Hi @Dikaryotic.

Had you noticed that some genes were specifically missing? Are these gene models from Orthofiller complete, i.e., do they have proper start/stop codons? This looks like AUGUSTUS GTF format, so you could perhaps convert it to standard GFF3, pass the gene models to --other_gff, and give them some weight so that they are passed directly to EVM. I think AUGUSTUS has a GTF-to-GFF3 conversion script: gtf2gff.pl < input.gtf --out=out.gff3 --gff3.

But if they are not complete gene models, then you do not want to pass them to EVM that way. Instead, they would be treated as protein evidence, which EVM uses only to score de novo gene models.
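To answer the "are they complete?" question before deciding which route to take, a quick script can tally the features per transcript in the GTF. The following is only a rough sketch: `summarize_gtf` is a hypothetical helper name, and the completeness rule it applies (a start_codon record, a stop_codon record, and a total CDS length divisible by 3) is an assumption for illustration, not a check that funannotate or EVM performs itself.

```python
import re
from collections import defaultdict


def summarize_gtf(lines):
    """Group AUGUSTUS-style GTF records by transcript_id and flag complete models.

    Assumed rule: a model is "complete" if it has a start_codon record,
    a stop_codon record, and a total CDS length that is a multiple of 3.
    """
    info = defaultdict(lambda: {"start": False, "stop": False, "cds_len": 0})
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # GTF is tab-separated; fall back to whitespace for pasted snippets.
        cols = line.split("\t") if "\t" in line else line.split(None, 8)
        if len(cols) < 9:
            continue
        feature, start, end, attrs = cols[2], int(cols[3]), int(cols[4]), cols[8]
        match = re.search(r'transcript_id "([^"]+)"', attrs)
        if not match:
            continue
        rec = info[match.group(1)]
        if feature == "start_codon":
            rec["start"] = True
        elif feature == "stop_codon":
            rec["stop"] = True
        elif feature == "CDS":
            rec["cds_len"] += end - start + 1  # GTF coordinates are inclusive
    return {
        tid: {**rec, "complete": rec["start"] and rec["stop"] and rec["cds_len"] % 3 == 0}
        for tid, rec in info.items()
    }
```

Calling `summarize_gtf(open("orthofiller.gtf"))` (the filename is a placeholder) returns one entry per transcript. Models flagged as complete would be candidates for the --other_gff route, and the rest would be better treated as evidence only.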

If the species/isolates have high nucleotide identity, you could try mapping the RNA-seq reads to your genome with funannotate train. That would only work well if they are indeed very closely related (i.e., different isolates of the same species).

nextgenusfs, Sep 13 '22 17:09