Orthofiller output incorporation?
Hi John,
This isn't an issue so much as a question. I've got a few genomes from the same genus that I'm predicting and annotating. I've used Orthofiller (https://github.com/mpdunne/orthofiller) to help find missing gene predictions after my initial individual prediction runs with Funannotate. Do you have any thoughts on how I could incorporate that output into my subsequent annotation runs? The output of Orthofiller is in GTF format.
Here is a snippet of one of the output files:
scaffold_1 AUGUSTUS CDS 4294697 4294940 0.53 + 0 transcript_id "orthofiller_g1.t1"; gene_id "orthofiller_g1";
scaffold_1 AUGUSTUS CDS 4295052 4295071 0.29 + 2 transcript_id "orthofiller_g1.t1"; gene_id "orthofiller_g1";
scaffold_1 AUGUSTUS intron 4294941 4295051 0.28 + . transcript_id "orthofiller_g1.t1"; gene_id "orthofiller_g1";
scaffold_1 AUGUSTUS start_codon 4294697 4294699 . + 0 transcript_id "orthofiller_g1.t1"; gene_id "orthofiller_g1";
scaffold_1 AUGUSTUS stop_codon 4295069 4295071 . + 0 transcript_id "orthofiller_g1.t1"; gene_id "orthofiller_g1";
scaffold_1 AUGUSTUS CDS 4321123 4321587 0.97 - 0 transcript_id "orthofiller_g2.t1"; gene_id "orthofiller_g2";
scaffold_1 AUGUSTUS start_codon 4321585 4321587 . - 0 transcript_id "orthofiller_g2.t1"; gene_id "orthofiller_g2";
scaffold_1 AUGUSTUS stop_codon 4321123 4321125 . - 0 transcript_id "orthofiller_g2.t1"; gene_id "orthofiller_g2";
scaffold_1 AUGUSTUS CDS 4380025 4380213 1 - 0 transcript_id "orthofiller_g3.t1"; gene_id "orthofiller_g3";
scaffold_1 AUGUSTUS CDS 4380262 4380278 0.65 - 2 transcript_id "orthofiller_g3.t1"; gene_id "orthofiller_g3";
My motivation for this is that I have a novel native species with a fairly good genome assembly, but no RNA-seq data. However, there are two other decent genomes that do have public RNA-seq data available, and the predictions for those genomes look far more robust and correct. I was hoping to use that information to help with my own novel genome.
Cheers, Chris
Hi @Dikaryotic.
Had you noticed that some genes were specifically missing? Are these gene models from Orthofiller complete, i.e. do they have proper start/stop codons? This looks to be AUGUSTUS GTF format, so you could perhaps convert it to standard GFF3, pass the gene models to --other_gff, and give them some weight so they are passed directly to EVM. I think AUGUSTUS has a GTF-to-GFF3 conversion script: gtf2gff.pl < input.gtf --out=out.gff3 --gff3
But if they are not complete gene models, then you do not want to pass them to EVM that way. They would instead be treated as protein evidence, which EVM uses only to score de novo gene models.
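The completeness question above can be checked directly from the GTF before deciding which route to take. Below is a minimal Python sketch; it is not part of funannotate or Orthofiller. The function names are hypothetical, and the definition of "complete" (a start_codon and a stop_codon feature are both present, and the total CDS length is a non-zero multiple of 3) is an assumption for illustration.

```python
import re
from collections import defaultdict


def parse_attributes(field):
    """Turn 'transcript_id "x"; gene_id "y";' into a dict."""
    return dict(re.findall(r'(\S+) "([^"]*)"', field))


def classify_transcripts(gtf_lines):
    """Split transcript IDs into (complete, incomplete) lists.

    Assumed criterion: start_codon and stop_codon features exist and
    the summed CDS length is a non-zero multiple of 3.
    """
    features = defaultdict(set)   # transcript_id -> feature types seen
    cds_len = defaultdict(int)    # transcript_id -> total CDS bases

    for line in gtf_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # GTF is tab-delimited; fall back to whitespace for pasted text.
        cols = line.split("\t") if "\t" in line else line.split(None, 8)
        if len(cols) < 9:
            continue
        ftype, start, end = cols[2], int(cols[3]), int(cols[4])
        tid = parse_attributes(cols[8]).get("transcript_id")
        if tid is None:
            continue
        features[tid].add(ftype)
        if ftype == "CDS":
            cds_len[tid] += end - start + 1  # coordinates are inclusive

    complete, incomplete = [], []
    for tid in sorted(features):
        has_codons = {"start_codon", "stop_codon"} <= features[tid]
        in_frame = cds_len[tid] > 0 and cds_len[tid] % 3 == 0
        (complete if has_codons and in_frame else incomplete).append(tid)
    return complete, incomplete
```

Calling classify_transcripts(open("orthofiller.gtf")) (the file name is a placeholder) returns the two lists. On the snippet posted above, orthofiller_g1.t1 and orthofiller_g2.t1 come out as complete and orthofiller_g3.t1 as incomplete, but only because the snippet is cut off before the remaining g3 features.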
If the species/isolates have high nucleotide identity, you could try mapping the RNA-seq reads to your genome with funannotate train. That would only work well if they are indeed very closely related (i.e. different isolates of the same species).