Can StringTie --merge be used to create a slightly improved reference annotation?

Open ggstatgen opened this issue 5 years ago • 4 comments

Hi

The official dog gtf annotation on Ensembl V100 is missing some important genes. Interestingly, Ensembl also holds two other dog annotations, based on 2 different breeds.

I would like to 'improve' the official dog gtf by merging it with the other two gtfs. I tried doing this with StringTie, using the following command line:

stringtie --merge -p 10 -v -o stringtie_merged_withref.gtf -G Canis_lupus_familiaris.CanFam3.1.100.chr_sorted.gtf input_gtf_list.txt

So I have the reference Ensembl annotation as an argument for -G followed by a list of the other gtf files to add.

The problem with this is that it'll discard all gene entries and only leave entries of type transcript or exon.

Any help/suggestions on how to do this properly would be appreciated.

ggstatgen, May 25 '20 12:05

Don't use stringtie --merge for this purpose; that feature was designed for merging multiple StringTie outputs, not generic annotation data. It can drop isoforms from the annotation by "assembling" overlapping/compatible isoforms together, and in most cases you do not want that kind of data loss.

gffread (http://dx.doi.org/10.12688/f1000research.23297.1) would be a better choice for merging multiple annotations: it reduces redundancy across datasets in a more conservative and controlled way. Take a look at the "Clustering" options of gffread (-M/-K/-Q) to control the merge process.

However, gffread won't solve the problem of losing gene entries; those are still going to be lost when merging data from multiple annotation files. In practice, gene entries cannot be preserved if you merge two or more transcripts (from different annotation files) that have different gene IDs. gffread can, however, generate locus features during clustering, and each locus feature also lists all the genes and transcripts that were merged/gathered into that locus. One can easily transform those locus features into gene features. (gffread groups under the same locus all the transcripts linked by exon overlaps, which is how gene features are generally defined.) Perhaps we can discuss this further in the issues section of the gffread GitHub repository (https://github.com/gpertea/gffread).
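To make the locus idea concrete, here is a minimal Python sketch of grouping transcripts that are linked by exon overlaps and then emitting one gene-style line per group. It is only an illustration of the concept, not gffread's actual implementation, and gffread's exact clustering criteria and output attributes may differ. The function names, the `sketch` source column, the `LOCUS_` ID scheme, and the example transcripts are all made up for this example.

```python
from collections import defaultdict

def exons_overlap(a, b):
    """True if any exon (start, end), 1-based inclusive, in a overlaps one in b."""
    return any(s1 <= e2 and s2 <= e1 for s1, e1 in a for s2, e2 in b)

def cluster_by_exon_overlap(transcripts):
    """Group transcripts on the same chromosome and strand that are linked,
    directly or through other transcripts, by exon overlap.

    transcripts: dict of id -> (chrom, strand, [(start, end), ...])
    Returns a sorted list of sorted id lists, one per locus.
    """
    parent = {t: t for t in transcripts}

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    ids = list(transcripts)
    # Pairwise comparison is quadratic; fine for a sketch, not for a genome.
    for i, a in enumerate(ids):
        for b in ids[i + 1:]:
            chrom_a, strand_a, exons_a = transcripts[a]
            chrom_b, strand_b, exons_b = transcripts[b]
            if chrom_a == chrom_b and strand_a == strand_b and exons_overlap(exons_a, exons_b):
                parent[find(a)] = find(b)

    loci = defaultdict(list)
    for t in ids:
        loci[find(t)].append(t)
    return sorted(sorted(members) for members in loci.values())

def locus_to_gene_line(n, members, transcripts):
    """Build a GFF3-style 'gene' line spanning every exon in the locus."""
    chrom, strand, _ = transcripts[members[0]]
    start = min(s for m in members for s, _ in transcripts[m][2])
    end = max(e for m in members for _, e in transcripts[m][2])
    attrs = f"ID=LOCUS_{n:06d};transcripts={','.join(members)}"
    return "\t".join([chrom, "sketch", "gene", str(start), str(end), ".", strand, ".", attrs])

# Hypothetical transcripts from three annotation files.
example = {
    "refA.1":   ("chr1", "+", [(100, 200), (300, 400)]),
    "breedB.7": ("chr1", "+", [(350, 450), (500, 600)]),
    "breedC.3": ("chr1", "+", [(550, 700)]),
    "refA.2":   ("chr1", "-", [(120, 180)]),
    "breedB.9": ("chr2", "+", [(100, 200)]),
}
loci = cluster_by_exon_overlap(example)
gene_lines = [locus_to_gene_line(i + 1, m, example) for i, m in enumerate(loci)]
```

In this toy input, refA.1 and breedC.3 do not overlap each other directly but end up in the same locus because both overlap breedB.7. That chaining is why the original gene IDs from the separate files cannot all be kept: one locus may gather transcripts that carried several different gene IDs.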

gpertea, May 25 '20 15:05

Thanks so much for the speedy reply - will gladly test gffread and let you know how it goes for this on its github repo if needed!

ggstatgen, May 25 '20 15:05

Hi, has anyone tested using gffread to improve an annotation? How were the results?

thanks, Cui

smallfishcui, Nov 23 '22 10:11

@gpertea Hi, I am trying to use gffread to improve the axolotl reference annotation. However, I read a paper (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7222033/) that says, "While GffRead can convert, sort, filter, transform, or cluster genomic features, GffCompare can be used to compare and merge different gene annotations." I am confused about which one I should use to improve the axolotl reference annotation. Looking forward to your reply. Thank you in advance.

dxu104, Oct 23 '23 16:10