StringTie --merge
Hey,
I have a question regarding StringTie --merge:
I am using StringTie to assemble novel transcripts in a very small non model genome. The genome has very few introns and mostly single exon genes. I was following the basic reference guided StringTie workflow and have assembled transcripts using a dataset of around 60 paired-end & strand specific RNA-seq samples (50bp) with various reads counts and conditons. In an initial run I looked for assembled transcripts I deemed novel. However, after including more data in the assembly process and merging the assembled files, many transcripts that were predicted prior to the inclusion were lost.
I am running stringtie with default options:
stringtie -rf -g 5 -G reference.gtf -o /out.gtf -p 10 -l labels in.bam
followed with stringtie --merge:
stringtie --merge -p 5 -G reference.gtf -o stringtie_merged.gtf mergelist.txt
I then extracted novel assembled transcripts that were not contained in the reference annotationy. (MSTRG.31 in this example)
seqnames start end width strand source type score phase gene_id transcript_id ref_gene_id exon_number gene_name
76 LT962476.2 10373 11869 1497 - StringTie transcript 1000 NA MSTRG.30 unassigned_transcript_2 BQ9382_C1-0010 <NA> XUT5
77 LT962476.2 10373 11869 1497 - StringTie exon 1000 NA MSTRG.30 unassigned_transcript_2 BQ9382_C1-0010 1 XUT5
78 LT962476.2 12209 13567 1359 - StringTie transcript 1000 NA MSTRG.31 MSTRG.31.1 <NA> <NA> <NA>
79 LT962476.2 12209 13567 1359 - StringTie exon 1000 NA MSTRG.31 MSTRG.31.1 <NA> 1 <NA>
80 LT962476.2 12332 13459 1128 + StringTie transcript 1000 NA MSTRG.32 unassigned_transcript_3 BQ9382_C1-0015 <NA> ACIB2EUKG767782
81 LT962476.2 12332 13459 1128 + StringTie exon 1000 NA MSTRG.32 unassigned_transcript_3 BQ9382_C1-0015 1 ACIB2EUKG767782
92 LT962476.2 15085 15498 414 - StringTie transcript 1000 NA MSTRG.38 unassigned_transcript_4 BQ9382_C1-0020 <NA> ACIB2EUKG767783
93 LT962476.2 15085 15498 414 - StringTie exon 1000 NA MSTRG.38 unassigned_transcript_4 BQ9382_C1-0020 1 ACIB2EUKG767783
94 LT962476.2 16359 16598 240 + StringTie transcript 1000 NA MSTRG.39 unassigned_transcript_5 BQ9382_C1-0025 <NA> ACIB2EUKG767784
95 LT962476.2 16359 16598 240 + StringTie exon 1000 NA MSTRG.39 unassigned_transcript_5 BQ9382_C1-0025 1 ACIB2EUKG767784
I then included two additonal datasets, also consisting of paired-end strand specific data (150bp) with various reads counts and conditons. I ran stringtie-merge for each separate dataset (MSTRG.31 does not pop up at its coorinates for the two additional datasets as I expected due to different sample conditions etc.):
Additional dataset 1:
seqnames start end width strand source type score phase gene_id transcript_id ref_gene_id exon_number gene_name
136 LT962476.2 10373 11869 1497 - StringTie transcript 1000 NA MSTRG.45 unassigned_transcript_2 BQ9382_C1-0010 <NA> XUT5
137 LT962476.2 10373 11869 1497 - StringTie exon 1000 NA MSTRG.45 unassigned_transcript_2 BQ9382_C1-0010 1 XUT5
138 LT962476.2 12332 13459 1128 + StringTie transcript 1000 NA MSTRG.46 unassigned_transcript_3 BQ9382_C1-0015 <NA> ACIB2EUKG767782
139 LT962476.2 12332 13459 1128 + StringTie exon 1000 NA MSTRG.46 unassigned_transcript_3 BQ9382_C1-0015 1 ACIB2EUKG767782
140 LT962476.2 15085 15498 414 - StringTie transcript 1000 NA MSTRG.47 unassigned_transcript_4 BQ9382_C1-0020 <NA> ACIB2EUKG767783
141 LT962476.2 15085 15498 414 - StringTie exon 1000 NA MSTRG.47 unassigned_transcript_4 BQ9382_C1-0020 1 ACIB2EUKG767783
142 LT962476.2 16359 16598 240 + StringTie transcript 1000 NA MSTRG.48 unassigned_transcript_5 BQ9382_C1-0025 <NA> ACIB2EUKG767784
143 LT962476.2 16359 16598 240 + StringTie exon 1000 NA MSTRG.48 unassigned_transcript_5 BQ9382_C1-0025 1 ACIB2EUKG767784
Additional dataset 2 shows two other novel transcripts (MSTRG.32 & MSTRG.33):
seqnames start end width strand source type score phase gene_id transcript_id ref_gene_id exon_number gene_name
75 LT962476.2 8200 9269 1070 + StringTie transcript 1000 NA MSTRG.32 MSTRG.32.1 <NA> <NA> <NA>
76 LT962476.2 8200 9269 1070 + StringTie exon 1000 NA MSTRG.32 MSTRG.32.1 <NA> 1 <NA>
77 LT962476.2 9389 9693 305 + StringTie transcript 1000 NA MSTRG.33 MSTRG.33.1 <NA> <NA> <NA>
78 LT962476.2 9389 9693 305 + StringTie exon 1000 NA MSTRG.33 MSTRG.33.1 <NA> 1 <NA>
79 LT962476.2 10373 11869 1497 - StringTie transcript 1000 NA MSTRG.34 unassigned_transcript_2 BQ9382_C1-0010 <NA> XUT5
80 LT962476.2 10373 11869 1497 - StringTie exon 1000 NA MSTRG.34 unassigned_transcript_2 BQ9382_C1-0010 1 XUT5
81 LT962476.2 12332 13459 1128 + StringTie transcript 1000 NA MSTRG.35 unassigned_transcript_3 BQ9382_C1-0015 <NA> ACIB2EUKG767782
82 LT962476.2 12332 13459 1128 + StringTie exon 1000 NA MSTRG.35 unassigned_transcript_3 BQ9382_C1-0015 1 ACIB2EUKG767782
83 LT962476.2 15085 15498 414 - StringTie transcript 1000 NA MSTRG.36 unassigned_transcript_4 BQ9382_C1-0020 <NA> ACIB2EUKG767783
84 LT962476.2 15085 15498 414 - StringTie exon 1000 NA MSTRG.36 unassigned_transcript_4 BQ9382_C1-0020 1 ACIB2EUKG767783
107 LT962476.2 16359 16598 240 + StringTie transcript 1000 NA MSTRG.44 unassigned_transcript_5 BQ9382_C1-0025 <NA> ACIB2EUKG767784
108 LT962476.2 16359 16598 240 + StringTie exon 1000 NA MSTRG.44 unassigned_transcript_5 BQ9382_C1-0025 1 ACIB2EUKG767784
I expected, that including all assembled GTFs in my mergelist.txt file and executing stringtie-merge would result in a merged annotation including MSTR.31 from my initial run as well as for example MSTRG.32 & MSTRG.33 from the additional dataset 2. However in the final merged GTF they are all missiang at the location:
seqnames start end width strand source type score phase gene_id transcript_id gene_name ref_gene_id exon_number
148 LT962476.2 10373 11869 1497 - StringTie transcript 1000 NA MSTRG.48 unassigned_transcript_2 XUT5 BQ9382_C1-0010 <NA>
149 LT962476.2 10373 11869 1497 - StringTie exon 1000 NA MSTRG.48 unassigned_transcript_2 XUT5 BQ9382_C1-0010 1
150 LT962476.2 12332 13459 1128 + StringTie transcript 1000 NA MSTRG.49 unassigned_transcript_3 ACIB2EUKG767782 BQ9382_C1-0015 <NA>
151 LT962476.2 12332 13459 1128 + StringTie exon 1000 NA MSTRG.49 unassigned_transcript_3 ACIB2EUKG767782 BQ9382_C1-0015 1
152 LT962476.2 15085 15498 414 - StringTie transcript 1000 NA MSTRG.50 unassigned_transcript_4 ACIB2EUKG767783 BQ9382_C1-0020 <NA>
153 LT962476.2 15085 15498 414 - StringTie exon 1000 NA MSTRG.50 unassigned_transcript_4 ACIB2EUKG767783 BQ9382_C1-0020 1
154 LT962476.2 16359 16598 240 + StringTie transcript 1000 NA MSTRG.51 unassigned_transcript_5 ACIB2EUKG767784 BQ9382_C1-0025 <NA>
155 LT962476.2 16359 16598 240 + StringTie exon 1000 NA MSTRG.51 unassigned_transcript_5 ACIB2EUKG767784 BQ9382_C1-0025 1
We are a bit puzzled by this behaviour. Could you expain how stringtie --merge includes or excludes assembled transcripts and what could have happend here?
Thank you very much!
This is a guess here, but I believe it is because they are single-exon. Then, the merge option excludes everything single-exon