pizzly

handling duplicated sequences in output.fusions.fasta

Open racng opened this issue 6 years ago • 0 comments

Hi, in the filtered output.fusions.fasta from a run on a single sample, I found that the number of headers is larger than the number of unique sequences:

grep -v '>' output.fusions.fasta | sort | uniq | wc -l
11429
grep '>' output.fusions.fasta | sort | uniq | wc -l
13330

Here I looked for the header of a random repeated sequence:

grep -B 1 AAAACCCTTCCCTCCCCCGCTCCCCCGGAAGTGCTTTTCCAAGATTCGGGCCGGAGAGAGGCCTTGTAGGCACAGCGGCTGAGACTCGATCTGCTCCAAGTAGGGGCTCCAGCGCGGGTCGGAGTCTGGGGGTTCGCGCCCGCCGACCCGCGCCCTGCTCCCTCTCAGCACCTGGGCGGACGACGTGAACGATCAAGAGAGAGGGACCATAGTAGATCACGAGAAAAGAGTCGACGTCATAAATCCCGTAGTAGAGACCGTCATGACGATTATTACAGAGAGAGAAGCAGAGAACGAGAGAGGCACCGGGATCGTGACCGAGACCGTGACCGAGAGCGTGACCGAGAGCGCGAATATCGTCATCGTTAGAAGCTGAAGGAAGAGGATCACCTTCCAAGACAAAACAGTCTTCATGGGGGAAAAATGACGCTTGTCCAGCAGTTTGCTTCTTGTGATTGAACTGAACCTGTAAGGATTCATGGATAAAATGAACAGGAATAGATCTGAATAAAGCAAATCTGCATAAATGGTAACCAGTAGCTCTACTTTTATTTTTTATGTTGCTTAACTGTTTTATTTGAAGGAAACCTGTGTGATTTAAAAAGTTATAGCTTTTGCAACTTTATTACTGGTTATATACATTTGGCCATTATGATGTGCAAGCAATTGGAAAAAAAGTCAAGTAAATGCTTGTTTTTGTAGTAGTTTGTTCTTGTTAAAAATGTTTATATGATAATGTCTGTAAACAGCATCACTTTGATTACAATAGATGTAGTGTTGTAATAAACTGTTTAATGGGG$ output.fusions.fasta 
>ENST00000547219.5_0:182_ENST00000266679.8_1611:2229
AAAACCCTTCCCTCCCCCGCTCCCCCGGAAGTGCTTTTCCAAGATTCGGGCCGGAGAGAGGCCTTGTAGGCACAGCGGCTGAGACTCGATCTGCTCCAAGTAGGGGCTCCAGCGCGGGTCGGAGTCTGGGGGTTCGCGCCCGCCGACCCGCGCCCTGCTCCCTCTCAGCACCTGGGCGGACGACGTGAACGATCAAGAGAGAGGGACCATAGTAGATCACGAGAAAAGAGTCGACGTCATAAATCCCGTAGTAGAGACCGTCATGACGATTATTACAGAGAGAGAAGCAGAGAACGAGAGAGGCACCGGGATCGTGACCGAGACCGTGACCGAGAGCGTGACCGAGAGCGCGAATATCGTCATCGTTAGAAGCTGAAGGAAGAGGATCACCTTCCAAGACAAAACAGTCTTCATGGGGGAAAAATGACGCTTGTCCAGCAGTTTGCTTCTTGTGATTGAACTGAACCTGTAAGGATTCATGGATAAAATGAACAGGAATAGATCTGAATAAAGCAAATCTGCATAAATGGTAACCAGTAGCTCTACTTTTATTTTTTATGTTGCTTAACTGTTTTATTTGAAGGAAACCTGTGTGATTTAAAAAGTTATAGCTTTTGCAACTTTATTACTGGTTATATACATTTGGCCATTATGATGTGCAAGCAATTGGAAAAAAAGTCAAGTAAATGCTTGTTTTTGTAGTAGTTTGTTCTTGTTAAAAATGTTTATATGATAATGTCTGTAAACAGCATCACTTTGATTACAATAGATGTAGTGTTGTAATAAACTGTTTAATGGGG
--
>ENST00000547219.5_0:182_ENST00000456847.7_1297:1915
AAAACCCTTCCCTCCCCCGCTCCCCCGGAAGTGCTTTTCCAAGATTCGGGCCGGAGAGAGGCCTTGTAGGCACAGCGGCTGAGACTCGATCTGCTCCAAGTAGGGGCTCCAGCGCGGGTCGGAGTCTGGGGGTTCGCGCCCGCCGACCCGCGCCCTGCTCCCTCTCAGCACCTGGGCGGACGACGTGAACGATCAAGAGAGAGGGACCATAGTAGATCACGAGAAAAGAGTCGACGTCATAAATCCCGTAGTAGAGACCGTCATGACGATTATTACAGAGAGAGAAGCAGAGAACGAGAGAGGCACCGGGATCGTGACCGAGACCGTGACCGAGAGCGTGACCGAGAGCGCGAATATCGTCATCGTTAGAAGCTGAAGGAAGAGGATCACCTTCCAAGACAAAACAGTCTTCATGGGGGAAAAATGACGCTTGTCCAGCAGTTTGCTTCTTGTGATTGAACTGAACCTGTAAGGATTCATGGATAAAATGAACAGGAATAGATCTGAATAAAGCAAATCTGCATAAATGGTAACCAGTAGCTCTACTTTTATTTTTTATGTTGCTTAACTGTTTTATTTGAAGGAAACCTGTGTGATTTAAAAAGTTATAGCTTTTGCAACTTTATTACTGGTTATATACATTTGGCCATTATGATGTGCAAGCAATTGGAAAAAAAGTCAAGTAAATGCTTGTTTTTGTAGTAGTTTGTTCTTGTTAAAAATGTTTATATGATAATGTCTGTAAACAGCATCACTTTGATTACAATAGATGTAGTGTTGTAATAAACTGTTTAATGGGG

The sequence appears twice because ENST00000547219.5 is paired once with ENST00000266679.8 and once with ENST00000456847.7. However, ENST00000266679.8 and ENST00000456847.7 are transcripts of the same gene, so the two records seem biologically redundant. Would it make sense to reduce the redundancy by converting the ENST IDs to ENSG IDs and keeping only unique header+sequence pairs before proceeding to the kallisto requantification?
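
To make the proposal concrete, here is a minimal sketch of the collapsing step in Python. It is not part of pizzly. It assumes headers follow the layout shown above (`>ENST..._start:end_ENST..._start:end`) and that a transcript-to-gene mapping is supplied by the user; the names `read_fasta`, `collapse_to_genes`, and `tx2gene` are hypothetical. Because the coordinates in the two headers differ (1611:2229 vs 1297:1915), simply rewriting ENST to ENSG would not make the headers identical, so the sketch uses (gene A, gene B, sequence) as the key and keeps the first original header.

```python
import re

# Header layout assumed from the records in this issue:
#   >ENST<id>_<start>:<end>_ENST<id>_<start>:<end>
HEADER_RE = re.compile(r"^>(ENST[0-9.]+)_(\d+:\d+)_(ENST[0-9.]+)_(\d+:\d+)$")


def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line, []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def collapse_to_genes(records, tx2gene):
    """Keep one record per (gene A, gene B, sequence).

    tx2gene is a user-supplied dict mapping transcript IDs to gene IDs
    (hypothetical here; it could be built from an annotation file).
    Transcripts missing from the mapping fall back to their own ID.
    Headers that do not match the assumed layout are passed through.
    """
    seen = set()
    kept = []
    for header, seq in records:
        match = HEADER_RE.match(header)
        if match is None:
            kept.append((header, seq))
            continue
        tx_a, _, tx_b, _ = match.groups()
        key = (tx2gene.get(tx_a, tx_a), tx2gene.get(tx_b, tx_b), seq)
        if key in seen:
            continue  # same gene pair and same sequence already kept
        seen.add(key)
        kept.append((header, seq))  # first original header is retained
    return kept
```

A possible usage would be `collapse_to_genes(read_fasta(open("output.fusions.fasta")), tx2gene)` and writing the kept pairs back out as FASTA. Keeping the original transcript-level header, rather than a rewritten gene-level one, keeps every header in the output unique.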

racng · Apr 08 '20 05:04