IsoQuant icon indicating copy to clipboard operation
IsoQuant copied to clipboard

Uniquely assigned reads seem non-unique

Open airichli opened this issue 11 months ago • 2 comments

Hi,

The *transcript_counts.tsv file seems to count non-unique reads as unique reads (I use the parameter --transcript_quantification unique_only --gene_quantification unique_only, so only unique reads would be considered). Here I show one example (HDHD5 gene in human):

Image

All reads shown in the figure are considered unique reads from IsoQuant, regardless of v3.5.0 or v3.6.3. However, all these shown reads are mapped to multiple isoforms.

I also include this example as a test data for your check, you can easily reproduce the result in few seconds. The possible reason IsoQuant treats these reads as unique may because the polyA position has one closest match, but all matches just differ in about 10 bp, using polyA position to decide uniqueness may not be appropriate. Could you please explain why IsoQuant treat these reads as unique?

test_data.tar.gz

Looking forward to hearing from you. Thanks.

airichli avatar Feb 21 '25 17:02 airichli

Dear @airichli

Thanks a lot for the report and for the test data, it made it quite easy to go deep into the algorithms.

TLDR: it looks like there is no obvious bug, but the case is quite tricky due to annotated isoforms with very closely located polyA sites.

Here's the full story:

  1. The intron chains of these reads is, indeed, ambiguous (as you show on the figure). Most of the reads have intron chain matching to 'ENST00000155674', 'ENST00000477157', 'ENST00000336737'.
  2. What IsoQuant does in this case, it looks into exon coverage. Basically, it checks whether exons (typically flanking) can give us more information and distinguish between isoforms with the same intron chain.
  3. What happens in this particular case: These 3 isoforms have very closely located polyA sites, namely 17137511, 17137520 and 17137522. Since it all happens on the reverse strand, it looks roughly like this:
ENST00000155674
polyA [                         ]-----------[ ....
ENST00000336737
   polyA  [                     ]-----------[ ....
ENST00000477157
      polyA  [                  ]-----------[ ....
  1. To examine how the exons overlap, IsoQuant splits all the exons into non-overlapping exon regions (exonic blocks):
ENST00000155674
polyA [   ][  ][                 ]-----------[ ....
ENST00000336737
    polyA  [  ][                 ]-----------[ ....
ENST00000477157
       polyA   [                 ]-----------[ ....
  1. What happens with a read, that has a identical intron chain. For example the read with ID starting with d227a379, which has a polyA site at 17137529.
ENST00000155674
polyA [   ][  ][                 ]-----------[ ....
ENST00000336737
    polyA  [  ][                 ]-----------[ ....
ENST00000477157
        polyA  [                 ]-----------[ ....
d227a379
           polyA   [             ]-----------[ ....

IsoQuant looks at all 3 isoform candidates and sees that the 2 leftmost exonic blocks of ENST00000155674 are located entirely within polyA region of our read. Similarly, for the leftmost block of ENST00000336737 - it is entirely covered by polyA tail. For IsoQuant this is enough to discard both isoforms.

However, for ENST00000477157, only a relatively small part of its terminal exon block is covered by reads polyA tail. Moreover, read's polyA site is located just 7 bp away from the polyA site of ENST00000477157, which is close enough to report at polyA site match.

  1. Thus, this situation is rather an effect of annotation having very similar polyA site coordinates. I understand that in case the distance is larger, there would be no questions about these strategy. However, is this kind of situations the exonic blocks are quite small and might not be enough to make judgement.

Hope my explanations are clear, do no hesitate to ask questions. I might also consider revising the strategy in such cases.

Best Andrey

andrewprzh avatar Feb 27 '25 00:02 andrewprzh

Thank you so much Andrey for the detailed explaination.

airichli avatar Feb 28 '25 17:02 airichli