Inconsistency among *read_assignments.tsv, *.transcript_counts.tsv, and *.gene_counts.tsv
Hi,
Thanks for developing this nice tool. I met one issue that the count is inconsistent among files as list in the title. My command line is: ./IsoQuant-3.5.0/isoquant.py -d nanopore --stranded forward --fastq fastq1 fastq2 fastq3 --reference human_genome.fa --genedb human_genome.gtf --complete_genedb --output output --prefix human --labels a b c -t 10 --model_construction_strategy default_ont --clean_start --matching_strategy default --splice_correction_strategy default_ont --model_construction_strategy default_ont --transcript_quantification unique_only --gene_quantification unique_only
Take gene ENSG00000290425 (its isoforms include ENST00000652586 and ENST00000617243) as an example. I extract all the information related with ENSG00000290425 from read_assignments.txt file as below.
ENSG00000290425.read_assignments.txt
According to *.transcript_counts.tsv, the transcript counts for ENST00000652586 and ENST00000617243 are 36 and 4, respectively. According to *.gene_counts.tsv, the gene count for ENSG00000290425 is 146. However, I got different values when extract from the read_assighments.txt file as below: grep ENST00000652586 ENSG00000290425.read_assignments.txt|awk '$6~/uniq/{print}'|wc -l (this results in 19) grep ENST00000617243 ENSG00000290425.read_assignments.txt|awk '$6~/uniq/{print}'|wc -l (this results in 3) grep ENSG00000290425 ENSG00000290425.read_assignments.txt|awk '$6~/uniq/{print}'|wc -l (this results in 35)
Very confusing. Actually, I checked 1490 isoforms, of which 1279 isoforms have consistent count, 211 isoforms have inconsistent count.
It would be great if isoquant could provide the exact read ids that corresponds to *.transcript_counts.tsv and *.gene_counts.tsv.
Looking forward to hearing from you. Thanks.
Best, Aifu
Dear @airichli
This inconsistency is expected.
To add a +1 to an isoform abundance, IsoQuant requires a read to be assigned uniquely, meaning that it is consistent with the annotation and there is no ambiguity in the assignment. However, to add +1 to a gene count, a read can be assigned to 2 or more isoforms (of this gene) ambiguously. In other words, not all reads that are unambiguously assigned to a gene can be unambiguously assigned to a transcript. This explains inconsistency between gene counts and isoform counts.
Inconsistency between read assignments and isoforms counts are likely caused by the additional filtering imposed on counts. For example, to report a spliced isoform, it must have at least one spliced read assigned to it.
P.S. I suggest you to update you IsoQuant to the latest 3.6.3 version, there were some quantification improvements recently.
Best Andrey
Dear Andrey, thank you so much for your timely reply, I really appreciate it.
I would like to check: Assuming the *.transcript_counts.tsv, and *.gene_counts.tsv files (the quantification mode is unique-only, so only uniquely assigned reads are counted) are based on *read_assignments.tsv. The unique reads for ENST00000652586 in *read_assignments.tsv are only 19, but the count for ENST00000652586 in *.transcript_counts.tsv file is 36 (if there exists filtering, the number should less than 19). Does *.transcript_counts.tsv file includes non-unique reads as well?
My purpose is just want to know the 36 read IDs for ENST00000652586 in *.transcript_counts.tsv, it seems hard for me to deduce from *read_assignments.tsv. Is there any way to know the corresponding read IDs for the count in *.transcript_counts.tsv?
I will try the latest 3.6.3 version. Thanks.
Best, Aifu
Adding to this as this threw me off as well - how does this apply to novel genes identified by IsoQuant? They aren't present in the gene counts table, so I'm assuming the only way to obtain gene level counts would be to sum the total number of transcripts assigned to that gene? But I'm assuming that will come with the same caveat RE ambiguity.
Cheers, Moray
@airichli
I'm really sorry for the delay - somehow I missed a few replies.
Assuming the *.transcript_counts.tsv, and *.gene_counts.tsv files (the quantification mode is unique-only, so only uniquely assigned reads are counted) are based on *read_assignments.tsv. The unique reads for ENST00000652586 in *read_assignments.tsv are only 19, but the count for ENST00000652586 in *.transcript_counts.tsv file is 36 (if there exists filtering, the number should less than 19). Does *.transcript_counts.tsv file includes non-unique reads as well?
Reads that are marked as unique_minor_difference will also contribute to the counts. Other reads should no be considered in the default counting strategy.
@SwiftSeal - sorry for the delay as well.
Adding to this as this threw me off as well - how does this apply to novel genes identified by IsoQuant? They aren't present in the gene counts table, so I'm assuming the only way to obtain gene level counts would be to sum the total number of transcripts assigned to that gene? But I'm assuming that will come with the same caveat RE ambiguity.
Yes, you are right. Starting version 3.7 there is a separate gene counts file for novel genes as well.
Best Andrey