Low fragment number in several 5degC samples
I'm analyzing the RNA-seq data that I aligned with STAR/featureCounts, and I notice that several samples from the 5C treatment (blue) have many fewer fragments aligned/assigned to genes. @shedurkin do you see that in your analysis?
Hmm, yes I'm seeing the same thing, and when I look back at my MultiQC I see the same pattern in sequence counts there.
Plot of total reads
MultiQC plot of sequence counts (5C treatment includes samples 117-156)
But I don't see any patterns in alignment rates when using either kallisto or HISAT2/featureCounts.
kallisto, percent aligned:
HISAT2/featureCounts (exons), percent assigned:
HISAT2/featureCounts (genes), percent assigned:
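For anyone reproducing this check: the percent-assigned numbers above come from featureCounts `.summary` files, which list one count per assignment category (with "Assigned" as one row). A minimal sketch of computing percent assigned from such a file (the function name `pct_assigned` is mine, not from this thread):

```shell
# Sketch: compute percent of fragments assigned from a featureCounts
# .summary file. Assumes the standard format: a header line, then
# "Category<TAB>count" rows, one of which is "Assigned".
pct_assigned() {
  awk 'NR > 1 { total += $2; if ($1 == "Assigned") assigned = $2 }
       END { printf "%.2f\n", 100 * assigned / total }' "$1"
}
```

Running this per sample makes it easy to confirm that assignment *rates* are flat across treatments even when absolute counts differ.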
It looks like the issue is with the raw data, likely at the sequencing step. It's odd that the 5C treatment has several samples with lower read counts while others are on the higher end. Here's a table of the raw counts per sample, with 5C rows highlighted in green: raw read count info. I indicated which samples had quality concerns at RNA isolation, and which ones I tossed out as PCA outliers. Here's the MultiQC report of the raw data: raw-data-multiqc.
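For reference, raw read counts per sample can be tallied straight from the gzipped FastQs (4 lines per record); this is a generic sketch, not the exact command used for the table above:

```shell
# Sketch: count records in a gzipped FastQ. Assumes standard 4-line
# records with no wrapped sequence/quality lines.
count_reads() {
  echo $(( $(zcat "$1" | wc -l) / 4 ))
}

# Example usage over a directory of samples:
# for fq in *.fastq.gz; do printf '%s\t%s\n' "$fq" "$(count_reads "$fq")"; done
```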
@kubu4 @sr320
The plot thickens... There are also some alignment read depth differences among treatments in the low coverage Whole Genome Sequencing data, particularly in the 5C treatment.
@kubu4 Sam, did you end up reaching out to the RNA-seq sequencing facility about the odd read depth in many samples from the 5C treatment? What was the minimum number of sequences they were supposed to provide? Several samples have <10M. Feel free to connect me with them and I can do the questioning; it'd be great to get some answers from them!
Here's the no. of raw reads per sample:
The quote for the project was ~350M paired-end reads per lane (i.e. targeting ~20M per sample).
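Given the quoted target of ~20M reads per sample, a quick filter flags the shortfalls. A minimal sketch assuming a two-column TSV of sample ID and read count (the file layout and function name `below_target` are my assumptions):

```shell
# Sketch: print sample IDs whose raw read count (column 2 of a
# tab-separated "sample<TAB>count" file) falls under the ~20M target.
below_target() {
  awk -F'\t' '$2 < 20000000 { print $1 }' "$1"
}
```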
I'll contact them regarding the disparity in output across samples.
Where are we with this?
Azenta is currently looking into this issue.
I've pinged them weekly since my last post above and finally got a response today, with an apology and a statement that they'd look into where things stand.
Thanks for staying on top of this!
They've finally responded with some info.
They've offered to "top off" the samples with < 20M reads with more reads to reach that minimum. I can't see a downside to this, but figured I'd throw it out there for people to consider before I responded.
Also, a bit surprising (concerning?): they load samples for sequencing using volumes instead of library concentrations?!
Here's their response:
Hi Sam,
Thank you for your message and apologies for our delay here.
One of the reasons that might have led to the uneven distribution of reads among the libraries is the inconsistent input amounts of RNA among the samples. Since the input material for the library prep for sample 149 was loaded based on volume, as opposed to concentration, this can affect the efficiency of enzymatic reactions during prep, such as fragmentation, end-repair, and adapter ligation. Thus, even with the same preparation among all the samples, these small differences can affect cluster generation efficiency on the sequencing flow cell and ultimately lead to variations in read counts.
However, I understand that the variations you are seeing here are not ideal. I am happy to top-off the samples that are below ~20M reads to provide you the minimum of quoted reads across all your samples. Would this be something of interest to you? This of course, would be done at no additional cost to you.
Yes, that IS weird that they aren't considering concentration when prepping for sequencing. Getting more data for those samples with <20M reads would be great. Yes please!
Any estimate on when they can provide the additional data?
This is actually on my to-do list for today.
Data was supposed to be done early last week, but I haven't heard anything since.
Alrighty, got the supplemental data on Friday afternoon. Download finished this weekend and I've verified MD5 checksums.
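The checksum verification was the standard `md5sum -c` workflow; a small sketch (assuming the facility provides a checksum list in standard `md5sum` format, i.e. `hash  filename` lines; the wrapper name `verify_md5` is mine):

```shell
# Sketch: verify downloaded files against a provided MD5 checksum list.
verify_md5() {
  if md5sum -c --quiet "$1"; then
    echo "All checksums OK"
  else
    echo "Checksum mismatch detected" >&2
    return 1
  fi
}
```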
Data is here:
https://owl.fish.washington.edu/nightingales/G_macrocephalus/30-943133806/supplemental/
At the moment, I've kept the original data and this supplemental sequencing separate. However, it will likely greatly simplify downstream work if I just concatenate them so we have one set of FastQs per sample.
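Concatenating works directly on the gzipped files, since back-to-back gzip streams form a valid gzip file. A sketch of the per-sample merge; the directory names (`original/`, `supplemental/`, `combined/`) and the `_R1`/`_R2` naming are assumptions, not the actual layout on owl:

```shell
# Sketch: merge original and supplemental gzipped FastQs into one file
# per sample/read pair. Concatenated gzip streams decompress as one file,
# so no unzip/rezip round trip is needed.
combine_fastqs() {
  sample="$1"
  mkdir -p combined
  for r in 1 2; do
    cat "original/${sample}_R${r}.fastq.gz" \
        "supplemental/${sample}_R${r}.fastq.gz" \
        > "combined/${sample}_R${r}.fastq.gz"
  done
}
```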
Notebook is here:
https://robertslab.github.io/sams-notebook/posts/2024/2024-09-09-Data-Received---G.macrocephalus-Supplemental-RNA-seq-Data-from-Azenta-Project-30-943133806/