Machine learning run to determine epigenetic marks driving small gene sets
Using those gene sets identified in our last chat for time series (Apul)
https://sr320.github.io/tumbling-oysters/posts/41-Apul-GO/
Will do - will aim to complete this week.
I can also get started with this!
It's not a competition... however we do have a new sticker board! 💯 So it is kind of a competition.
@shedurkin if you could start with making a gene count matrix for the genes that have been selected that would be great.
Will do! In the meantime I've already done a trial run of your ML pipeline using miRNA as predictors and all genes as the response -- the results are pretty interesting! Model performance is very high for many of the gene PCs (each PC is essentially a group of coregulated genes), with some R^2 values close to 1.
Looking at the miRNA that contribute most to predicting some of these PCs, the pattern varies from PC to PC. Sometimes several miRNA have high importance (e.g., PC11), suggesting a more complicated interplay is influencing gene expression, while in other cases only one or two miRNA stand out (e.g., PC10, PC7).
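For anyone following along, here is a minimal sketch of the general idea described above: reduce gene expression to PCs, predict each PC from miRNA, then look at R^2 and per-miRNA importance. This is not the actual pipeline. The data are synthetic, ridge regression is a stand-in for whatever model the pipeline uses, and all function names here are hypothetical.

```python
import numpy as np

def zscore(a):
    """Center and scale each column; constant columns are left at zero."""
    sd = a.std(axis=0)
    sd[sd == 0] = 1.0
    return (a - a.mean(axis=0)) / sd

def gene_pcs(expr, n_pcs):
    """Return sample scores on the first n_pcs principal components."""
    centered = expr - expr.mean(axis=0)
    u, s, _ = np.linalg.svd(centered, full_matrices=False)
    return u[:, :n_pcs] * s[:n_pcs]

def ridge_fit(x, y, alpha=1.0):
    """Closed-form ridge coefficients (no intercept; inputs are centered)."""
    n_feat = x.shape[1]
    return np.linalg.solve(x.T @ x + alpha * np.eye(n_feat), x.T @ y)

def r_squared(y, y_hat):
    """Coefficient of determination, computed separately for each column."""
    ss_res = ((y - y_hat) ** 2).sum(axis=0)
    ss_tot = ((y - y.mean(axis=0)) ** 2).sum(axis=0)
    return 1.0 - ss_res / ss_tot

# Synthetic stand-in data (assumption): 40 samples, 6 miRNAs, 30 genes.
rng = np.random.default_rng(0)
mirna = rng.normal(size=(40, 6))
# Genes 0-14 follow miRNA 0 only; genes 15-29 follow miRNAs 1, 2 and 3 jointly.
loadings = np.zeros((6, 30))
loadings[0, :15] = 3.0
loadings[1:4, 15:] = 1.0
genes = mirna @ loadings + 0.1 * rng.normal(size=(40, 30))

x = zscore(mirna)
pcs = gene_pcs(genes, n_pcs=2)          # PC scores are already centered
coef = ridge_fit(x, pcs, alpha=1.0)
# In-sample R^2 only; a held-out split is needed to judge generalization.
r2 = r_squared(pcs, x @ coef)
importance = np.abs(coef)               # rows: miRNAs, columns: gene PCs

print("R^2 per PC:", np.round(r2, 3))
print("importance (miRNA x PC):")
print(np.round(importance, 2))
```

The toy data are built so that one gene group depends on a single miRNA and the other on three, which mirrors the "one or two stand out" versus "several matter" contrast between PCs.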
Ok, I've isolated a bunch of gene sets that may be of interest. For each physiological/seasonal trait (e.g., host biomass, respiration, temperature, timepoint), I took all of the modules that are significantly associated with that trait and
a) saved the functional annotations for all genes contained within those modules, and
b) saved a raw counts matrix for only the genes contained within those modules.
I also did the same for all genes that were annotated with at least one of the GO terms Steven provided above.
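A rough sketch of the subsetting step described above, in case it helps when reusing the gene sets. Everything here is hypothetical: the gene and module names, the p-values, and the 0.05 cutoff are made up for illustration, and the real code linked below may work differently.

```python
# Hypothetical inputs: all names and values are made up for illustration.
module_of_gene = {"gene1": "M1", "gene2": "M1", "gene3": "M2", "gene4": "M3"}

# p-values for module-trait association, keyed by (module, trait).
module_trait_p = {
    ("M1", "Host_AFDW"): 0.01,
    ("M2", "Host_AFDW"): 0.40,
    ("M3", "Host_AFDW"): 0.03,
    ("M1", "Am"): 0.20,
    ("M2", "Am"): 0.02,
    ("M3", "Am"): 0.60,
}

# Raw counts per gene across three samples.
raw_counts = {
    "gene1": [10, 12, 8],
    "gene2": [0, 3, 1],
    "gene3": [55, 60, 48],
    "gene4": [7, 7, 9],
}

def genes_for_trait(trait, alpha=0.05):
    """Genes in any module whose association with `trait` has p < alpha."""
    modules = {m for (m, t), p in module_trait_p.items()
               if t == trait and p < alpha}
    return sorted(g for g, m in module_of_gene.items() if m in modules)

def subset_counts(trait, alpha=0.05):
    """Raw counts restricted to the genes selected for `trait`."""
    return {g: raw_counts[g] for g in genes_for_trait(trait, alpha)}

for trait in ("Host_AFDW", "Am"):
    print(trait, subset_counts(trait))
```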
Code (the code for saving gene sets is at the very bottom) Output folder
Kathleen and I met last week and here are the next steps that Kathleen is working on:
- Prediction of expression in genes of interest (those that correlate with biomass) using miRNAs
- Prediction of expression of genes that have GO terms of interest (Steven's GO searches) using miRNAs
These will allow you to test whether miRNAs potentially regulate the expression of genes that relate to physiological outcomes.
Finished for the following gene sets:
- host biomass ("Host_AFDW")
- symbiont photosynthesis ("Am")
- List of GO terms provided above by @sr320 ("ATP_production_GO")
@shedurkin can you now add lncRNA and DNA methylation data to miRNA to predict expression of those gene sets?
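One possible way to set up the combined predictor matrix, sketched under assumptions: each data type is a samples x features array with its own sample list, blocks are aligned on shared samples, z-scored, and concatenated column-wise. The function name, sample IDs, and random data are all hypothetical, and this is not necessarily how the pipeline will handle it.

```python
import numpy as np

def zscore(a):
    """Center and scale each column; constant columns are left at zero."""
    sd = a.std(axis=0)
    sd[sd == 0] = 1.0
    return (a - a.mean(axis=0)) / sd

def combine_blocks(blocks, sample_order):
    """Align predictor blocks on shared samples and stack them column-wise.

    blocks: dict mapping block name -> (list of sample ids, 2D array).
    sample_order: preferred ordering of samples in the output.
    """
    # Keep only samples present in every block.
    shared = [s for s in sample_order
              if all(s in ids for ids, _ in blocks.values())]
    parts, labels = [], []
    for name, (ids, mat) in blocks.items():
        rows = [ids.index(s) for s in shared]
        part = zscore(np.asarray(mat, dtype=float)[rows])
        parts.append(part)
        labels += [f"{name}_{j}" for j in range(part.shape[1])]
    return shared, np.hstack(parts), labels

# Made-up data: note the differing sample order and the missing sample s3.
rng = np.random.default_rng(1)
blocks = {
    "miRNA": (["s1", "s2", "s3", "s4"], rng.normal(size=(4, 3))),
    "lncRNA": (["s2", "s1", "s4", "s3"], rng.normal(size=(4, 2))),
    "methylation": (["s1", "s2", "s4"], rng.uniform(size=(3, 5))),
}
samples, X, feature_names = combine_blocks(blocks, ["s1", "s2", "s3", "s4"])
print(samples, X.shape, feature_names[:4])
```

Scaling each block before stacking keeps one data type from dominating just because of its units, and the prefixed feature names make it possible to trace importance back to miRNA, lncRNA, or methylation.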
@shedurkin I think we can mark this as closed now, yes?