
Use Selene to predict variant effects genome-wide

Open andrewSharo opened this issue 3 years ago • 0 comments

Dear Selene Developers,

Thanks for creating Selene. I'm finding it super helpful for getting into CNNs. I have trained a model with 36 features using Selene, and the model is saved in training_outputs/best_model.pth.tar. I would like to predict the effect of every possible mutation in the hg38 genome, which is equivalent to genome-wide in silico saturation mutagenesis. Is there an easy way to do this with Selene? I know this may seem unwise and will take up a ton of storage, but I would like to do it if it's possible.

Right now I'm running the code below, but instead of predicting every variant in each chromosome, it predicts variants for just 1,000 bases in each chromosome. I tried setting sequence_length to a larger value (300 million), but that led to huge memory usage, so I stopped the program. Is there an easier way to do this? My code is below:

import torch
from selene_sdk.utils import DeeperDeepSEA
from selene_sdk.utils import NonStrandSpecific

model_architecture = NonStrandSpecific(DeeperDeepSEA(1000, 36))

from selene_sdk.predict import AnalyzeSequences
from selene_sdk.utils import load_features_list

features = load_features_list("distinct_features.txt")
analysis = AnalyzeSequences(
    model_architecture,
    "training_outputs/best_model.pth.tar",
    sequence_length=300000000,  # originally 1000
    features=features,
    use_cuda=False)
analysis.in_silico_mutagenesis_from_file(
    "hg38.fa",
    save_data=["abs_diffs", "logits", "predictions"],
    output_dir="predictionsOnHg38/",
    use_sequence_name=True)
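To make the scale of the request concrete, here is a rough sketch of what covering a chromosome with model-sized windows and storing every score might involve. It does not call Selene at all. The helper names `tile_windows` and `estimate_output_bytes` are hypothetical, and the figures assume roughly 3.1 billion bases in hg38, 3 alternative alleles per position, 36 features, three saved output types, and 4-byte values stored densely. Selene's actual on-disk format may differ.

```python
from typing import Iterator, Tuple


def tile_windows(chrom_length: int, window: int = 1000) -> Iterator[Tuple[int, int]]:
    """Yield half-open (start, end) windows that cover [0, chrom_length).

    Hypothetical helper: windows are non-overlapping, except that a final
    window is anchored at the chromosome end so trailing bases are covered.
    Chromosomes shorter than one window yield nothing.
    """
    if window <= 0:
        raise ValueError("window must be positive")
    if chrom_length < window:
        return
    for start in range(0, chrom_length - window + 1, window):
        yield start, start + window
    if chrom_length % window:
        # Last window overlaps the previous one to reach the final base.
        yield chrom_length - window, chrom_length


def estimate_output_bytes(
    n_bases: int,
    n_features: int = 36,     # from the model described above
    n_alt: int = 3,           # three possible substitutions per position
    bytes_per_value: int = 4, # assumption: float32, no compression
    n_outputs: int = 3,       # abs_diffs, logits, predictions
) -> int:
    """Raw size of one score per (position, alt allele, feature, output type)."""
    return n_bases * n_alt * n_features * bytes_per_value * n_outputs


if __name__ == "__main__":
    print(list(tile_windows(2500, 1000)))
    approx_hg38_bases = 3_100_000_000  # assumption: approximate assembly size
    print(estimate_output_bytes(approx_hg38_bases) / 1e12, "TB (raw values only)")
```

Each window could then be written out as its own sequence and scored at the model's native length of 1000 instead of raising sequence_length. One trade-off of this tiling is that most variants are scored away from the window center, which may or may not be acceptable for a given model.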

andrewSharo • Aug 04 '22 20:08