Is it possible to get the importance score of a protein sequence?

Open anonimoustt opened this issue 2 years ago • 8 comments

I was wondering whether it is possible to get an importance score for a protein sequence using the EvoProtGrad model. For instance, the https://huggingface.co/datasets/waylandy/phosformer_curated dataset contains kinase enzymes, and I want to rank them by importance score.

Furthermore, I found in the Colab notebook (https://colab.research.google.com/drive/1e8WjYEbWiikRQg3g4YHQJJcpvTIWVAjp?usp=sharing) that scores are generated for different variants of a protein sequence. But what is the score of the original protein sequence? If the score of the original sequence can be measured, can it then be compared with the scores of the variants?

anonimoustt avatar Jan 28 '24 21:01 anonimoustt

Hi, sorry for the delay in getting back to you!

The score of the original protein sequence (i.e., the wild type sequence specified via the wt_fasta or wt_protein arguments of the DirectedEvolution sampler class) is stored in the wt_score attribute within each expert. Each expert uses this wt_score to compute the relative score of a variant with respect to the wild type.

As for getting importance scores for each variant, the DirectedEvolution sampler returns both the list of variants and their corresponding scores as a tuple. You can see this in the demo notebook: when the output argument is set to "all", the scores tensor has shape [parallel_chains, steps], and it's up to you to decide whether to grab the last score for each variant (scores[:,-1]), the best, etc.
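As a rough illustration of the last-versus-best choice, here is a minimal sketch that uses a plain nested list in place of the scores tensor. The numbers are invented and are not EvoProtGrad output; the list only assumes the [parallel_chains, steps] layout described above.

```python
# Toy stand-in for a scores tensor of shape [parallel_chains, steps].
# The values are invented for illustration, not EvoProtGrad output.
scores = [
    [0.1, 0.4, 0.3],   # chain 0, steps 0..2
    [-0.2, 0.0, 0.5],  # chain 1, steps 0..2
]

# Last score of each chain (what scores[:, -1] selects on a tensor).
last_scores = [chain[-1] for chain in scores]

# Best (highest) score of each chain, and the step at which it occurred.
best_scores = [max(chain) for chain in scores]
best_steps = [chain.index(max(chain)) for chain in scores]

print(last_scores)  # [0.3, 0.5]
print(best_scores)  # [0.4, 0.5]
print(best_steps)   # [1, 2]
```

Note that for chain 0 the last and best scores differ, which is why the choice is left to the user.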

pemami4911 avatar Feb 09 '24 20:02 pemami4911

It is not clear. Specifically, from the code:

    variants, scores = evo_prot_grad.DirectedEvolution(
        wt_protein = wildtype_sequence,
        output = 'best',          # return best, last, all variants
        experts = [expert],       # list of experts to compose
        parallel_chains = 2,      # number of parallel chains to run
        n_steps = 100,            # number of MCMC steps per chain
        max_mutations = -1,       # maximum number of mutations per variant
        preserved_regions = None, # List of regions (start,end) to preserve
        verbose = False           # print debug info to command line
    )()

    wtseq = ' '.join(wildtype_sequence.strip())

    for v, s in zip(variants, scores):
        evo_prot_grad.common.utils.print_variant_in_color(v, wtseq)
        print(s)

If I set output = 'all', then I will get the original sequence with its score along with the variants, right?

anonimoustt avatar Feb 09 '24 21:02 anonimoustt

No, scores will only contain a score for each variant, even if output is set to all. Here, all refers to returning the intermediate scores of the variants at each sampling step. In this example, scores would have shape [2,100] since parallel_chains = 2 and n_steps = 100. If having the wildtype sequence's score returned alongside the scores of each variant is useful, I can add that.

pemami4911 avatar Feb 10 '24 21:02 pemami4911

Hi, yes, it would be helpful if the score of the original sequence could be returned. I did not understand why scores would have shape [2,100]; I see each score as a single float. Does parallel_chains = 2 mean that the top two variants by score are returned? Would you please clarify?

Also, how is the score computed? Are you using embeddings? For example, do you compute ESM-2 embeddings of the original sequence and its variants, and then take the cosine similarity between them?

anonimoustt avatar Feb 11 '24 00:02 anonimoustt

I think it could help to spend a little time reading the documentation about what scores are in EvoProtGrad and how they are estimated: https://nrel.github.io/EvoProtGrad/getting_started/experts/#what-is-a-product-of-experts. The score in EvoProtGrad is an unnormalized log probability. However, in practice we subtract the wild type sequence's log prob from the variant's log prob, so the score is actually a difference between log probs.
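To make the "difference between log probs" concrete, here is a minimal sketch with invented numbers. The helper relative_score and the variant names are hypothetical and are not part of the EvoProtGrad API; they only mirror the subtraction described above.

```python
# Hypothetical unnormalized log probabilities; the numbers are invented.
wt_log_prob = -12.0
variant_log_probs = {"variant_a": -10.5, "variant_b": -13.0}


def relative_score(variant_log_prob: float, wt_log_prob: float) -> float:
    """Score of a variant relative to the wild type (difference of log probs)."""
    return variant_log_prob - wt_log_prob


scores = {
    name: relative_score(log_prob, wt_log_prob)
    for name, log_prob in variant_log_probs.items()
}

# The wild type scored against itself is 0 by construction.
wt_relative_score = relative_score(wt_log_prob, wt_log_prob)

print(scores)             # {'variant_a': 1.5, 'variant_b': -1.0}
print(wt_relative_score)  # 0.0
```

Under this convention, a positive score means the variant is assigned a higher log probability than the wild type, and a negative score means a lower one.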

The shape of the scores tensor will vary depending on what you set the output argument to. If output = "best" or output = "last", then for each of the parallel_chains Markov chains, the best or last variant (respectively) is returned. Hence, scores has shape [parallel_chains]. When output = "all", every variant produced by each Markov chain at each step 1..n_steps is returned, hence scores has shape [parallel_chains, n_steps]. This is useful when entire distributions of "good" variants are desired instead of just point estimates of "good" variants.
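One way to use the full history is to pool every chain and step and rank the unique variants. The sketch below uses invented sequences and scores, and assumes the variants come back in a nested layout matching the [parallel_chains, n_steps] shape described above; the actual axis order should be checked against the installed version.

```python
# Toy stand-ins for an output="all" result: one variant and one score
# per chain per step. Sequences and numbers are invented.
variants = [
    ["MKV", "MRV", "MRL"],  # chain 0
    ["MKV", "MKL", "MRL"],  # chain 1
]
scores = [
    [0.0, 0.8, 1.2],
    [0.0, 0.3, 1.2],
]

# Pool all (variant, score) pairs, keeping the best score seen per unique variant.
best_by_variant = {}
for chain_variants, chain_scores in zip(variants, scores):
    for variant, score in zip(chain_variants, chain_scores):
        if variant not in best_by_variant or score > best_by_variant[variant]:
            best_by_variant[variant] = score

# Rank unique variants from highest to lowest score.
ranked = sorted(best_by_variant.items(), key=lambda item: item[1], reverse=True)
print(ranked)  # [('MRL', 1.2), ('MRV', 0.8), ('MKL', 0.3), ('MKV', 0.0)]
```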

pemami4911 avatar Feb 12 '24 14:02 pemami4911

Thanks. EvoProtGrad is really interesting. I am working on kinase domain sequences (https://huggingface.co/datasets/waylandy/phosformer_curated/raw/main/curated/phosphosites_11mer_kinase_specific.tsv). EvoProtGrad might be an interesting tool for generating variants of a kinase sequence for analysis.

anonimoustt avatar Feb 13 '24 03:02 anonimoustt

Hi, one more query: can EvoProtGrad be used to detect a significant connection between two protein sequences? Say I have two sequences, protein 1 and protein 2. Using EvoProtGrad, I get the top 3 variants of protein 1 and the top 3 variants of protein 2. If I then compute similarity scores between the variants, is it possible to get the relational significance of protein 1 and protein 2?

anonimoustt avatar Feb 17 '24 15:02 anonimoustt

Hi,

I see that if parallel_chains = 5, I get 5 variants and their corresponding scores. Does a higher score mean the variant is closer to the original sequence?

anonimoustt avatar Feb 28 '24 20:02 anonimoustt

Accessing a particular expert's score for a variant sequence is now easier in v0.2 (https://github.com/NREL/EvoProtGrad/releases/tag/v0.2). You can now call get_model_output with an expert to get that expert's score (https://nrel.github.io/EvoProtGrad/api/experts/).

pemami4911 avatar Jul 10 '24 23:07 pemami4911