Non-canonical amino acid sequence seen

Open sungyounjoo opened this issue 1 year ago • 2 comments

I used conditional sequence generation via evodiff.ipynb with my own MSA file. However, the generated amino acid sequence contains non-canonical amino acid codes such as "Z" and "B".

I would like to ask whether this is expected, whether these codes mean something, or whether my MSA is the problem.

Thank you.


sungyounjoo avatar May 06 '24 17:05 sungyounjoo

Since our model is trained over additional amino acid codes (JOUBZX), it's possible to observe them at inference in your generations.

These are:

- U = selenocysteine
- O = pyrrolysine
- B = D or N
- J = I or L
- Z = E or Q
- X = unknown
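If it helps to see where these codes show up in a generated sequence, the mapping above can be used for a quick check. This is only a minimal sketch: `EXTRA_CODES` and `find_noncanonical` are hypothetical names made up for illustration and are not part of evodiff.

```python
# Hypothetical helper, not part of evodiff: flag the extra codes (JOUBZX)
# in a generated sequence, using the meanings listed above.
EXTRA_CODES = {
    "U": "selenocysteine",
    "O": "pyrrolysine",
    "B": "D or N",
    "J": "I or L",
    "Z": "E or Q",
    "X": "unknown",
}


def find_noncanonical(seq):
    """Return (position, code, meaning) for each extra code found in seq."""
    return [
        (i, aa, EXTRA_CODES[aa])
        for i, aa in enumerate(seq.upper())
        if aa in EXTRA_CODES
    ]


if __name__ == "__main__":
    for pos, code, meaning in find_noncanonical("MKZLB"):
        print(f"position {pos}: {code} ({meaning})")
```

Positions are 0-based. Gap characters and the 20 canonical codes are ignored.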

It could be due to many reasons; it's not clear that the MSA is the problem.

To prevent the model from predicting these amino acids at inference, you can change lines 257 and 259 of generate_msa.py: https://github.com/microsoft/evodiff/blob/683b08d208cb0df8910f2efbd4ed88c3d57eabf2/evodiff/generate_msa.py#L257

to:

p = preds[:, random_x, random_y, :20]

This will force the model to generate sequences using only the first 20 amino acids in MSA_ALPHABET: ACDEFGHIKLMNPQRSTVWY
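To illustrate why truncating the last dimension has this effect, here is a small pure-Python sketch of the same idea for a single position. It is not the evodiff code. The order of the codes after the first 20 in `ALPHABET`, the function name `sample_canonical`, and the assumption that the scores are unnormalized logits are all illustrative assumptions.

```python
import math
import random

# First 20 entries match the canonical order quoted above; the order of the
# remaining codes is an assumption made for this illustration only.
ALPHABET = "ACDEFGHIKLMNPQRSTVWY" + "BZXJOU"


def sample_canonical(logits, n_canonical=20, rng=random):
    """Sample one residue after dropping every score past the first 20."""
    kept = logits[:n_canonical]  # same idea as slicing with :20
    top = max(kept)
    # Softmax weights over the kept scores only (shifted for stability).
    weights = [math.exp(v - top) for v in kept]
    idx = rng.choices(range(n_canonical), weights=weights, k=1)[0]
    return ALPHABET[idx]


if __name__ == "__main__":
    scores = [0.0] * len(ALPHABET)
    scores[ALPHABET.index("Z")] = 50.0  # the model strongly prefers "Z"
    print(sample_canonical(scores))     # still one of the 20 canonical codes
```

Because the discarded scores never enter the sampling step, codes beyond the first 20 cannot be drawn, no matter how much weight the model put on them.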

sarahalamdari avatar Aug 08 '24 17:08 sarahalamdari