
Sampling conditional token distribution

Open aiXander opened this issue 3 years ago • 1 comments

It would be super valuable to have an example script to sample conditional token probabilities for a target index given sequence context.

There seem to be some technical details that are important, but not easy to figure out:

Finally, the way I'm currently evaluating mutations is by sequentially computing the sequence likelihood of each possible mutated sequence, which takes 20 forward passes per single point mutation. I think this is vastly inefficient, since the model produces logits for every position. Can the logits at the target index simply be used as a proxy for the token probabilities?
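To make the question concrete, here is a minimal sketch of reading a conditional distribution for one target index out of a single forward pass. It assumes the usual causal-LM convention that the logits emitted at position i score the token at position i+1; whether that holds for this model, and whether a start token shifts the indices, is an assumption to check. `toy_logits` and `conditional_at` are hypothetical names and do not come from the repository.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def conditional_at(logits_per_position, target_index):
    """Distribution over tokens at target_index given tokens 0..target_index-1.

    Assumption: logits_per_position[i] are the logits for the token at i + 1.
    """
    if target_index < 1:
        raise ValueError("position 0 has no left context in this convention")
    return softmax(logits_per_position[target_index - 1])

# Toy stand-in for model output: 3 positions, vocabulary of 4 tokens.
toy_logits = [
    [0.0, 0.0, 0.0, 0.0],
    [2.0, 1.0, 0.0, -1.0],
    [0.5, 0.5, 0.5, 0.5],
]
probs = conditional_at(toy_logits, 2)  # uses the logits emitted at position 1
```

This replaces the 20 forward passes with one, but the resulting distribution conditions only on the residues before the target index, not on the ones after it.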

aiXander, Jul 12 '22 18:07

a couple notes:

  • we excluded the non-amino-acid tokens when scoring sequences, as they aren't relevant for variant prediction. it doesn't have a large effect, however
  • the LMs released currently are traditional autoregressive decoders, either left-to-right or right-to-left. there are ways to perform inpainting, but they would require restructuring/retraining
  • you can use the logits (or the averaged logits from the N->C and C->N directions) for a target index, but i've never validated this myself. it would be approximate and would not fully use the remaining context of the protein, which may be critical
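The first and third notes can be sketched together: average the two directions' logits for the same residue, drop the non-amino-acid tokens, and renormalize. This is only an illustration of the unvalidated approximation described above. The vocabulary layout (`VOCAB`, three special tokens followed by the 20 amino acids) and all function names are hypothetical, and it assumes the right-to-left logits have already been re-indexed so that both vectors refer to the same residue.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# Hypothetical vocabulary: three special tokens, then the 20 amino acids.
VOCAB = ["<pad>", "<bos>", "<eos>"] + list(AMINO_ACIDS)
AA_IDS = [VOCAB.index(a) for a in AMINO_ACIDS]

def log_softmax_over(logits, keep_ids):
    """Log-probabilities renormalized over keep_ids only."""
    kept = [logits[i] for i in keep_ids]
    m = max(kept)
    log_z = m + math.log(sum(math.exp(x - m) for x in kept))
    return {i: logits[i] - log_z for i in keep_ids}

def averaged_log_probs(fwd_logits, rev_logits, keep_ids=AA_IDS):
    """Average N->C and C->N logits for one residue, then renormalize."""
    avg = [(f + r) / 2.0 for f, r in zip(fwd_logits, rev_logits)]
    return log_softmax_over(avg, keep_ids)

def mutation_score(log_probs, wildtype, mutant):
    """Log-odds of the mutant residue relative to the wildtype residue."""
    return log_probs[VOCAB.index(mutant)] - log_probs[VOCAB.index(wildtype)]

# Toy logits for a single target index (not real model output).
fwd = [0.0] * len(VOCAB)
rev = [0.0] * len(VOCAB)
fwd[VOCAB.index("A")] = 2.0
rev[VOCAB.index("A")] = 1.0

log_probs = averaged_log_probs(fwd, rev)
score = mutation_score(log_probs, wildtype="G", mutant="A")
```

Because each direction sees only one side of the target index, the averaged score remains an approximation rather than a true conditional on the full sequence context.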

a-mad, Jul 20 '22 21:07