SeqDesign
Protein design and variant prediction using autoregressive generative models
SeqDesign is a generative, unsupervised model for biological sequences. It learns functional constraints from unaligned sequences in order to predict the effects of mutations and generate novel sequences, including insertions and deletions. For more information, see the bioRxiv preprint.
This version of the codebase is compatible with Python 3 and TensorFlow 1.
For the Python 2.7 version used in the preprint, see the v2 branch.
A PyTorch version is available here.
Installation
See INSTALL.md.
Examples
See the examples directory for examples of training, mutation effect prediction, and generation.
Usage
Training
Given a FASTA file of training sequences, run:
run_autoregressive_fr --dataset <your_dataset>.fa
Sequences are weighted uniformly by default. To set sequence weights, append a colon followed by a weight to each FASTA header, e.g. :1.0.
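As an illustration of the header convention, here is a minimal sketch that appends a weight to every header in a FASTA string. The helper add_weights, the sequence names, and the weights are hypothetical and not part of SeqDesign; the only thing taken from the description above is the :<weight> suffix.

```python
def add_weights(fasta_text, weights, default=1.0):
    """Return FASTA text with ':<weight>' appended to each header line.

    `weights` maps a header (without the leading '>') to a float.
    Headers missing from the mapping receive `default`.
    This is a hypothetical helper, not part of the SeqDesign codebase.
    """
    out = []
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            name = line[1:].strip()
            weight = weights.get(name, default)
            out.append(f">{name}:{weight}")
        else:
            # Sequence lines are passed through unchanged.
            out.append(line)
    return "\n".join(out) + "\n"


# Made-up example sequences and weights.
example = ">seqA\nMKTAYIAK\n>seqB\nMKTAYLAK\n"
weighted = add_weights(example, {"seqA": 0.5})
print(weighted, end="")
```

The weighted text can then be written to a file and passed to run_autoregressive_fr with --dataset.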
Mutation effect prediction
Deterministic:
calc_logprobs_seqs_fr --sess <your_sess> --dropout-p 1.0 --num-samples 1 --input <input>.fa --output <output>.csv
Average of 500 samples:
calc_logprobs_seqs_fr --sess <your_sess> --dropout-p 0.5 --num-samples 500 --input <input>.fa --output <output>.csv
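When scoring many input files, it can help to build these command lines programmatically. The sketch below assembles the two invocations shown above using only the flags that appear in them. The helper logprob_command and the file names are hypothetical, and the sketch assumes calc_logprobs_seqs_fr is on your PATH.

```python
import shlex


def logprob_command(sess, input_fa, output_csv, num_samples=1):
    """Build the argument list for calc_logprobs_seqs_fr.

    Mirrors the two invocations in this README: a single sample uses
    --dropout-p 1.0 (deterministic); more than one sample uses
    --dropout-p 0.5. Hypothetical helper, not part of SeqDesign.
    """
    dropout_p = "1.0" if num_samples == 1 else "0.5"
    return [
        "calc_logprobs_seqs_fr",
        "--sess", sess,
        "--dropout-p", dropout_p,
        "--num-samples", str(num_samples),
        "--input", input_fa,
        "--output", output_csv,
    ]


# Placeholder session and file names.
cmd = logprob_command("my_sess", "variants.fa", "scores.csv", num_samples=500)
print(" ".join(shlex.quote(arg) for arg in cmd))
# To execute: import subprocess; subprocess.run(cmd, check=True)
```

Passing the arguments as a list avoids shell quoting problems if a path contains spaces.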
Sequence generation
generate_sample_seqs_fr --sess <your_sess>
Run each script with the -h argument to see additional arguments.
Data availability
See the examples directory to download training sequences, mutation effect predictions, and generated sequences.