esm

Batch Support for Obtaining Residue Embeddings

Open Junseok0207 opened this issue 1 year ago • 3 comments

I am currently trying to obtain residue embeddings for protein sequences. The typical workflow involves the following steps:

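from esm.sdk.api import ESMProtein, SamplingConfig

# `client` is the ESM3 model or inference client; the same object encodes and samples
protein = ESMProtein(sequence=sequence)
protein_tensor = client.encode(protein)
config = SamplingConfig(return_per_residue_embeddings=True)
output = client.forward_and_sample(protein_tensor, config)
embeddings = output.per_residue_embedding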

However, I don't know how to get embeddings in batch mode. I checked the example in esm/examples/local_generate.py (lines 129-135), but it only shows the batch_generate function, which does not provide a way to obtain embeddings. How can I obtain embeddings for a batch of sequences?

Junseok0207 Aug 12 '24 06:08

Bumping this issue. I am also interested in learning whether the batching function for generating embeddings is ready yet and, if possible, in seeing a small example script showing a potential use case. In the meantime, could you theoretically loop through a list of FASTA files and generate embeddings one at a time, or is there a reason you would want to generate embeddings in batches?

winatony Aug 14 '24 21:08
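For the one-at-a-time route asked about above, the sequences first have to be pulled out of the FASTA files. A minimal sketch in plain Python, assuming standard FASTA formatting (a `>` header line followed by one or more sequence lines). `read_fasta` is a hypothetical helper written for illustration and is not part of the esm package:

```python
from typing import Dict, Iterable, Optional


def read_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Parse FASTA-formatted lines into {record_id: sequence}.

    Hypothetical helper for illustration; not part of the esm package.
    """
    records: Dict[str, str] = {}
    current_id: Optional[str] = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # Use the first whitespace-delimited token of the header as the id;
            # fall back to a positional name if the header is empty.
            parts = line[1:].split()
            current_id = parts[0] if parts else f"record_{len(records)}"
            records[current_id] = ""
        elif current_id is not None:
            # Sequence lines may wrap, so append to the current record.
            records[current_id] += line
    return records


example = """>seq1 example protein
MKTAYIAK
QRQISFVK
>seq2
GSHMLEDP
""".splitlines()

sequences = read_fasta(example)
```

An open file handle can be passed in the same way, since iterating over it yields lines: `with open(path) as handle: sequences = read_fasta(handle)`.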

We currently don't have support for this, though it shouldn't be too bad to implement. You can definitely just loop through and generate one at a time unless you're running into speed concerns.

ebetica Aug 27 '24 20:08
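The one-at-a-time approach suggested above can be sketched as a plain loop. The helper below is hypothetical and takes the embedding step as a callable, so the loop itself does not depend on the esm API. The commented-out `embed_with_esm` only wraps the calls from the original post, under the assumption that `client` is the same model or client object used there. `fake_embed` is a stand-in so the sketch runs without a model:

```python
from typing import Callable, Dict, List

Embedding = List[List[float]]  # one vector per residue


def embed_sequences(
    sequences: Dict[str, str],
    embed_one: Callable[[str], Embedding],
) -> Dict[str, Embedding]:
    """Embed each sequence independently and collect the results by name.

    Hypothetical helper for illustration; not part of the esm package.
    """
    results: Dict[str, Embedding] = {}
    for name, sequence in sequences.items():
        results[name] = embed_one(sequence)
    return results


# With the esm package, embed_one could wrap the calls from the original post
# (assumption: `client` is the same object used there):
#
#   def embed_with_esm(sequence):
#       protein_tensor = client.encode(ESMProtein(sequence=sequence))
#       config = SamplingConfig(return_per_residue_embeddings=True)
#       output = client.forward_and_sample(protein_tensor, config)
#       return output.per_residue_embedding


def fake_embed(sequence: str) -> Embedding:
    """Stand-in embedder: one 2-dimensional dummy vector per residue."""
    return [[float(i), float(ord(aa))] for i, aa in enumerate(sequence)]


embeddings = embed_sequences({"seq1": "MKT", "seq2": "GS"}, fake_embed)
```

Because each sequence is processed on its own, no padding or masking is needed; the trade-off is throughput, which only matters if speed becomes a concern.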

Hi @Junseok0207 @winatony @ebetica, my group made a wrapper for this that has full Hugging Face integration and batching: https://huggingface.co/Synthyra/ESMplusplus_small

lhallee Dec 06 '24 19:12