
Any limit on the input protein length for ESM C?

Open jdcc2098 opened this issue 1 year ago • 1 comments

Thank you for your great work.

I am currently using the ESM C model to generate protein embeddings.

Is there a maximum sequence length that the model can handle?

Thank you for your assistance!

jdcc2098 avatar Jan 06 '25 14:01 jdcc2098

According to the blog post, it looks like 2048:

Training stages. ESM C is trained in two stages: Stage 1: For the first 1 million steps, the model uses a context length of 512, with metagenomic data constituting 64% of the training dataset. Stage 2: In the final 500,000 steps, the context length is increased to 2048, and the proportion of metagenomic data is reduced to 37.5%.

KPHippe avatar Jan 17 '25 21:01 KPHippe

2048 is correct. The model's accuracy probably drops dramatically as you increase the length past that.

ebetica avatar Sep 19 '25 20:09 ebetica
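
For anyone embedding proteins longer than the limit discussed above, one common workaround is to split the sequence into overlapping windows and embed each window separately. The following is a minimal sketch in plain Python, not part of the ESM library. It assumes that the 2048 context counts tokens and that two of them are taken by start and end tokens (an assumption, not confirmed in this thread); `split_sequence`, `MAX_RESIDUES`, and the overlap of 256 are hypothetical names and values chosen for illustration.

```python
from typing import Iterator, Tuple

MAX_CONTEXT = 2048   # context length from the Stage 2 training description quoted above
SPECIAL_TOKENS = 2   # assumption: one start and one end token are added per sequence
MAX_RESIDUES = MAX_CONTEXT - SPECIAL_TOKENS


def split_sequence(
    seq: str, max_len: int = MAX_RESIDUES, overlap: int = 256
) -> Iterator[Tuple[int, str]]:
    """Yield (start_offset, window) pairs, each window at most max_len residues."""
    if not 0 <= overlap < max_len:
        raise ValueError("overlap must satisfy 0 <= overlap < max_len")
    # Short sequences fit in a single window and are returned unchanged.
    if len(seq) <= max_len:
        yield 0, seq
        return
    step = max_len - overlap
    start = 0
    while True:
        yield start, seq[start:start + max_len]
        # Stop once the current window reaches the end of the sequence.
        if start + max_len >= len(seq):
            break
        start += step


if __name__ == "__main__":
    long_protein = "A" * 5000  # placeholder sequence for demonstration only
    for offset, window in split_sequence(long_protein):
        print(offset, len(window))
```

Each window can then be passed to the model on its own. How to merge the per-window embeddings (for example, averaging the overlapping positions) is a separate design choice that this sketch leaves open.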