Any limit on the input protein length for ESM C?

Open jdcc2098 opened this issue 1 year ago • 1 comments

Thank you for your great work.

I am currently using the ESM C model to generate protein embeddings.

Is there a maximum sequence length that the model can handle?

Thank you for your assistance!

Jan 06 '25 14:01 jdcc2098

According to the blog post, it looks like 2048:

Training stages. ESM C is trained in two stages:

Stage 1: For the first 1 million steps, the model uses a context length of 512, with metagenomic data constituting 64% of the training dataset.

Stage 2: In the final 500,000 steps, the context length is increased to 2048, and the proportion of metagenomic data is reduced to 37.5%.

Jan 17 '25 21:01 KPHippe

2048 is correct. Accuracy probably drops dramatically as you increase the length past that.

Sep 19 '25 20:09 ebetica
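
For proteins longer than the 2048 context length mentioned above, one common workaround is to split the sequence into overlapping windows and embed each window separately. The following is a minimal sketch of that splitting step only; it does not call the ESM library. The helper name `split_sequence`, the 256-residue overlap, and the reservation of two positions for start/end tokens are all assumptions for illustration, not something confirmed in this thread.

```python
from typing import List, Tuple

MAX_CONTEXT = 2048    # context length quoted from the blog post in this thread
SPECIAL_TOKENS = 2    # assumption: one start and one end token are added per sequence


def split_sequence(
    seq: str,
    max_residues: int = MAX_CONTEXT - SPECIAL_TOKENS,
    overlap: int = 256,  # hypothetical default; tune for your use case
) -> List[Tuple[int, str]]:
    """Split `seq` into overlapping windows of at most `max_residues` residues.

    Returns a list of (start_offset, window) pairs so that per-residue
    embeddings can later be mapped back to positions in the full sequence.
    """
    if max_residues <= 0:
        raise ValueError("max_residues must be positive")
    if not 0 <= overlap < max_residues:
        raise ValueError("overlap must satisfy 0 <= overlap < max_residues")

    # Short sequences fit in a single window and need no splitting.
    if len(seq) <= max_residues:
        return [(0, seq)]

    step = max_residues - overlap
    windows: List[Tuple[int, str]] = []
    start = 0
    while True:
        end = start + max_residues
        windows.append((start, seq[start:end]))
        if end >= len(seq):
            break  # the last window reaches the end of the sequence
        start += step
    return windows
```

Each window can then be passed to the model on its own. How to combine the embeddings in the overlapping regions (for example, by averaging them) is a design choice that this thread does not address, and whether windowed embeddings match full-length quality is untested here.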