
Several problems with ESM-C

Open j3rk0 opened this issue 1 year ago • 2 comments

Hello everybody! Today I'm trying to test ESM-C, but I'm having a hard time due to several problems:

  • Tokenizer initialization needs the ESM3 license agreement; solved by logging in to the Hugging Face Hub (see the login sketch after this list).
  • Encoding sequences fails because the mask_token attribute of ESMC.tokenizer is None:
    sequence = sequence.replace(C.MASK_STR_SHORT, sequence_tokenizer.mask_token)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: replace() argument 2 must be str, not None

Solved by calling the tokenizer manually:

# client = ESMC.from_pretrained("esmc_300m").to("cuda"), torch already imported
seq = 'AAAAAAAAAA'
res = client.tokenizer(seq, add_special_tokens=True)
ids = torch.tensor(res['input_ids'], dtype=torch.int64).to('cuda')
  • Then I called the forward method of the ESMC class, passing the ids tensor, but I got a dimension-mismatch error inside the rotary embedding:
esm/layers/rotary.py:54, in apply_rotary_emb_torch(x, cos, sin, interleaved, _inplace)
     50 cos = repeat(cos, "s d -> s 1 (2 d)")
     51 sin = repeat(sin, "s d -> s 1 (2 d)")
     52 return torch.cat(
     53     [
---> 54         x[..., :ro_dim] * cos + rotate_half(x[..., :ro_dim], interleaved) * sin,
     55         x[..., ro_dim:],
     56     ],
     57     dim=-1,
     58 )

RuntimeError: The size of tensor a (12) must match the size of tensor b (15) at non-singleton dimension 0

but I wasn't able to fix this.
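
Going back to the first bullet, here is a minimal sketch of the Hugging Face login step (it assumes you have already accepted the ESM3 license agreement on the Hub and created an access token; huggingface-cli login from a shell works as well):

# Minimal login sketch; assumes the ESM3 license was already accepted on the Hub.
from huggingface_hub import login

login()  # prompts for the access token; login(token="hf_...") also works non-interactively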

Also, I would like to know whether you plan an integration with the Transformers library to enable easier fine-tuning of the model.

j3rk0 avatar Dec 06 '24 16:12 j3rk0

Hi @j3rk0, my group made a wrapper for this that has full Huggingface integration and batching :) https://huggingface.co/Synthyra/ESMplusplus_small
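
A minimal loading sketch for that wrapper through the transformers library, assuming the usual trust_remote_code pattern (the tokenizer attribute below is an assumption, check the model card for the exact interface):

# Sketch of loading the Synthyra wrapper via transformers; the custom model code
# on the Hub requires trust_remote_code=True.
from transformers import AutoModelForMaskedLM

model = AutoModelForMaskedLM.from_pretrained("Synthyra/ESMplusplus_small", trust_remote_code=True)
tokenizer = model.tokenizer  # assumption: the wrapper exposes its tokenizer as an attribute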

lhallee avatar Dec 06 '24 19:12 lhallee

Nice work!!

j3rk0 avatar Dec 06 '24 19:12 j3rk0

Can you try updating to v3.1.1? These problems should be fixed. I'm not sure about the rotary embedding issue; I have not seen it before. If you can give me a reproduction, I can try looking into it.

ebetica avatar Dec 09 '24 23:12 ebetica

Updating to 3.1.1 solved all the previously cited issues; the model works with both the tutorial and the 'manual' code. However, I noticed further issues. First, I compared the output of the model using the tutorial code snippet to the 'manual' one:

import torch
from esm.models.esmc import ESMC
from esm.sdk.api import ESMProtein, LogitsConfig

protein = ESMProtein(sequence="AAAAA")
client = ESMC.from_pretrained("esmc_300m").to("cuda")  # or "cpu"
client.eval()
with torch.no_grad():
    protein_tensor = client.encode(protein)
    logits_output = client.logits(
        protein_tensor, LogitsConfig(sequence=True, return_embeddings=True)
    )
    logit_sdk = torch.softmax(logits_output.logits.sequence, dim=-1)

seq = ['AAAAA']
with torch.no_grad():
    tok = client.tokenizer(seq, add_special_tokens=True, padding=True)
    ids = torch.tensor(tok['input_ids'], dtype=torch.int64).to('cuda')
    logit_manual = torch.softmax(client(ids).sequence_logits, dim=-1)

logit_sdk.isclose(logit_manual).all()

output: True. Then I tried to check whether the results stay the same when using multiple sequences of different lengths:

seq = ['AAAAA', 'AAAAAAAAA']
with torch.no_grad():
    tok = client.tokenizer(seq, add_special_tokens=True, padding=True)
    ids = torch.tensor(tok['input_ids'], dtype=torch.int64).to('cuda')
    logits_withpad = torch.softmax(client(ids).sequence_logits[:1, :7, :], dim=-1)

logit_sdk.isclose(logits_withpad).all()

output: False. So I suspected some issue with the attention mask and passed it manually:

seq = ['AAAAA', 'AAAAAAAAA']
with torch.no_grad():
    tok = client.tokenizer(seq, add_special_tokens=True, padding=True)
    ids = torch.tensor(tok['input_ids'], dtype=torch.int64).to('cuda')
    amask = torch.tensor(tok['attention_mask'], dtype=torch.bool).to('cuda')
    logits_withpad_withamask = torch.softmax(client(ids, amask).sequence_logits[:1, :7, :], dim=-1)

logit_sdk.isclose(logits_withpad_withamask).all()

output: False. Then I tried negating the padding mask:

seq = ['AAAAA', 'AAAAAAAAA']
with torch.no_grad():
    tok = client.tokenizer(seq, add_special_tokens=True, padding=True)
    ids = torch.tensor(tok['input_ids'], dtype=torch.int64).to('cuda')
    amask = torch.tensor(tok['attention_mask'], dtype=torch.bool).to('cuda')
    logits_withpad_negaamask = torch.softmax(client(ids, ~amask).sequence_logits[:1, :7, :], dim=-1)

logit_sdk.isclose(logits_withpad_negaamask).all()

output: False. And if I execute the following:

logits_withpad_withamask.isclose(logits_withpad_negaamask).all()

output: True

so basically the attention mask is not doing anything at all. Finally, I computed the logits without padding, using both the attention mask (all True) and the negated attention mask (all False):

seq = ['AAAAA']
with torch.no_grad():
    tok = client.tokenizer(seq, add_special_tokens=True, padding=True)
    ids = torch.tensor(tok['input_ids'], dtype=torch.int64).to('cuda')
    amask = torch.tensor(tok['attention_mask'], dtype=torch.bool).to('cuda')
    logit_manual_withamask = torch.softmax(client(ids, amask).sequence_logits, dim=-1)

with torch.no_grad():
    tok = client.tokenizer(seq, add_special_tokens=True, padding=True)
    ids = torch.tensor(tok['input_ids'], dtype=torch.int64).to('cuda')
    amask = torch.tensor(tok['attention_mask'], dtype=torch.bool).to('cuda')
    logit_manual_negaamask = torch.softmax(client(ids, ~amask).sequence_logits, dim=-1)

and then I ran the following checks:

logit_sdk.isclose(logit_manual_withamask).all()

output: True

logit_sdk.isclose(logit_manual_negaamask).all()

output: True

Is this the desired behavior? Am I doing anything wrong?
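
One way to separate normal numerical noise from a real masking bug is to look at the magnitude of the difference instead of the strict isclose() defaults. A small sketch on top of the tensors computed above (the 1e-4 tolerance is a rough assumption, not an official threshold):

# Compare padded vs. unpadded probabilities by magnitude rather than exact isclose().
diff = (logit_sdk - logits_withpad).abs()
print("max abs diff  :", diff.max().item())
print("mean abs diff :", diff.mean().item())
print("allclose @ 1e-4:", torch.allclose(logit_sdk, logits_withpad, atol=1e-4))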

EDIT: Looking deeper into the code, I noticed that you build the attention mask with sequence_id.unsqueeze(-1) == sequence_id.unsqueeze(-2), so passing the mask or its negation gives the same result. I'm still trying to understand whether the numerical differences with and without padding are normal numerical fluctuations or unwanted behavior. Just one more question and then I'll stop bothering you :satisfied: :satisfied: I see the model uses the ESM3 tokenizer, which provides the '|' token for multimers. Is this also usable in ESM-C, or are multimers out-of-distribution (as in ESM2)?
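
On the mask observation: a standalone sketch (plain PyTorch, not the actual ESMC code path) of why the mask and its negation end up identical under that construction:

import torch

# The pairwise mask only records whether two positions share the same sequence_id
# value, so flipping every boolean flips both sides and the comparison is unchanged.
sequence_id = torch.tensor([True, True, True, False, False])  # True = token, False = pad
pairwise = sequence_id.unsqueeze(-1) == sequence_id.unsqueeze(-2)
pairwise_neg = (~sequence_id).unsqueeze(-1) == (~sequence_id).unsqueeze(-2)
assert torch.equal(pairwise, pairwise_neg)  # so client(ids, amask) == client(ids, ~amask)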

EDIT 2: The first test was done with Python 3.8 on a T4 GPU with CUDA 12.2 and driver 535.183.01. I then repeated it on several other setups:

  • Kaggle notebook, Python 3.10, P100, CUDA 12.6, driver 560.35.03: padded and unpadded sequences give the same result.
  • Kaggle notebook, Python 3.10, T4, CUDA 12.6, driver 560.35.03: padded and unpadded sequences give the same result.
  • A40, Python 3.10, CUDA 12.2, driver 535.183.01: the padded result differs from the unpadded one.
  • T4, Python 3.10, CUDA 12.2, driver 535.183.01: padded and unpadded results differ.

Running the model on CPU also removes the padding discrepancy.

So proper execution of ESM-C probably requires a CUDA version newer than 12.2 or an NVIDIA driver newer than 535.
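
For anyone hitting the same discrepancy, a quick way to record the environment alongside the comparison (plain PyTorch plus nvidia-smi for the driver version):

import subprocess
import torch

# Log the software stack when reporting padded-vs-unpadded mismatches.
print("torch        :", torch.__version__)
print("CUDA (torch) :", torch.version.cuda)  # CUDA version torch was built against
print("GPU          :", torch.cuda.get_device_name(0))
print("driver       :", subprocess.run(
    ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"],
    capture_output=True, text=True).stdout.strip())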

j3rk0 avatar Dec 10 '24 16:12 j3rk0