DiffDock Question about esm

When I run the command python datasets/esm_embedding_preparation.py --protein_ligand_csv data/protein_ligand_example_csv.csv --out_file data/prepared_for_esm.fasta, I get these messages:

encountered unknown AA:  TPO  in the complex  /raid/ligl/data/data/PDBBind_processed/4yo6/4yo6_protein_processed.pdb . Replacing it with a dash - .
encountered unknown AA:  SEP  in the complex  /raid/ligl/data/data/PDBBind_processed/4yo6/4yo6_protein_processed.pdb . Replacing it with a dash - .
  0%|                                        | 39/16379 [00:03<20:34, 13.24it/s]encountered unknown AA:  SEP  in the complex  /raid/ligl/data/data/PDBBind_processed/6esa/6esa_protein_processed.pdb . Replacing it with a dash - .
encountered unknown AA:  SEP  in the complex  /raid/ligl/data/data/PDBBind_processed/6esa/6esa_protein_processed.pdb . Replacing it with a dash - .
encountered unknown AA:  TPO  in the complex  /raid/ligl/data/data/PDBBind_processed/6esa/6esa_protein_processed.pdb . Replacing it with a dash - .
encountered unknown AA:  SEP  in the complex  /raid/ligl/data/data/PDBBind_processed/6esa/6esa_protein_processed.pdb . Replacing it with a dash - .
  0%|▏                                       | 79/16379 [00:07<30:56,  8.78it/s]encountered unknown AA:  SEP  in the complex  /raid/ligl/data/data/PDBBind_processed/3f2a/3f2a_protein_processed.pdb . Replacing it with a dash - .
  1%|▏                                       | 97/16379 [00:09<25:34, 10.61it/s]encountered unknown AA:  PCA  in the complex  /raid/ligl/data/data/PDBBind_processed/5t1k/5t1k_protein_processed.pdb . Replacing it with a dash - .
  1%|▏                                       | 99/16379 [00:09<24:53, 10.90it/s]encountered unknown AA:  SEP  in the complex

When I build the dataset with esm2_3billion_embeddings.pt, I get these messages:

loading complexes 9/17:  52%|████████▎       | 523/1000 [03:39<04:09,  1.91it/s]Skipping 2pcp because of the error:
Encountered valid chain id that was not present in the LM embeddings
loading complexes 9/17:  53%|████████▍       | 526/1000 [03:40<02:35,  3.04it/s]Skipping 4do4 because of the error:
Encountered valid chain id that was not present in the LM embeddings
loading complexes 9/17:  53%|████████▍       | 528/1000 [03:40<02:18,  3.40it/s]Skipping 1rri because of the error:
Encountered valid chain id that was not present in the LM embeddings
loading complexes 9/17:  53%|████████▍       | 530/1000 [03:41<03:11,  2.45it/s]Skipping 4k9g because of the error:
Encountered valid chain id that was not present in the LM embeddings
loading complexes 9/17:  54%|████████▌       | 539/1000 [03:44<02:45,  2.78it/s]Skipping 2qwd because of the error:
Encountered valid chain id that was not present in the LM embeddings
loading complexes 9/17:  54%|████████▋       | 540/1000 [03:45<02:50,  2.70it/s]Skipping 1hqh because of the error:
Encountered valid chain id that was not present in the LM embeddings

In both the training set and the test set, many items are skipped. Is this normal? How can I fix this error?

Dec 29 '22 11:12 Alue111

Same problem. Have you found a solution to this issue yet? 😞

Apr 06 '23 05:04 gaylong9

Probably a typo of some sort. TPO and SEP are not standard amino acid codes, so they got skipped. Then, when the FASTA file is processed later, the discrepancy is found.

Apr 06 '23 11:04 RJ3

Same problem, it says it's skipping for every single complex.

May 04 '23 22:05 JuLieAlgebra

Are you sure that you are using the same .fasta file for both steps? @JuLieAlgebra Would you be able to describe the specific setting and procedure you run for one of the proteins where the issue occurs?

May 12 '23 17:05 HannesStark

It seems that the error occurs when one tries to retrain the model with one's own complexes containing an '_' (underscore) in the name of the structure. In such a case, key_name = key.split('_')[0] assigns the wrong value to key_name.
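
To make the underscore problem concrete, here is a minimal sketch. It assumes the embedding keys have the form <complex_name>_chain_<index>; that key format, the example names, and both helper functions are assumptions made for illustration, not code taken from DiffDock. Only the split('_')[0] expression comes from the comment above.

```python
def name_via_first_split(key: str) -> str:
    # Expression quoted in the comment above: keep the text before the FIRST underscore.
    return key.split('_')[0]


def name_via_suffix_strip(key: str, marker: str = '_chain_') -> str:
    # Hypothetical alternative: cut at the LAST occurrence of the assumed
    # '_chain_' marker, so underscores inside the complex name survive.
    return key.rsplit(marker, 1)[0]


# Example keys are made up; the '_chain_<index>' suffix is an assumption.
for key in ['4yo6_chain_0', 'my_complex_chain_1']:
    print(key, '->', name_via_first_split(key), '|', name_via_suffix_strip(key))
```

With a name such as my_complex, the first helper yields my, which would not match any complex name, so no embedding would be found for that complex. Renaming the structures so they contain no underscores may be a simpler workaround than changing the code.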

Feb 05 '24 09:02 JacekKedzierski

SEP is phosphorylated SER and TPO is phosphorylated THR. Can anyone explain how these kinds of amino acids are handled by DiffDock? Does DiffDock just skip them? What will happen if SEP is renamed to SER? Will it be handled just like SER?
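
As a rough way to explore the renaming question, the sketch below shows what happens to the sequence string in each case. It is not DiffDock's code: the tables, the function names, and the map_modified switch are hypothetical, and the only behaviour taken from this thread is the log message saying an unknown residue is replaced with a dash. It says nothing about how the structure side treats the phosphate atoms.

```python
# The 20 standard amino acids, three-letter code to one-letter code.
THREE_TO_ONE = {
    'ALA': 'A', 'ARG': 'R', 'ASN': 'N', 'ASP': 'D', 'CYS': 'C',
    'GLN': 'Q', 'GLU': 'E', 'GLY': 'G', 'HIS': 'H', 'ILE': 'I',
    'LEU': 'L', 'LYS': 'K', 'MET': 'M', 'PHE': 'F', 'PRO': 'P',
    'SER': 'S', 'THR': 'T', 'TRP': 'W', 'TYR': 'Y', 'VAL': 'V',
}

# Modified residues named in this thread and their unmodified parents.
MODIFIED_TO_PARENT = {'SEP': 'SER', 'TPO': 'THR'}


def residue_letter(resname: str, map_modified: bool = False) -> str:
    """Return a one-letter code, or '-' for a residue name that is not recognized."""
    name = resname.strip().upper()
    if map_modified:
        # Hypothetical option: treat a modified residue as its parent.
        name = MODIFIED_TO_PARENT.get(name, name)
    return THREE_TO_ONE.get(name, '-')


def sequence_from_resnames(resnames, map_modified: bool = False) -> str:
    return ''.join(residue_letter(r, map_modified) for r in resnames)


example = ['GLY', 'SEP', 'TPO', 'ALA']
print(sequence_from_resnames(example))                      # dashes for SEP and TPO
print(sequence_from_resnames(example, map_modified=True))   # parents substituted
```

In this sketch the renamed residues become indistinguishable from plain SER and THR in the sequence, so a language model would see them as ordinary residues. Whether that is acceptable for a given complex is a modeling decision that this thread does not settle.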

Jul 31 '24 19:07 kirmedvedev
