
Question about esm_embeddings

Open Alue111 opened this issue 3 years ago • 6 comments

When I run python datasets/esm_embedding_preparation.py --protein_ligand_csv data/protein_ligand_example_csv.csv --out_file data/prepared_for_esm.fasta, I get these messages:

encountered unknown AA:  TPO  in the complex  /raid/ligl/data/data/PDBBind_processed/4yo6/4yo6_protein_processed.pdb . Replacing it with a dash - .
encountered unknown AA:  SEP  in the complex  /raid/ligl/data/data/PDBBind_processed/4yo6/4yo6_protein_processed.pdb . Replacing it with a dash - .
  0%|                                        | 39/16379 [00:03<20:34, 13.24it/s]encountered unknown AA:  SEP  in the complex  /raid/ligl/data/data/PDBBind_processed/6esa/6esa_protein_processed.pdb . Replacing it with a dash - .
encountered unknown AA:  SEP  in the complex  /raid/ligl/data/data/PDBBind_processed/6esa/6esa_protein_processed.pdb . Replacing it with a dash - .
encountered unknown AA:  TPO  in the complex  /raid/ligl/data/data/PDBBind_processed/6esa/6esa_protein_processed.pdb . Replacing it with a dash - .
encountered unknown AA:  SEP  in the complex  /raid/ligl/data/data/PDBBind_processed/6esa/6esa_protein_processed.pdb . Replacing it with a dash - .
  0%|▏                                       | 79/16379 [00:07<30:56,  8.78it/s]encountered unknown AA:  SEP  in the complex  /raid/ligl/data/data/PDBBind_processed/3f2a/3f2a_protein_processed.pdb . Replacing it with a dash - .
  1%|▏                                       | 97/16379 [00:09<25:34, 10.61it/s]encountered unknown AA:  PCA  in the complex  /raid/ligl/data/data/PDBBind_processed/5t1k/5t1k_protein_processed.pdb . Replacing it with a dash - .
  1%|▏                                       | 99/16379 [00:09<24:53, 10.90it/s]encountered unknown AA:  SEP  in the complex  

When I then build the dataset with esm2_3billion_embeddings.pt, I get these messages:

loading complexes 9/17:  52%|████████▎       | 523/1000 [03:39<04:09,  1.91it/s]Skipping 2pcp because of the error:
Encountered valid chain id that was not present in the LM embeddings
loading complexes 9/17:  53%|████████▍       | 526/1000 [03:40<02:35,  3.04it/s]Skipping 4do4 because of the error:
Encountered valid chain id that was not present in the LM embeddings
loading complexes 9/17:  53%|████████▍       | 528/1000 [03:40<02:18,  3.40it/s]Skipping 1rri because of the error:
Encountered valid chain id that was not present in the LM embeddings
loading complexes 9/17:  53%|████████▍       | 530/1000 [03:41<03:11,  2.45it/s]Skipping 4k9g because of the error:
Encountered valid chain id that was not present in the LM embeddings
loading complexes 9/17:  54%|████████▌       | 539/1000 [03:44<02:45,  2.78it/s]Skipping 2qwd because of the error:
Encountered valid chain id that was not present in the LM embeddings
loading complexes 9/17:  54%|████████▋       | 540/1000 [03:45<02:50,  2.70it/s]Skipping 1hqh because of the error:
Encountered valid chain id that was not present in the LM embeddings

Many items are skipped in both the training set and the test set. Is this normal? How can I fix this error?

Alue111, Dec 29 '22 11:12

Same problem. Have you found a solution to this issue yet? 😞

gaylong9, Apr 06 '23 05:04

Probably a typo of some sort: TPO and SEP are not standard amino acid codes, so they were replaced with a dash. The discrepancy is then found later, when the FASTA file is processed.

RJ3, Apr 06 '23 11:04

Same problem: it reports skipping every single complex.

JuLieAlgebra, May 04 '23 22:05

Are you sure that you are using the same .fasta file for both steps? @JuLieAlgebra Could you describe the specific settings and the procedure you ran for one of the proteins where the issue occurs?
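
One way to check this is to compare the record ids in the .fasta file with the keys stored in the embeddings file. The sketch below is only an illustration: it assumes the embeddings file holds a dictionary keyed by FASTA record id, and the ids and sequences in the example are made up. The helper names fasta_ids and compare_ids are hypothetical and not part of DiffDock.

```python
def fasta_ids(fasta_text):
    """Return the set of record ids (first token after '>') in FASTA text."""
    ids = set()
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            header = line[1:].strip()
            if header:
                ids.add(header.split()[0])
    return ids


def compare_ids(fasta_text, embedding_keys):
    """Report ids that appear in only one of the two sources."""
    in_fasta = fasta_ids(fasta_text)
    in_embeddings = set(embedding_keys)
    return {
        "missing_embeddings": sorted(in_fasta - in_embeddings),
        "unexpected_embeddings": sorted(in_embeddings - in_fasta),
    }


# Made-up FASTA content; the '<name>_chain_<index>' id format is an assumption.
example_fasta = """>2pcp_chain_0
MKV-LLA
>2pcp_chain_1
GSHM
>4do4_chain_0
AAAK
"""

# Made-up keys, standing in for the keys of the loaded embeddings dictionary.
example_keys = ["2pcp_chain_0", "4do4_chain_0", "1rri_chain_0"]

report = compare_ids(example_fasta, example_keys)
print(report)
```

If either list in the report is non-empty, the two steps were probably not run on the same .fasta file, or the ids were altered somewhere in between.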

HannesStark, May 12 '23 17:05

It seems that the error occurs when one tries to retrain the model with one's own complexes whose structure names contain an '_' (underscore). In such a case, key_name = key.split('_')[0] assigns the wrong value to key_name.
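
A small sketch of the effect, assuming the embedding keys have the form <structure name>_chain_<index> (this key format and both helper names are assumptions made for illustration, not code taken from DiffDock):

```python
def complex_name_naive(key):
    """Mimic key.split('_')[0]: keep only the text before the first underscore."""
    return key.split("_")[0]


def complex_name_suffix(key, marker="_chain_"):
    """Strip only the trailing '<marker><index>' part, keeping underscores in the name."""
    name, sep, _ = key.rpartition(marker)
    return name if sep else key


# Hypothetical keys: a PDB-style name and a custom name containing an underscore.
keys = ["4yo6_chain_0", "my_protein_chain_0"]

naive = [complex_name_naive(k) for k in keys]
fixed = [complex_name_suffix(k) for k in keys]

print(naive)  # ['4yo6', 'my']
print(fixed)  # ['4yo6', 'my_protein']
```

With the naive split, my_protein is looked up as my, so no embedding is found for it. Renaming the structures so that they contain no underscore avoids the problem without touching the code.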

JacekKedzierski, Feb 05 '24 09:02

SEP is phosphorylated SER and TPO is phosphorylated THR. Can anyone explain how such modified amino acids are handled by DiffDock? Does DiffDock just skip them? What happens if SEP is renamed to SER? Will it then be handled just like SER?
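
For the sequence side of the question only, here is a sketch of what mapping a modified residue to its parent would change. It is not DiffDock's code: the function to_sequence and the PARENT table are hypothetical, and the dash fallback simply mirrors the "Replacing it with a dash" message in the log above. It says nothing about how the extra phosphate atoms in the structure are treated.

```python
# Three-letter to one-letter codes for the 20 standard amino acids.
STANDARD = {
    "ALA": "A", "ARG": "R", "ASN": "N", "ASP": "D", "CYS": "C",
    "GLN": "Q", "GLU": "E", "GLY": "G", "HIS": "H", "ILE": "I",
    "LEU": "L", "LYS": "K", "MET": "M", "PHE": "F", "PRO": "P",
    "SER": "S", "THR": "T", "TRP": "W", "TYR": "Y", "VAL": "V",
}

# Hypothetical table of modified residues and their parent residues.
PARENT = {"SEP": "SER", "TPO": "THR"}


def to_sequence(resnames, use_parent=False):
    """Build a one-letter sequence; unknown residue names become '-'."""
    letters = []
    for resname in resnames:
        if use_parent:
            resname = PARENT.get(resname, resname)
        letters.append(STANDARD.get(resname, "-"))
    return "".join(letters)


residues = ["GLY", "SEP", "TPO", "PCA", "LYS"]

print(to_sequence(residues))                   # G---K
print(to_sequence(residues, use_parent=True))  # GST-K
```

With the parent mapping, the language model would see S and T at those positions instead of a dash; PCA stays a dash here because the example table does not list it.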

kirmedvedev, Jul 31 '24 19:07