Question about esm_embeddings
When I run this command:
python datasets/esm_embedding_preparation.py --protein_ligand_csv data/protein_ligand_example_csv.csv --out_file data/prepared_for_esm.fasta
I get this notice:
encountered unknown AA: TPO in the complex /raid/ligl/data/data/PDBBind_processed/4yo6/4yo6_protein_processed.pdb . Replacing it with a dash - .
encountered unknown AA: SEP in the complex /raid/ligl/data/data/PDBBind_processed/4yo6/4yo6_protein_processed.pdb . Replacing it with a dash - .
0%| | 39/16379 [00:03<20:34, 13.24it/s]encountered unknown AA: SEP in the complex /raid/ligl/data/data/PDBBind_processed/6esa/6esa_protein_processed.pdb . Replacing it with a dash - .
encountered unknown AA: SEP in the complex /raid/ligl/data/data/PDBBind_processed/6esa/6esa_protein_processed.pdb . Replacing it with a dash - .
encountered unknown AA: TPO in the complex /raid/ligl/data/data/PDBBind_processed/6esa/6esa_protein_processed.pdb . Replacing it with a dash - .
encountered unknown AA: SEP in the complex /raid/ligl/data/data/PDBBind_processed/6esa/6esa_protein_processed.pdb . Replacing it with a dash - .
0%|▏ | 79/16379 [00:07<30:56, 8.78it/s]encountered unknown AA: SEP in the complex /raid/ligl/data/data/PDBBind_processed/3f2a/3f2a_protein_processed.pdb . Replacing it with a dash - .
1%|▏ | 97/16379 [00:09<25:34, 10.61it/s]encountered unknown AA: PCA in the complex /raid/ligl/data/data/PDBBind_processed/5t1k/5t1k_protein_processed.pdb . Replacing it with a dash - .
1%|▏ | 99/16379 [00:09<24:53, 10.90it/s]encountered unknown AA: SEP in the complex
When I build the dataset with esm2_3billion_embeddings.pt, I get this notice:
loading complexes 9/17: 52%|████████▎ | 523/1000 [03:39<04:09, 1.91it/s]Skipping 2pcp because of the error:
Encountered valid chain id that was not present in the LM embeddings
loading complexes 9/17: 53%|████████▍ | 526/1000 [03:40<02:35, 3.04it/s]Skipping 4do4 because of the error:
Encountered valid chain id that was not present in the LM embeddings
loading complexes 9/17: 53%|████████▍ | 528/1000 [03:40<02:18, 3.40it/s]Skipping 1rri because of the error:
Encountered valid chain id that was not present in the LM embeddings
loading complexes 9/17: 53%|████████▍ | 530/1000 [03:41<03:11, 2.45it/s]Skipping 4k9g because of the error:
Encountered valid chain id that was not present in the LM embeddings
loading complexes 9/17: 54%|████████▌ | 539/1000 [03:44<02:45, 2.78it/s]Skipping 2qwd because of the error:
Encountered valid chain id that was not present in the LM embeddings
loading complexes 9/17: 54%|████████▋ | 540/1000 [03:45<02:50, 2.70it/s]Skipping 1hqh because of the error:
Encountered valid chain id that was not present in the LM embeddings
In both the training set and the test set, many items are skipped. Is this normal? How can I fix this error?
Same problem. Have you found a solution to this issue yet? 😞
Probably a typo of some sort. TPO and SEP are not standard amino acid codes, so they got skipped. Then, when the FASTA file is processed later, the discrepancy is found.
Same problem, it says it's skipping for every single complex.
Are you sure that you are using the same .fasta file for both steps? @JuLieAlgebra Would you be able to describe the specific setting and procedure you run for one of the proteins where the issue occurs?
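One way to check this is to compare the record IDs in the .fasta file with the keys stored in the embeddings. The sketch below assumes the embeddings are keyed by FASTA record ID, which is an assumption rather than something confirmed in this thread; the function names and example IDs are made up for illustration.

```python
def fasta_ids(fasta_text):
    """Collect record IDs: the first token after '>' on each header line."""
    ids = set()
    for line in fasta_text.splitlines():
        if line.startswith('>') and line[1:].strip():
            ids.add(line[1:].split()[0])
    return ids


def missing_ids(fasta_text, embedding_keys):
    """Return FASTA record IDs that have no matching embedding key."""
    return sorted(fasta_ids(fasta_text) - set(embedding_keys))


# Made-up example: two chains in the FASTA, only one embedded.
example_fasta = ">6esa_chain_0\nMKT\n>6esa_chain_1\nGSS\n"
example_keys = ['6esa_chain_0']
print(missing_ids(example_fasta, example_keys))
```

If the list is non-empty for a complex that gets skipped, the two steps were most likely run against different inputs.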
It seems that the error occurs when one tries to retrain the model with their own complexes whose structure names contain an '_' (underscore). In such a case, key_name = key.split('_')[0] assigns the wrong value to key_name.
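A minimal sketch of the failure mode described above, assuming the embedding keys have the form `<complex_name>_chain_<index>` (the key format, the helper names, and the example names are assumptions for illustration, not taken from the repository). Splitting on the first underscore truncates any complex name that itself contains an underscore, while stripping only the trailing `_chain_<index>` suffix keeps the full name.

```python
def name_from_first_split(key):
    # Behaviour described above: everything after the first '_' is dropped.
    return key.split('_')[0]


def name_from_suffix_split(key):
    # Hypothetical alternative: strip only the trailing '_chain_<index>' part.
    return key.rsplit('_chain_', 1)[0]


keys = ['6esa_chain_0', 'my_complex_chain_0']  # second name is made up
for key in keys:
    print(key, '->', name_from_first_split(key), '|', name_from_suffix_split(key))
```

Renaming custom complexes so that they contain no underscore would avoid the problem without touching the code.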
SEP is phosphorylated SER and TPO is phosphorylated THR. Can anyone explain how these kinds of amino acids are handled by DiffDock? Does DiffDock just skip them? What happens if SEP is renamed to SER? Will it be handled just like SER?
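To make the renaming question concrete, here is a sketch of what mapping modified residues to their parent residues could look like when building the sequence for the language model. It is not DiffDock's code: the mapping table, the function name, and the fallback to a dash are assumptions based only on the log messages and residue definitions in this thread.

```python
# Standard three-letter to one-letter amino acid codes.
THREE_TO_ONE = {
    'ALA': 'A', 'ARG': 'R', 'ASN': 'N', 'ASP': 'D', 'CYS': 'C',
    'GLN': 'Q', 'GLU': 'E', 'GLY': 'G', 'HIS': 'H', 'ILE': 'I',
    'LEU': 'L', 'LYS': 'K', 'MET': 'M', 'PHE': 'F', 'PRO': 'P',
    'SER': 'S', 'THR': 'T', 'TRP': 'W', 'TYR': 'Y', 'VAL': 'V',
}

# Assumed parent residues for the modified residues named in this thread.
MODIFIED_TO_PARENT = {'SEP': 'SER', 'TPO': 'THR'}


def residue_to_letter(resname, map_modified=True):
    """Return a one-letter code, or '-' for residues that remain unknown."""
    if map_modified:
        resname = MODIFIED_TO_PARENT.get(resname, resname)
    return THREE_TO_ONE.get(resname, '-')


residues = ['GLY', 'SEP', 'TPO', 'PCA']
print(''.join(residue_to_letter(r) for r in residues))
print(''.join(residue_to_letter(r, map_modified=False) for r in residues))
```

With such a mapping, SEP would be presented to the language model exactly as SER, so the phosphate group would be invisible to the embedding. Whether that is acceptable depends on the use case.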