MolScribe

Options for Not Expanding Abbreviated Structures During Prediction

Open zuhalcakir opened this issue 1 year ago • 2 comments

For my specific use case, I am interested in predicting the graphs without expanding the abbreviated structures. I have been trying to understand the code and the sequence of operations it performs, but it’s not entirely clear to me whether this is possible with the current implementation.

Could you please clarify if there is an option or a straightforward way to modify the code to achieve this? Any guidance or suggestions would be greatly appreciated.

Thank you for your time and assistance.

zuhalcakir, Jun 27 '24 17:06

Hello @zuhalcakir,

Could you elaborate a bit on your goal?

In the current implementation, abbreviated groups are still stored in prediction['molfile'] and in the prediction's graph (i.e. prediction['atoms'] and prediction['bonds']). For example, Chem.MolToSmiles(Chem.MolFromMolBlock(pred[0]['molfile'])) returns a SMILES in which those groups are not expanded but appear as wildcard atoms (*).
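To check which atoms of a predicted molfile carry an abbreviation without going through RDKit, one option is to read the V2000 alias entries directly. The sketch below assumes the usual V2000 layout (three header lines, a counts line, then an "A  <atom number>" line followed by the alias text on the next line); `read_aliases` is a hypothetical helper name, not part of MolScribe.

```python
def read_aliases(molblock: str) -> dict[int, str]:
    """Map 1-based atom numbers to alias text found in a V2000 molblock."""
    aliases: dict[int, str] = {}
    lines = molblock.splitlines()
    i = 4  # skip the three header lines and the counts line
    while i < len(lines) - 1:
        parts = lines[i].split()
        # An alias entry is "A  <atom number>", with the alias on the next line.
        if len(parts) == 2 and parts[0] == "A" and parts[1].isdigit():
            aliases[int(parts[1])] = lines[i + 1].strip()
            i += 2
        else:
            i += 1
    return aliases


# Shortened, made-up two-atom molblock used only for illustration.
example_block = (
    "\n  RDKit          2D\n\n"
    "  2  1  0  0  0  0  0  0  0  0999 V2000\n"
    "    0.0000    0.0000    0.0000 C   0  0\n"
    "    1.5000    0.0000    0.0000 R   0  0\n"
    "  1  2  1  0\n"
    "A    2\n"
    "Me\n"
    "M  END\n"
)
print(read_aliases(example_block))
```

The atom numbers returned are 1-based, as in the molfile, so they are offset by one from the 0-based indices used in prediction['atoms'].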

Please tell me more if this is not what you are after; I may not have fully understood the goal.

Here is an example:

Image

For this molecule, the output is:

[{'smiles': 'Cc1cccnc1Sc1c(Br)cc(C(C)(C)C)cc1Br', 'molfile': '\n RDKit 2D\n\n 17 18 0 0 0 0 0 0 0 0999 V2000\n 15.8463 -2.6508 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0\n 15.8463 0.0476 0.0000 R 0 0 0 0 0 0 0 0 0 0 0 0\n 13.5414 -3.9206 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0\n 11.5246 -2.6508 0.0000 S 0 0 0 0 0 0 0 0 0 0 0 0\n 9.2197 -3.9206 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0\n 7.2029 -2.6508 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0\n 7.2029 0.0476 0.0000 Br 0 0 0 0 0 0 0 0 0 0 0 0\n 4.8980 -3.9206 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0\n 4.8980 -6.6190 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0\n 2.8812 -8.0476 0.0000 R 0 0 0 0 0 0 0 0 0 0 0 0\n 7.2029 -8.0476 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0\n 9.2197 -6.6190 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0\n 11.8127 -8.0476 0.0000 Br 0 0 0 0 0 0 0 0 0 0 0 0\n 13.5414 -6.6190 0.0000 N 0 0 0 0 0 0 0 0 0 0 0 0\n 15.8463 -8.0476 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0\n 17.8631 -6.6190 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0\n 17.8631 -3.9206 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0\n 1 2 1 0\n 1 3 2 0\n 1 17 1 0\n 3 4 1 0\n 3 14 1 0\n 4 5 1 0\n 5 6 2 0\n 5 12 1 0\n 6 7 1 0\n 6 8 1 0\n 8 9 2 0\n 9 10 1 0\n 9 11 1 0\n 11 12 2 0\n 12 13 1 0\n 14 15 2 0\n 15 16 1 0\n 16 17 2 0\nA 2\nMe\nA 10\nt-Bu\nM END\n', 'confidence': 0.8928430181866542, 'atoms': [{'atom_symbol': 'C', 'x': 0.873015873015873, 'y': 0.36507936507936506, 'confidence': 0.6537588238716125}, {'atom_symbol': '[Me]', 'x': 0.873015873015873, 'y': 0.09523809523809523, 'confidence': 0.9198883671557538}, {'atom_symbol': 'C', 'x': 0.746031746031746, 'y': 0.49206349206349204, 'confidence': 0.9090003967285156}, {'atom_symbol': 'S', 'x': 0.6349206349206349, 'y': 0.36507936507936506, 'confidence': 0.90590500831604}, {'atom_symbol': 'C', 'x': 0.5079365079365079, 'y': 0.49206349206349204, 'confidence': 0.9058762788772583}, {'atom_symbol': 'C', 'x': 0.3968253968253968, 'y': 0.36507936507936506, 'confidence': 0.8727304339408875}, {'atom_symbol': 'Br', 'x': 0.3968253968253968, 'y': 0.09523809523809523, 'confidence': 0.9228599590457234}, 
{'atom_symbol': 'C', 'x': 0.2698412698412698, 'y': 0.49206349206349204, 'confidence': 0.9096300005912781}, {'atom_symbol': 'C', 'x': 0.2698412698412698, 'y': 0.7619047619047619, 'confidence': 0.905170738697052}, {'atom_symbol': '[t-Bu]', 'x': 0.15873015873015872, 'y': 0.9047619047619048, 'confidence': 0.9341090351133937}, {'atom_symbol': 'C', 'x': 0.3968253968253968, 'y': 0.9047619047619048, 'confidence': 0.9104658365249634}, {'atom_symbol': 'C', 'x': 0.5079365079365079, 'y': 0.7619047619047619, 'confidence': 0.8941925168037415}, {'atom_symbol': 'Br', 'x': 0.6507936507936508, 'y': 0.9047619047619048, 'confidence': 0.9174843038480519}, {'atom_symbol': 'N', 'x': 0.746031746031746, 'y': 0.7619047619047619, 'confidence': 0.9128627777099609}, {'atom_symbol': 'C', 'x': 0.873015873015873, 'y': 0.9047619047619048, 'confidence': 0.9044473171234131}, {'atom_symbol': 'C', 'x': 0.9841269841269841, 'y': 0.7619047619047619, 'confidence': 0.9079452753067017}, {'atom_symbol': 'C', 'x': 0.9841269841269841, 'y': 0.49206349206349204, 'confidence': 0.9077780246734619}], 'bonds': [{'bond_type': 'single', 'endpoint_atoms': (0, 1), 'confidence': 1.0}, {'bond_type': 'single', 'endpoint_atoms': (0, 2), 'confidence': 1.0}, {'bond_type': 'double', 'endpoint_atoms': (0, 16), 'confidence': 0.9999987483024597}, {'bond_type': 'single', 'endpoint_atoms': (2, 3), 'confidence': 1.0}, {'bond_type': 'double', 'endpoint_atoms': (2, 13), 'confidence': 0.9999998807907104}, {'bond_type': 'single', 'endpoint_atoms': (3, 4), 'confidence': 1.0}, {'bond_type': 'single', 'endpoint_atoms': (4, 5), 'confidence': 1.0}, {'bond_type': 'double', 'endpoint_atoms': (4, 11), 'confidence': 0.9999994039535522}, {'bond_type': 'single', 'endpoint_atoms': (5, 6), 'confidence': 0.9815445840358734}, {'bond_type': 'double', 'endpoint_atoms': (5, 7), 'confidence': 0.9999999403953552}, {'bond_type': 'single', 'endpoint_atoms': (7, 8), 'confidence': 0.9999995827674866}, {'bond_type': 'single', 'endpoint_atoms': (8, 9), 
'confidence': 0.9999995827674866}, {'bond_type': 'double', 'endpoint_atoms': (8, 10), 'confidence': 0.9999983906745911}, {'bond_type': 'single', 'endpoint_atoms': (10, 11), 'confidence': 0.9999998807907104}, {'bond_type': 'single', 'endpoint_atoms': (11, 12), 'confidence': 0.9999109506607056}, {'bond_type': 'single', 'endpoint_atoms': (13, 14), 'confidence': 0.9999999403953552}, {'bond_type': 'double', 'endpoint_atoms': (14, 15), 'confidence': 0.9999999403953552}, {'bond_type': 'single', 'endpoint_atoms': (15, 16), 'confidence': 1.0}]}]

The abbreviated groups are still there: as R atoms with the aliases Me and t-Bu in the molfile, and as [Me] and [t-Bu] in the atoms list. And print(Chem.MolToSmiles(Chem.MolFromMolBlock(molscribe_pred[0]['molfile']))) gives: "*c1cc(Br)c(Sc2ncccc2*)c(Br)c1"
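If the goal is the unexpanded graph itself, the atoms and bonds lists can be used as they are. The following is a minimal sketch, assuming the dictionary layout shown in the output above ('atom_symbol', 'endpoint_atoms') and assuming that a bracketed symbol such as '[Me]' marks an abbreviation. I am not sure whether other bracketed symbols (for example charged atoms) can occur, so treat the results as candidates. `abbreviation_sites` is a hypothetical helper name.

```python
def abbreviation_sites(prediction: dict) -> list[dict]:
    """List atoms whose symbol is bracketed, together with their bonded neighbors."""
    atoms = prediction["atoms"]
    neighbors: dict[int, list[int]] = {i: [] for i in range(len(atoms))}
    for bond in prediction["bonds"]:
        a, b = bond["endpoint_atoms"]
        neighbors[a].append(b)
        neighbors[b].append(a)

    sites = []
    for index, atom in enumerate(atoms):
        symbol = atom["atom_symbol"]
        # Assumption: bracketed symbols are unexpanded abbreviations.
        if symbol.startswith("[") and symbol.endswith("]"):
            sites.append(
                {"index": index, "label": symbol[1:-1], "neighbors": neighbors[index]}
            )
    return sites


# Tiny hand-written prediction fragment, for illustration only.
toy_prediction = {
    "atoms": [{"atom_symbol": "C"}, {"atom_symbol": "[Me]"}, {"atom_symbol": "C"}],
    "bonds": [
        {"bond_type": "single", "endpoint_atoms": (0, 1)},
        {"bond_type": "single", "endpoint_atoms": (0, 2)},
    ],
}
print(abbreviation_sites(toy_prediction))
```

Applied to the output above, this should report Me at index 1 (bonded to atom 0) and t-Bu at index 9 (bonded to atom 8).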

UlrickFineddie, Oct 08 '25 15:10

See show_alais_in_smiles.ipynb: you can use this code to show atom aliases in the SMILES. It transforms "*c1cc(Br)c(Sc2ncccc2*)c(Br)c1" into "Brc1cc([t-Bu])cc(Br)c1Sc1ncccc1[Me]".
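For readers without access to the notebook, the idea of writing aliases into a wildcard SMILES can be illustrated with plain string handling. This is not the notebook's code, only a sketch: `fill_wildcards` is a hypothetical helper, and it assumes the caller already knows which alias belongs to each bare `*`, listed in the order the wildcards appear in the string. Working out that order from the molfile is not shown here.

```python
def fill_wildcards(smiles: str, labels: list[str]) -> str:
    """Replace each bare '*' in a SMILES string with '[label]', in order."""
    out = []
    remaining = iter(labels)
    in_bracket = False
    for ch in smiles:
        if ch == "[":
            in_bracket = True
        elif ch == "]":
            in_bracket = False
        # Only bare wildcards are replaced; a '*' inside [...] is left alone.
        if ch == "*" and not in_bracket:
            label = next(remaining, None)
            if label is None:
                raise ValueError("more wildcards than labels")
            out.append(f"[{label}]")
        else:
            out.append(ch)
    if next(remaining, None) is not None:
        raise ValueError("more labels than wildcards")
    return "".join(out)


print(fill_wildcards("*c1cc(Br)c(Sc2ncccc2*)c(Br)c1", ["t-Bu", "Me"]))
```

The result is a display string rather than valid SMILES, since labels like [t-Bu] are not atoms a SMILES parser will accept. The atom ordering also differs from the notebook's output quoted above, because no canonicalization is done.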

LingjieBao1998, Oct 09 '25 06:10