MolScribe Options for Not Expanding Abbreviated Structures During Prediction

For my specific use case, I am interested in predicting the graphs without expanding the abbreviated structures. I have been trying to understand the code and the sequence of operations it performs, but it’s not entirely clear to me whether this is possible with the current implementation.

Could you please clarify if there is an option or a straightforward way to modify the code to achieve this? Any guidance or suggestions would be greatly appreciated.

Thank you for your time and assistance.

Jun 27 '24 17:06 zuhalcakir

Hello @zuhalcakir,

Can you elaborate a bit on your goal?

In the current implementation, abbreviated groups are still stored in prediction['molfile'] and in the prediction's graph (i.e. prediction['atoms'] and prediction['bonds']). For example, if you run Chem.MolToSmiles(Chem.MolFromMolBlock(pred[0]['molfile'])), you get a SMILES string in which those groups are not expanded; each one appears as a wildcard (*).

But can you elaborate more? Maybe I didn't fully understand the goal.

Here is an example:

For this molecule, the output is:

[{'smiles': 'Cc1cccnc1Sc1c(Br)cc(C(C)(C)C)cc1Br', 'molfile': '\n RDKit 2D\n\n 17 18 0 0 0 0 0 0 0 0999 V2000\n 15.8463 -2.6508 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0\n 15.8463 0.0476 0.0000 R 0 0 0 0 0 0 0 0 0 0 0 0\n 13.5414 -3.9206 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0\n 11.5246 -2.6508 0.0000 S 0 0 0 0 0 0 0 0 0 0 0 0\n 9.2197 -3.9206 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0\n 7.2029 -2.6508 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0\n 7.2029 0.0476 0.0000 Br 0 0 0 0 0 0 0 0 0 0 0 0\n 4.8980 -3.9206 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0\n 4.8980 -6.6190 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0\n 2.8812 -8.0476 0.0000 R 0 0 0 0 0 0 0 0 0 0 0 0\n 7.2029 -8.0476 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0\n 9.2197 -6.6190 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0\n 11.8127 -8.0476 0.0000 Br 0 0 0 0 0 0 0 0 0 0 0 0\n 13.5414 -6.6190 0.0000 N 0 0 0 0 0 0 0 0 0 0 0 0\n 15.8463 -8.0476 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0\n 17.8631 -6.6190 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0\n 17.8631 -3.9206 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0\n 1 2 1 0\n 1 3 2 0\n 1 17 1 0\n 3 4 1 0\n 3 14 1 0\n 4 5 1 0\n 5 6 2 0\n 5 12 1 0\n 6 7 1 0\n 6 8 1 0\n 8 9 2 0\n 9 10 1 0\n 9 11 1 0\n 11 12 2 0\n 12 13 1 0\n 14 15 2 0\n 15 16 1 0\n 16 17 2 0\nA 2\nMe\nA 10\nt-Bu\nM END\n', 'confidence': 0.8928430181866542, 'atoms': [{'atom_symbol': 'C', 'x': 0.873015873015873, 'y': 0.36507936507936506, 'confidence': 0.6537588238716125}, {'atom_symbol': '[Me]', 'x': 0.873015873015873, 'y': 0.09523809523809523, 'confidence': 0.9198883671557538}, {'atom_symbol': 'C', 'x': 0.746031746031746, 'y': 0.49206349206349204, 'confidence': 0.9090003967285156}, {'atom_symbol': 'S', 'x': 0.6349206349206349, 'y': 0.36507936507936506, 'confidence': 0.90590500831604}, {'atom_symbol': 'C', 'x': 0.5079365079365079, 'y': 0.49206349206349204, 'confidence': 0.9058762788772583}, {'atom_symbol': 'C', 'x': 0.3968253968253968, 'y': 0.36507936507936506, 'confidence': 0.8727304339408875}, {'atom_symbol': 'Br', 'x': 0.3968253968253968, 'y': 0.09523809523809523, 'confidence': 0.9228599590457234}, 
{'atom_symbol': 'C', 'x': 0.2698412698412698, 'y': 0.49206349206349204, 'confidence': 0.9096300005912781}, {'atom_symbol': 'C', 'x': 0.2698412698412698, 'y': 0.7619047619047619, 'confidence': 0.905170738697052}, {'atom_symbol': '[t-Bu]', 'x': 0.15873015873015872, 'y': 0.9047619047619048, 'confidence': 0.9341090351133937}, {'atom_symbol': 'C', 'x': 0.3968253968253968, 'y': 0.9047619047619048, 'confidence': 0.9104658365249634}, {'atom_symbol': 'C', 'x': 0.5079365079365079, 'y': 0.7619047619047619, 'confidence': 0.8941925168037415}, {'atom_symbol': 'Br', 'x': 0.6507936507936508, 'y': 0.9047619047619048, 'confidence': 0.9174843038480519}, {'atom_symbol': 'N', 'x': 0.746031746031746, 'y': 0.7619047619047619, 'confidence': 0.9128627777099609}, {'atom_symbol': 'C', 'x': 0.873015873015873, 'y': 0.9047619047619048, 'confidence': 0.9044473171234131}, {'atom_symbol': 'C', 'x': 0.9841269841269841, 'y': 0.7619047619047619, 'confidence': 0.9079452753067017}, {'atom_symbol': 'C', 'x': 0.9841269841269841, 'y': 0.49206349206349204, 'confidence': 0.9077780246734619}], 'bonds': [{'bond_type': 'single', 'endpoint_atoms': (0, 1), 'confidence': 1.0}, {'bond_type': 'single', 'endpoint_atoms': (0, 2), 'confidence': 1.0}, {'bond_type': 'double', 'endpoint_atoms': (0, 16), 'confidence': 0.9999987483024597}, {'bond_type': 'single', 'endpoint_atoms': (2, 3), 'confidence': 1.0}, {'bond_type': 'double', 'endpoint_atoms': (2, 13), 'confidence': 0.9999998807907104}, {'bond_type': 'single', 'endpoint_atoms': (3, 4), 'confidence': 1.0}, {'bond_type': 'single', 'endpoint_atoms': (4, 5), 'confidence': 1.0}, {'bond_type': 'double', 'endpoint_atoms': (4, 11), 'confidence': 0.9999994039535522}, {'bond_type': 'single', 'endpoint_atoms': (5, 6), 'confidence': 0.9815445840358734}, {'bond_type': 'double', 'endpoint_atoms': (5, 7), 'confidence': 0.9999999403953552}, {'bond_type': 'single', 'endpoint_atoms': (7, 8), 'confidence': 0.9999995827674866}, {'bond_type': 'single', 'endpoint_atoms': (8, 9), 
'confidence': 0.9999995827674866}, {'bond_type': 'double', 'endpoint_atoms': (8, 10), 'confidence': 0.9999983906745911}, {'bond_type': 'single', 'endpoint_atoms': (10, 11), 'confidence': 0.9999998807907104}, {'bond_type': 'single', 'endpoint_atoms': (11, 12), 'confidence': 0.9999109506607056}, {'bond_type': 'single', 'endpoint_atoms': (13, 14), 'confidence': 0.9999999403953552}, {'bond_type': 'double', 'endpoint_atoms': (14, 15), 'confidence': 0.9999999403953552}, {'bond_type': 'single', 'endpoint_atoms': (15, 16), 'confidence': 1.0}]}]

We see that the abbreviated groups are still there, and print(Chem.MolToSmiles(Chem.MolFromMolBlock(molscribe_pred[0]['molfile']))) gives: "*c1cc(Br)c(Sc2ncccc2*)c(Br)c1"
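If the goal is simply the unexpanded graph, it can probably be read straight from the 'atoms' and 'bonds' entries. Here is a minimal sketch, assuming only the key names visible in the output above; the helper name unexpanded_graph and the toy prediction are made up for illustration and are not part of MolScribe:

```python
def unexpanded_graph(prediction):
    """Return (symbols, adjacency) for one prediction dict.

    Abbreviations such as '[Me]' or '[t-Bu]' stay as single nodes,
    exactly as they appear in prediction['atoms'].
    """
    symbols = [atom["atom_symbol"] for atom in prediction["atoms"]]
    adjacency = {i: [] for i in range(len(symbols))}
    for bond in prediction["bonds"]:
        a, b = bond["endpoint_atoms"]
        # Store each bond in both directions with its bond type.
        adjacency[a].append((b, bond["bond_type"]))
        adjacency[b].append((a, bond["bond_type"]))
    return symbols, adjacency


# Toy prediction using the same keys as the real output (values are invented).
toy = {
    "atoms": [{"atom_symbol": "C"}, {"atom_symbol": "[Me]"}, {"atom_symbol": "O"}],
    "bonds": [
        {"bond_type": "single", "endpoint_atoms": (0, 1)},
        {"bond_type": "double", "endpoint_atoms": (0, 2)},
    ],
}

symbols, adjacency = unexpanded_graph(toy)
for idx, symbol in enumerate(symbols):
    print(idx, symbol, adjacency[idx])
```

With a real prediction you would pass pred[0] instead of the toy dict; the abbreviation nodes keep their bracketed labels and their original indices.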

Oct 08 '25 15:10 UlrickFineddie

See show_alais_in_smiles.ipynb: you can use this code to show each atom's alias in the SMILES string. The code can transform "*c1cc(Br)c(Sc2ncccc2*)c(Br)c1" into "Brc1cc([t-Bu])cc(Br)c1Sc1ncccc1[Me]".
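For readers without the notebook, one possible way to get a similar result with plain RDKit is sketched below. This is not necessarily how the notebook does it. It assumes the abbreviation atoms are dummy atoms (atomic number 0) that carry the molfile alias in RDKit's "molFileAlias" property; the function name smiles_with_aliases is hypothetical:

```python
from rdkit import Chem


def smiles_with_aliases(mol):
    """Write SMILES in which aliased dummy atoms show their alias text."""
    mol = Chem.Mol(mol)  # work on a copy so the caller's molecule is untouched
    labels = {}
    for atom in mol.GetAtoms():
        if atom.GetAtomicNum() == 0 and atom.HasProp("molFileAlias"):
            # Tag each aliased dummy atom with a unique atom-map number so it
            # can be located in the SMILES string regardless of output order.
            map_num = atom.GetIdx() + 1
            atom.SetAtomMapNum(map_num)
            labels[map_num] = atom.GetProp("molFileAlias")
    smiles = Chem.MolToSmiles(mol)
    for map_num, label in labels.items():
        smiles = smiles.replace(f"[*:{map_num}]", f"[{label}]")
    return smiles


# Toy molecule: a dummy atom on benzene, with the alias set by hand.
toy = Chem.MolFromSmiles("*c1ccccc1")
toy.GetAtomWithIdx(0).SetProp("molFileAlias", "Me")
result = smiles_with_aliases(toy)
print(result)
```

For a MolScribe prediction, the input would be Chem.MolFromMolBlock(pred[0]['molfile']). Note that a string containing tokens such as [Me] or [t-Bu] is meant for display; standard SMILES parsers will most likely reject it.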

Oct 09 '25 06:10 LingjieBao1998