Is it possible to get the atom index order for canonicalSmiles and smiles?
Background: we would like to generate images labelled with SMILES and atom XY positions for training models.
Missing Functionality for Atom Index Mapping in SMILES and XY Coordinates
1. Problem Description:
The current functionality of Ketcher and the associated Indigo API does not provide a clear mapping between the atom order in SMILES (or canonical SMILES) and their indices or spatial positions (XY coordinates). This functionality is critical for tasks involving machine learning (ML), such as model training and molecular image generation with atom visualizations.
Users require a seamless mapping that connects:
- The atom order in SMILES (and canonical SMILES).
- The indices of the atoms in the molecular representation.
- The spatial (XY) coordinates of the atoms.
Hypothesis: The issue may arise from the absence of metadata (such as atom indices and positions) in the current API design. This functionality may not have been considered in the rendering and exporting architecture of the current version.
2. Steps to Reproduce:
- Open Ketcher Standalone (demo link).
- Draw a molecular structure, e.g., benzene (C₆H₆).
- Export the molecule using the getSmiles() or getMolfile() methods.
- Attempt to map the atom order from the exported SMILES/canonical SMILES to the original atom indices in the molfile or editor model (Canvas).
  - For example, consider the molecule:
    - SMILES: c1ccccc1
    - Atom indices in the editor: 0, 1, 2, 3, 4, 5.
  - Note that the atom order in SMILES cannot be mapped to the in-editor indices.
- Try extracting the XY coordinates of the atoms via the API and linking them to the atom order in SMILES.
3. Expected Behavior:
The Ketcher/Indigo API should provide a mechanism to:
- Retrieve the atom order in SMILES/Canonical SMILES.
- Map the atom indices in SMILES to their original indices in the molecular model.
- Associate the spatial coordinates (XY) of the atoms with their indices.
Desired functionality: The API should include a new method that returns the data in the following format:
{
"smiles": "c1ccccc1",
"canonical_smiles": "c1ccccc1",
"atom_mapping":
[
{ "original_index": 0, "smiles_index": 5, "x": 100, "y": 100 },
{ "original_index": 1, "smiles_index": 4, "x": 120, "y": 110 },
...
]
}
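To make the intended semantics of "atom_mapping" concrete, here is a minimal Python sketch of how such a payload could be assembled once the SMILES writer's atom order is known. The function name build_atom_mapping and the smiles_order argument are hypothetical; neither Ketcher nor Indigo is assumed to expose them today.

```python
def build_atom_mapping(coords, smiles_order):
    """Build the proposed "atom_mapping" list.

    coords[i]       -- (x, y) of the atom with original index i
    smiles_order[k] -- original index of the k-th atom written in the SMILES
                       (hypothetical input: it would have to come from the
                       SMILES writer itself)
    """
    if sorted(smiles_order) != list(range(len(coords))):
        raise ValueError("smiles_order must be a permutation of the atom indices")

    # Invert the permutation: original index -> position in the SMILES string.
    smiles_index_of = {orig: k for k, orig in enumerate(smiles_order)}

    return [
        {"original_index": i, "smiles_index": smiles_index_of[i], "x": x, "y": y}
        for i, (x, y) in enumerate(coords)
    ]


# Illustrative values only: three atoms written in reverse order.
example = build_atom_mapping([(100, 100), (120, 110), (140, 100)], [2, 1, 0])
```

The inversion step is the important part: the writer naturally knows "which original atom was written k-th", while the requested format asks for "where did original atom i end up".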
4. Actual Behavior:
Currently:
- The methods getSmiles() and getCanonicalSmiles() return only the SMILES/canonical SMILES string without atom metadata.
- getMolfile() contains atom information with original indices, but it is difficult to associate these indices with the SMILES order.
- There is no automatic mechanism to map:
  - Atom indices in SMILES/canonical SMILES.
  - Original atom indices in the in-editor structure.
  - XY coordinates of the atoms.
This creates challenges for users attempting to use the data for visualization or ML model training purposes.
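For reference, the half of the mapping that is already available (original indices and XY coordinates) can be read from the molfile atom block. The following is a minimal Python sketch, assuming the export is a V2000 molfile with the standard fixed-width columns; parse_v2000_atoms and the sample molfile are illustrative and not part of Ketcher or Indigo.

```python
def parse_v2000_atoms(molfile: str) -> list[dict]:
    """Return one record per atom: 0-based original index, symbol, x, y."""
    lines = molfile.splitlines()
    counts_line = lines[3]              # the 4th line is the counts line
    n_atoms = int(counts_line[0:3])     # columns 1-3: number of atoms
    atoms = []
    for i, line in enumerate(lines[4:4 + n_atoms]):
        atoms.append({
            "original_index": i,
            "symbol": line[31:34].strip(),  # columns 32-34: element symbol
            "x": float(line[0:10]),         # columns 1-10: x coordinate
            "y": float(line[10:20]),        # columns 11-20: y coordinate
        })
    return atoms


# Hand-written three-atom example (C-C-O), not real Ketcher output.
EXAMPLE_MOLFILE = "\n".join([
    "",
    "  example",
    "",
    "  3  2  0  0  0  0  0  0  0  0999 V2000",
    "    0.0000    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0",
    "    0.8660    0.5000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0",
    "    1.7320    0.0000    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0",
    "  1  2  1  0",
    "  2  3  1  0",
    "M  END",
])

atoms = parse_v2000_atoms(EXAMPLE_MOLFILE)
```

What this cannot provide is the smiles_index for each atom, which is the gap this issue describes.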
5. Analysis of the Problem:
Root Causes:
- SMILES is a text-based representation of molecules that simplifies data exchange but does not include physical molecular coordinates.
- Canonical SMILES reorders atoms to generate a unique string representation. This order is determined by the canonicalization algorithm and kept internal; the current API does not link it to spatial coordinates or to the original indices.
- The issue likely arises from the absence of a unified mechanism in Indigo Toolkit/StructService to integrate atom mapping data.
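To illustrate what "atom order in SMILES" means here: atoms are numbered in the order they appear in the string, left to right. A minimal Python sketch of that enumeration follows, assuming only bracket atoms, the organic subset, aromatic lowercase atoms, and the wildcard need to be recognised. The name smiles_atom_order is hypothetical, and a production pipeline would rely on a toolkit's SMILES parser instead of a regular expression.

```python
import re

# Bracket atoms first, then two-letter organic-subset atoms, then single letters.
ATOM_PATTERN = re.compile(r"\[[^\]]+\]|Br|Cl|[BCNOPSFI]|[bcnops]|\*")


def smiles_atom_order(smiles: str) -> list[tuple[int, str]]:
    """Return (smiles_index, atom_token) pairs in the order atoms are written."""
    return [(i, m.group()) for i, m in enumerate(ATOM_PATTERN.finditer(smiles))]


print(smiles_atom_order("C(CO)O"))
```

This only recovers the order within the string. Which in-editor atom each token corresponds to is known only to the code that wrote the SMILES, which is why the mapping has to be exposed by the toolkit.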
Significance for life sciences:
This issue limits the use of Ketcher and Indigo Toolkit for synthetic data creation, a critical requirement for modeling and scientific research.
6. Suggested Solutions:
High-Level Solution:
Expand the capability of Ketcher to generate mapping data that includes:
- The ordered list of atoms in SMILES/canonical SMILES.
- A mapping between the original atom indices and their order in SMILES/canonical SMILES.
- The association of atom coordinates (XY) with their indices.
Technical Solution:
- Add a New Mapping Method:
  - Extend StructService to include a getAtomMapping() function:

        // Sketch of the proposed method, not working code.
        async function getAtomMapping(ketcher) {
          const smiles = await ketcher.getSmiles();
          // getCanonicalSmiles() is used as named in this issue.
          const canonicalSmiles = await ketcher.getCanonicalSmiles();
          // Assumption: ketcher.editor.struct() exposes the in-editor atoms,
          // keyed by their original index, with coordinates in atom.pp.
          const atomMapping = [];
          ketcher.editor.struct().atoms.forEach((atom, id) => {
            atomMapping.push({
              original_index: id,
              smiles_index: null, // Placeholder: must come from the SMILES writer.
              x: atom.pp.x,
              y: atom.pp.y
            });
          });
          return { smiles, canonicalSmiles, atomMapping };
        }

  - This method should return SMILES, canonical SMILES, and atom mapping data (indices in SMILES, original indices, and XY coordinates).
- Enhance Indigo Toolkit: Modify the functionality of the canonicalSmiles() and smiles() methods to return atom index mappings:

        def get_smiles_atom_map(molecule):
            canonical_smiles = molecule.canonicalSmiles()
            atom_map = []
            for index, atom in enumerate(molecule.iterateAtoms()):
                # Assumption: atom.xyz() returns (x, y, z); it stands in for
                # the getX()/getY() calls in the original sketch.
                x, y, _z = atom.xyz()
                atom_map.append({
                    # Placeholder: iterateAtoms() follows the original order,
                    # so the real SMILES index must come from the writer.
                    "smiles_index": index,
                    "original_index": atom.index(),
                    "x": x,
                    "y": y,
                })
            return canonical_smiles, atom_map

- Documentation: Update the documentation to include:
  - A step-by-step guide on how to retrieve SMILES with index mapping.
  - Examples demonstrating how to use the new API for tasks involving data preparation for ML.
9. Additional Information:
- Inputs: Example SMILES: C1=CC=CC=C1 or C(CO)O.
- Sample Data:
  - SMILES: "C(CO)O".
  - Atom mapping:
    [
      { "smiles_index": 0, "original_index": 2, "x": 10.0, "y": 10.0 },
      { "smiles_index": 1, "original_index": 1, "x": 20.0, "y": 15.0 }
    ]
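On the consuming side, an ML pipeline would typically want the coordinates listed in SMILES order, for example as per-atom labels aligned with the tokens of the string. A minimal Python sketch under the assumption that the mapping arrives in the format shown above follows; coords_in_smiles_order is a hypothetical helper name.

```python
def coords_in_smiles_order(atom_mapping: list[dict]) -> list[tuple[float, float]]:
    """Return (x, y) pairs sorted by each atom's position in the SMILES string."""
    ordered = sorted(atom_mapping, key=lambda entry: entry["smiles_index"])
    return [(entry["x"], entry["y"]) for entry in ordered]


# The two sample entries from this issue, deliberately listed out of order.
sample_mapping = [
    {"smiles_index": 1, "original_index": 1, "x": 20.0, "y": 15.0},
    {"smiles_index": 0, "original_index": 2, "x": 10.0, "y": 10.0},
]

ordered_coords = coords_in_smiles_order(sample_mapping)
```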