MolScribe

A question about recognition failures

Open: sunyrain opened this issue 1 year ago • 1 comment

Hello, I would like to ask: did recognition fail on these images because of the R group, the coordination bond, or the way the five-membered ring is drawn in the figure?

[image: Screenshot 2024-07-14 210820]

sunyrain, Jul 14 '24 14:07

Hello @sunyrain,

The issue here is that the 5-membered ring can only be aromatic if it carries a negative charge, and no charge is drawn in the image. The predicted SMILES (for the first structure; the second one has the same problem) is:

C*C.[1*]C1=CC=C([2*])c2cc(*c3cc4c(c3)C([4*])=CC=C4[3*])cc21

This is incorrect, and Chem.SanitizeMol(mol) will fail on it: to be valid, one of the atoms in each five-membered ring that has 3 heavy-atom neighbours would have to bear a negative charge.
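The charge argument above can be checked with a simple electron count. This is a minimal sketch, assuming the textbook Hückel 4n+2 rule and one π electron per sp2 carbon; the helper names are illustrative and are not part of RDKit or MolScribe.

```python
def ring_pi_electrons(num_carbons: int, formal_charge: int) -> int:
    """Pi electrons of an all-carbon ring.

    Assumption: every ring carbon is sp2 and contributes one pi electron;
    a negative formal charge adds electrons, a positive one removes them.
    """
    return num_carbons - formal_charge


def satisfies_huckel(pi_electrons: int) -> bool:
    """True if the count has the form 4n + 2 for some integer n >= 0."""
    return pi_electrons >= 2 and (pi_electrons - 2) % 4 == 0


neutral_five_ring = ring_pi_electrons(5, 0)   # 5 pi electrons
anionic_five_ring = ring_pi_electrons(5, -1)  # 6 pi electrons

print(satisfies_huckel(neutral_five_ring))  # False
print(satisfies_huckel(anionic_five_ring))  # True
```

Under these assumptions the neutral all-carbon five-membered ring has an odd electron count and cannot be written as aromatic, while the anion reaches six π electrons.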

Here is a function that corrects the charge on 5-membered rings. It can certainly be improved.

The function implements its own ring search instead of calling "mol.GetRingInfo().AtomRings()", because that call returns an empty set when the initial molecule is not valid. (I would be interested if someone has a simpler and faster solution.)

The function should be called just before "mol = verify_chirality(mol, coords, symbols, edges, debug)" at https://github.com/thomas0809/MolScribe/blob/main/molscribe/chemistry.py#L553

After this change, the new output is 'C*C.[1*]c1ccc([2*])[c-]2cc(*[c-]3cc4c([3*])ccc([4*])c4c3)cc12', which is now valid.

from rdkit import Chem


def correct_charges_aromatic_rings(mol):
    """Put a -1 formal charge on one valence-4 atom of each 5-membered ring."""
    if mol is None or mol.GetNumAtoms() == 0:
        return mol

    n_atoms = mol.GetNumAtoms()
    ring_size = 5

    # Build an adjacency list from the bonds (works on unsanitized mols).
    adj = {i: [] for i in range(n_atoms)}
    for bond in mol.GetBonds():
        a = bond.GetBeginAtomIdx()
        b = bond.GetEndAtomIdx()
        adj[a].append(b)
        adj[b].append(a)

    cycles = set()

    # DFS that only expands to atoms with index > start, so each cycle is
    # found from its smallest atom index only.
    def dfs(current, start, path):
        for nbr in adj[current]:
            if nbr == start and len(path) == ring_size:
                # The same cycle is walked in both directions; keep only the
                # walk whose second atom is the smaller one, so that each
                # ring is stored (and corrected) exactly once.
                if path[1] < path[-1]:
                    cycles.add(tuple(path))
            elif nbr > start and nbr not in path and len(path) < ring_size:
                dfs(nbr, start, path + [nbr])

    # Run the DFS from each start atom.
    for s in range(n_atoms):
        dfs(s, s, [s])

    # The DFS enumerates every 5-membered cycle, so there is nothing to
    # correct when it finds none.
    if not cycles:
        return mol

    # Work on an RWMol copy so the input molecule is left untouched.
    rw = Chem.RWMol(mol)

    # For each distinct 5-cycle, charge only the first atom with valence 4.
    for cyc in sorted(cycles):
        for idx in cyc:
            atom = rw.GetAtomWithIdx(idx)
            try:
                if atom.GetExplicitValence() == 4:
                    atom.SetFormalCharge(-1)
                    break
            except Exception:
                # If the valence cannot be read for this atom, skip it.
                continue

    return rw.GetMol()
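The ring search used by the function can be tried on its own, without RDKit. The following is a minimal sketch on a plain adjacency dictionary; `build_adjacency`, `find_cycles`, and the indene-like test skeleton (a six-membered ring fused to a five-membered ring) are illustrative names and data, not part of MolScribe.

```python
def build_adjacency(num_atoms, bonds):
    """Return {atom_index: [neighbour indices]} for an undirected bond list."""
    adj = {i: [] for i in range(num_atoms)}
    for a, b in bonds:
        adj[a].append(b)
        adj[b].append(a)
    return adj


def find_cycles(adj, size):
    """Enumerate simple cycles of exactly `size` atoms, each stored once."""
    cycles = set()

    def dfs(current, start, path):
        for nbr in adj[current]:
            if nbr == start and len(path) == size:
                # Both walking directions reach this point; keep one of them.
                if path[1] < path[-1]:
                    cycles.add(tuple(path))
            elif nbr > start and nbr not in path and len(path) < size:
                # Only visit indices above `start`, so that a cycle is
                # reported from its smallest atom index only.
                dfs(nbr, start, path + [nbr])

    for start in adj:
        dfs(start, start, [start])
    return cycles


# Atoms 0-5 form the six-membered ring; atoms 4, 5, 6, 7, 8 form the
# five-membered ring fused to it through the 4-5 bond.
bonds = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, 0),
         (4, 6), (6, 7), (7, 8), (8, 5)]
adj = build_adjacency(9, bonds)

print(find_cycles(adj, 5))  # {(4, 5, 8, 7, 6)}
```

Note that this enumerates every simple cycle of the requested size, which is not necessarily the same set as a smallest-set-of-smallest-rings perception; for the fused bicyclic skeleton above the two coincide.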

UlrickFineddie, Oct 08 '25 14:10