Question about recognition failure
Hello, I would like to ask, did the failure in recognizing these images occur because of the R group, the coordination bond, or the depiction of the five-membered ring as shown in the figure?
Hello @sunyrain ,
The issue here is that for the 5-membered ring to be aromatic, it must carry a negative charge, which is not shown in the image. The predicted SMILES (for the first image; the problem is the same for the second) is: C*C.[1*]C1=CC=C([2*])c2cc(*c3cc4c(c3)C([4*])=CC=C4[3*])cc21. This is incorrect, and Chem.SanitizeMol(mol) will fail, because one of the atoms in the five-membered ring that has 3 heavy-atom neighbours would need to bear a negative charge.
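To make the electron count behind this explicit, here is a minimal sketch of Hückel's 4n + 2 rule applied to the five-membered ring treated in isolation, as a cyclopentadienyl-type ring. That is a simplification for the fused system in the figure. The helper names `ring_pi_electrons` and `is_huckel_count` are made up for illustration and are not part of RDKit or MolScribe.

```python
def is_huckel_count(pi_electrons: int) -> bool:
    """Return True if the count has the form 4n + 2 for some integer n >= 0."""
    return pi_electrons >= 2 and (pi_electrons - 2) % 4 == 0


def ring_pi_electrons(n_sp2_carbons: int, formal_charge: int) -> int:
    """Pi electrons of an all-carbon ring: one per sp2 carbon, adjusted by the net charge."""
    return n_sp2_carbons - formal_charge


neutral = ring_pi_electrons(5, 0)   # five carbons, no charge
anion = ring_pi_electrons(5, -1)    # five carbons, one negative charge

print(neutral, is_huckel_count(neutral))  # 5 False
print(anion, is_huckel_count(anion))      # 6 True
```

With five ring carbons and no charge the ring has 5 pi electrons, which does not fit 4n + 2. The extra electron from the negative charge brings it to 6, which is why the anionic form is the one that can be written as aromatic.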
Here is a function that corrects the charge on the 5-membered ring; it can certainly be improved.
The function implements its own version of "mol.GetRingInfo().AtomRings()", because that call returns no rings when the initial molecule is not valid. (I would be interested if someone has a simpler and faster solution.)
The function should be called before "mol = verify_chirality(mol, coords, symbols, edges, debug)" at https://github.com/thomas0809/MolScribe/blob/main/molscribe/chemistry.py#L553
With this correction, the new output is 'C*C.[1*]c1ccc([2*])[c-]2cc(*[c-]3cc4c([3*])ccc([4*])c4c3)cc12', which is now valid.
from rdkit import Chem


def correct_charges_aromatic_rings(mol):
    """Set a -1 formal charge on one valence-4 atom of each five-membered ring that has one."""
    if mol is None:
        return mol
    N = mol.GetNumAtoms()
    if N == 0:
        return mol

    # build adjacency list from bonds (works on unsanitized mols)
    adj = {i: [] for i in range(N)}
    for b in mol.GetBonds():
        a = b.GetBeginAtomIdx()
        c = b.GetEndAtomIdx()
        adj[a].append(c)
        adj[c].append(a)

    TARGET_LEN = 5
    cycles = set()

    # DFS that only expands to nodes with index > start, so that start is
    # always the smallest index in any cycle it finds
    def dfs(current, start, path):
        if len(path) > TARGET_LEN:
            return
        for nbr in adj[current]:
            if nbr == start and len(path) == TARGET_LEN:
                # path starts with start by construction; each cycle is reached
                # once per direction, so keep only the orientation whose second
                # atom index is smaller than its last one
                if path[1] < path[-1]:
                    cycles.add(tuple(path))
            # only expand to nodes strictly greater than start to ensure start is the smallest in the cycle
            elif nbr > start and nbr not in path and len(path) < TARGET_LEN:
                dfs(nbr, start, path + [nbr])

    # run DFS from each start node
    for s in range(N):
        dfs(s, s, [s])

    # use RWMol for safe in-place modifications
    rw = Chem.RWMol(mol)
    if cycles:
        # For each distinct 5-cycle, modify only the first matching atom (valence == 4)
        for cyc in cycles:
            for idx in cyc:
                atom = rw.GetAtomWithIdx(idx)
                try:
                    if atom.GetExplicitValence() == 4:
                        atom.SetFormalCharge(-1)
                        break
                except Exception:
                    # if atom properties are weird, skip it
                    continue
    else:
        # fallback when the DFS found no 5-cycle: group the atoms that RDKit
        # itself reports as members of a five-membered ring
        ring_atoms = [i for i in range(N) if mol.GetAtomWithIdx(i).IsInRingSize(TARGET_LEN)]
        seen = set()
        for i in ring_atoms:
            if i in seen:
                continue
            comp = []
            stack = [i]
            seen.add(i)
            while stack:
                u = stack.pop()
                comp.append(u)
                for v in adj[u]:
                    if v in ring_atoms and v not in seen:
                        seen.add(v)
                        stack.append(v)
            # apply correction once for this connected group
            for idx in comp:
                atom = rw.GetAtomWithIdx(idx)
                try:
                    if atom.GetExplicitValence() == 4:
                        atom.SetFormalCharge(-1)
                        break
                except Exception:
                    continue
    return rw.GetMol()
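
The ring search can also be tried on its own, without RDKit, on a plain bond list. The sketch below is only an illustration of the same DFS idea: `find_cycles` and the toy indene-like bond list are hypothetical names and data, not MolScribe code. It keeps a single orientation per ring by requiring the second atom index of the path to be smaller than the last one.

```python
from typing import Dict, Iterable, List, Set, Tuple


def find_cycles(bonds: Iterable[Tuple[int, int]], n_atoms: int,
                size: int = 5) -> Set[Tuple[int, ...]]:
    """Enumerate simple cycles of exactly `size` atoms in an undirected graph."""
    adj: Dict[int, List[int]] = {i: [] for i in range(n_atoms)}
    for a, b in bonds:
        adj[a].append(b)
        adj[b].append(a)

    cycles: Set[Tuple[int, ...]] = set()

    def dfs(current: int, start: int, path: List[int]) -> None:
        for nbr in adj[current]:
            if nbr == start and len(path) == size:
                # Each cycle is reached once per direction; keep one orientation.
                if path[1] < path[-1]:
                    cycles.add(tuple(path))
            elif nbr > start and nbr not in path and len(path) < size:
                # Only visit indices above start, so start is the smallest atom.
                dfs(nbr, start, path + [nbr])

    for s in range(n_atoms):
        dfs(s, s, [s])
    return cycles


# Indene-like skeleton: a five-membered ring (atoms 0-4) fused to a
# six-membered ring (atoms 3, 4, 5, 6, 7, 8) through the shared bond 3-4.
bonds = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 0),
         (4, 5), (5, 6), (6, 7), (7, 8), (8, 3)]

print(find_cycles(bonds, n_atoms=9))           # {(0, 1, 2, 3, 4)}
print(find_cycles(bonds, n_atoms=9, size=6))   # {(3, 4, 5, 6, 7, 8)}
```

Because the search depth is capped at the ring size, the cost stays modest for drawn molecules, where each atom has only a few neighbours. Note that it enumerates every simple cycle of the requested size, which is not necessarily the same set of rings that RDKit's ring perception reports.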