openff-toolkit icon indicating copy to clipboard operation
openff-toolkit copied to clipboard

to/from smi interoperability between OE/RDkit

Open trevorgokey opened this issue 5 years ago • 6 comments

Describe the bug Round trip smi using Molecule.to_file and Molecule.from_file is inconsistent, and at worse fails for the RDkit case.

To Reproduce Showing the cases of from->to, showing RD->RD, OE->OE, OE->RD, RD->OE, using the following rough template code as the test:

from openforcefield.topology import Molecule
Molecule.from_smiles('CC').to_file('x.smi', file_format='smi')
Molecule.from_file('x.smi')
Molecule.to_file('x2.smi', file_format='smi')

So, for example in the RD->OE case, the x.smi file is saved using RDkit, then is loaded and saved using OE (x2.smi).

  1. Using RD->RD Fails with pure RDkit because the file saved (x.smi) is
SMILES Name
[H]C([H])([H])C([H])([H])[H] 0

and it attempts to parse the header in from_file.

  1. Using OE->OE works, and the file saved is
CC
  1. Using OE->RD Produces this warning:
[11:57:19] WARNING: no name column found on line 0

and x2.smi is the same as the pure RD case (header with name 0)

  1. Using RD->OE Works, as it continues with parsing after failing to parse the header, producing this warning:
Warning: Problem parsing SMILES:
Warning: SMILES Name 
Warning:  ^

Warning: Error reading molecule "" in Canonical stereo SMILES format.

and the x2.smi now is

CC 0

The only significant bug is the RD->RD case, since it bails on trying to parse the header. OE, in a similar situation, just emits a warning and continues.

Another problematic aspect is that, in the RD->OE case, we supply an explicit hydrogen smiles, but OE strips them in to_file.

trevorgokey avatar Jul 30 '20 19:07 trevorgokey

Thanks for piecing this one out

@j-wags on Slack:

Names: I'd consider that innocuous, and stick with RDKit's default behavior (print two columns).

Hydrogens: This is complicated. and basically our Molecule.to_smiles method has some guarantees about the format of SMILES that it will output (explicit hydrogen, canonical, isomeric). But Molecule.to_file uses RDKit's default SMILES writer, which has some simple default settings. We could look at whether the RDKit SMILES writer has flags we could set to make it behave like our to_smiles methods.

The last point seems like the easiest path forward. Given the well-known complexities, trying to get three toolkits to treat SMILES interoperability the same way in all cases is probably not going to happen. But having some baselines similarities with default arguments - i.e. including hydrogens or not - would be great. Each wrappers' to_smiles method includes hydrogens, as does RDKitToolkitWrapper.to_file, but OpenEyeToolkitWrapper.to_file does not.

mattwthompson avatar Jul 30 '20 20:07 mattwthompson

If this is the case, why not just have to_file('x.smi', file_format='smi', kwargs) just point to to_smiles(x.smi, **kwargs) for their respective toolkits?

trevorgokey avatar Jul 30 '20 21:07 trevorgokey

Hm, so options are:

  1. Do nothing and continue failing round trips
  2. Modify the RDKit file writer with some or all of the same kwargs as we use in its to_smiles method, which will have us supporting parallel codebases in to_smiles and to_file in the future
  3. Trevor's suggestion, where we intercept to_file(..., file_format='smi'...) and then with open(filename,'w') as of: of.write(mol.to_smiles(**kwargs))

Now that I've typed out... pretty much the whole thing, the last option seems pretty sweet. Does anyone see unanticipated problems/complexity with us doing our own writes instead of relying on the toolkits?

j-wags avatar Aug 13 '20 00:08 j-wags

#762 is related but I don't think ultimately will fix this

mattwthompson avatar Nov 05 '20 22:11 mattwthompson

Do we guarantee safe roundtrips for this file format? I don't think we do.

mattwthompson avatar Feb 01 '23 19:02 mattwthompson

And is there a canonical specification for the format of this file?

mattwthompson avatar Feb 01 '23 20:02 mattwthompson