to/from smi interoperability between OE/RDkit
Describe the bug
Round trip smi using Molecule.to_file and Molecule.from_file is inconsistent, and at worse fails for the RDkit case.
To Reproduce Showing the cases of from->to, showing RD->RD, OE->OE, OE->RD, RD->OE, using the following rough template code as the test:
from openforcefield.topology import Molecule
Molecule.from_smiles('CC').to_file('x.smi', file_format='smi')
Molecule.from_file('x.smi')
Molecule.to_file('x2.smi', file_format='smi')
So, for example in the RD->OE case, the x.smi file is saved using RDkit, then is loaded and saved using OE (x2.smi).
- Using RD->RD
Fails with pure RDkit because the file saved (
x.smi) is
SMILES Name
[H]C([H])([H])C([H])([H])[H] 0
and it attempts to parse the header in from_file.
- Using OE->OE works, and the file saved is
CC
- Using OE->RD Produces this warning:
[11:57:19] WARNING: no name column found on line 0
and x2.smi is the same as the pure RD case (header with name 0)
- Using RD->OE Works, as it continues with parsing after failing to parse the header, producing this warning:
Warning: Problem parsing SMILES:
Warning: SMILES Name
Warning: ^
Warning: Error reading molecule "" in Canonical stereo SMILES format.
and the x2.smi now is
CC 0
The only significant bug is the RD->RD case, since it bails on trying to parse the header. OE, in a similar situation, just emits a warning and continues.
Another problematic aspect is that, in the RD->OE case, we supply an explicit hydrogen smiles, but OE strips them in to_file.
Thanks for piecing this one out
@j-wags on Slack:
Names: I'd consider that innocuous, and stick with RDKit's default behavior (print two columns).
Hydrogens: This is complicated. and basically our Molecule.to_smiles method has some guarantees about the format of SMILES that it will output (explicit hydrogen, canonical, isomeric). But Molecule.to_file uses RDKit's default SMILES writer, which has some simple default settings. We could look at whether the RDKit SMILES writer has flags we could set to make it behave like our to_smiles methods.
The last point seems like the easiest path forward. Given the well-known complexities, trying to get three toolkits to treat SMILES interoperability the same way in all cases is probably not going to happen. But having some baselines similarities with default arguments - i.e. including hydrogens or not - would be great. Each wrappers' to_smiles method includes hydrogens, as does RDKitToolkitWrapper.to_file, but OpenEyeToolkitWrapper.to_file does not.
If this is the case, why not just have to_file('x.smi', file_format='smi', kwargs) just point to to_smiles(x.smi, **kwargs) for their respective toolkits?
Hm, so options are:
- Do nothing and continue failing round trips
- Modify the RDKit file writer with some or all of the same kwargs as we use in its
to_smilesmethod, which will have us supporting parallel codebases into_smilesandto_filein the future - Trevor's suggestion, where we intercept
to_file(..., file_format='smi'...)and thenwith open(filename,'w') as of: of.write(mol.to_smiles(**kwargs))
Now that I've typed out... pretty much the whole thing, the last option seems pretty sweet. Does anyone see unanticipated problems/complexity with us doing our own writes instead of relying on the toolkits?
#762 is related but I don't think ultimately will fix this
Do we guarantee safe roundtrips for this file format? I don't think we do.
And is there a canonical specification for the format of this file?