1:1 conversion between BSS and rdkit mol objects

Open JenkeScheen opened this issue 4 years ago • 10 comments

Recently I've had a few use-cases where it would have been helpful to be able to (within a BSS workflow) convert BSS molecules to RDKit molecules (and back). I worked around this by just writing molecules and loading them into RDKit, but I've been wondering if it would not be possible to implement something similar to BSS_mol.to_sire_object? I.e. BSS_mol.to_rdkit_object and BSS_mol.from_rdkit_object?

This would open up BSS workflows to a wide range of additional operations in rdkit that would help with FEP such as:

  • basic RDKit operations like generating 3D coordinates or flattening a molecule to 2D; saving a molecule image
  • more advanced RDKit stuff like Rgroup decomposition; molecular similarity calculations etc.

Would it be possible to translate the object structure directly, or could there be compatibility issues? I have observed before that atomic indexing can be inconsistent between the APIs, which might be a problem. Alternatively, a SMILES-based conversion might work, or even a temporary mol written to and read from file.

JenkeScheen avatar Feb 24 '21 11:02 JenkeScheen

I would be very much in favour of this!

ppxasjsm avatar Feb 24 '21 11:02 ppxasjsm

Hi @JenkeScheen. I think this is a very good idea, and nicely fits the BioSimSpace philosophy, i.e. providing simplified wrappers around external tools, but giving power-users full access to underlying objects where required. We do something similar with the BioSimSpace.Trajectory package, where the user can request the trajectory in MDTraj or MDAnalysis format if they are more comfortable manipulating it that way.

As for a 1:1 conversion to RDKit: I don't have a good enough understanding of RDKit's underlying data structure to know if this is something that would be easy to do by combining the Sire and RDKit Python APIs. As a first pass we could choose to go through an intermediate file format that retains the most information possible, i.e. one from which both BioSimSpace and RDKit can reconstruct what they need. With issues like atom numbering mismatches we could use something like BioSimSpace._SireWrappers.Molecule._makeCompatibleWith to make sure that the topology isn't monkeyed with if you do a round trip.
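
To make the atom-numbering concern concrete, here is a minimal sketch, independent of both APIs, of how a round trip could be checked for reordering. It assumes atoms are represented as plain (element, coordinates) tuples; `match_atom_order` is a hypothetical helper written for illustration and is not how `_makeCompatibleWith` works internally.

```python
from typing import List, Sequence, Tuple

# Hypothetical atom representation: (element, (x, y, z) in Angstrom).
Atom = Tuple[str, Tuple[float, float, float]]


def match_atom_order(reference: Sequence[Atom], other: Sequence[Atom],
                     tol: float = 1e-3) -> List[int]:
    """Return a mapping such that other[mapping[i]] corresponds to reference[i].

    Atoms are matched on element and on coordinates within `tol`.
    A ValueError is raised if any atom has no unique partner.
    """
    if len(reference) != len(other):
        raise ValueError("Atom counts differ")

    unused = set(range(len(other)))
    mapping = []
    for element, xyz in reference:
        hits = [j for j in unused
                if other[j][0] == element
                and all(abs(a - b) <= tol for a, b in zip(xyz, other[j][1]))]
        if len(hits) != 1:
            raise ValueError(f"No unique match for {element} at {xyz}")
        mapping.append(hits[0])
        unused.remove(hits[0])
    return mapping


# The same water molecule, written back with a different atom order.
original = [("O", (0.000, 0.000, 0.117)),
            ("H", (0.000, 0.757, -0.469)),
            ("H", (0.000, -0.757, -0.469))]
reordered = [original[1], original[2], original[0]]

mapping = match_atom_order(original, reordered)
print(mapping)  # [2, 0, 1]
```

Applying the mapping to the round-tripped atoms restores the original order, so properties indexed by atom could be carried across safely. Matching on coordinates only works when the conversion leaves the geometry untouched.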

Eventually I could imagine things like .toOpenMMSystem / Molecule and toOpenFFSystem / Molecule too.

lohedges avatar Feb 24 '21 11:02 lohedges

Even if we do just go via intermediate files for the time being, we could certainly abstract this from the user.

lohedges avatar Feb 24 '21 12:02 lohedges

@JenkeScheen can you post examples of how this can currently be done in a hacky way (BSS --> RDKit; RDKit --> BSS)? We can decide later whether this can and should be implemented in BSS.


jmichel80 avatar Feb 24 '21 12:02 jmichel80

FYI: We also do multiple conversions behind the scenes for the OpenFF support, i.e. BSS --> RDKit --> OpenFF --> OpenMM --> ParmEd --> BSS. (Not all are 1:1 and involve intermediate files.)

lohedges avatar Feb 24 '21 12:02 lohedges

I second that this is a very good idea. I think it would be great if we could do API-level conversions (however this is achieved) which are transparent to the user. It would greatly improve interoperability.

chryswoods avatar Feb 24 '21 12:02 chryswoods

@jmichel80 I've not been doing anything fancy, just using the same input files mostly. However, an impromptu conversion would probably look like:

  • Write to mol2 with BSS: BSS.IO.saveMolecules(mol_path, BSS_mol, "MOL2")
  • Read into an RDKit molecule (which sanitises by default, which might break 1:1 conversion): RDKIT_mol = rdkit.Chem.rdmolfiles.MolFromMol2File(mol_path)
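
The file-based workaround could be hidden from the user along these lines. This is a rough sketch rather than anything that exists in BSS: `bss_to_rdkit` and `rdkit_to_bss` are hypothetical helper names, Mol2 and PDB are just the intermediate formats mentioned in this thread, and the RDKit molecule is assumed to carry 3D coordinates before it is written to PDB.

```python
import os
import tempfile


def bss_to_rdkit(bss_mol, sanitize=True):
    """Convert a BioSimSpace molecule to an RDKit molecule via a temporary Mol2 file."""
    # Imported lazily so the helpers can be defined without either package installed.
    import BioSimSpace as BSS
    from rdkit import Chem

    with tempfile.TemporaryDirectory() as tmp_dir:
        # saveMolecules takes a file *base* and returns the paths it wrote.
        files = BSS.IO.saveMolecules(os.path.join(tmp_dir, "mol"), bss_mol, "MOL2")
        # removeHs=False keeps explicit hydrogens so the atom count is unchanged.
        rdkit_mol = Chem.MolFromMol2File(files[0], sanitize=sanitize, removeHs=False)

    if rdkit_mol is None:
        raise ValueError("RDKit could not parse the intermediate Mol2 file")
    return rdkit_mol


def rdkit_to_bss(rdkit_mol):
    """Convert an RDKit molecule (with 3D coordinates) to BioSimSpace via a temporary PDB file."""
    import BioSimSpace as BSS
    from rdkit import Chem

    with tempfile.TemporaryDirectory() as tmp_dir:
        path = os.path.join(tmp_dir, "mol.pdb")
        Chem.MolToPDBFile(rdkit_mol, path)
        system = BSS.IO.readMolecules(path)

    # readMolecules returns a System; take its first (only) molecule.
    return system[0]
```

Passing `sanitize=False` skips RDKit's clean-up step, which keeps the molecule closer to what BSS wrote but leaves it to the caller to sanitise before using most RDKit functionality.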

edit: happy to see people agree!

JenkeScheen avatar Feb 24 '21 12:02 JenkeScheen

Thanks, that's essentially the approach that we already take internally, e.g. see here in the MCS routine of BioSimSpace.Align. The one issue with file conversion is that RDKit has internal molecular sanitisation routines, and how well these perform depends on the file format in question and how much information is in that file. If we do want to go via intermediates for the time being, then it would be good to work out what file format works best (SDF would be great, but we don't have a parser) and what formatting tweaks we might need to make in order to make things as robust as possible, e.g. we don't write CONECT records in PDB files, which are helpful to RDKit, and we don't infer Sybyl atom types in Mol2 files either.
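
One way to see how a given intermediate file fares is to separate parsing from sanitisation. The sketch below is an illustration only (`load_mol2_leniently` is a hypothetical name): it reads a Mol2 file without sanitising, then runs RDKit's sanitisation with errors caught so that the caller can tell whether it succeeded.

```python
def load_mol2_leniently(path):
    """Parse a Mol2 file without sanitisation, then report whether sanitisation succeeds.

    Returns (mol, ok), where ok is True if every sanitisation step passed.
    """
    # Imported lazily so the helper can be defined without RDKit installed.
    from rdkit import Chem

    mol = Chem.MolFromMol2File(path, sanitize=False, removeHs=False)
    if mol is None:
        raise ValueError(f"RDKit could not parse {path}")

    # With catchErrors=True, SanitizeMol returns the first failing operation
    # instead of raising; SANITIZE_NONE means nothing failed.
    failed = Chem.SanitizeMol(mol, catchErrors=True)
    return mol, failed == Chem.SanitizeFlags.SANITIZE_NONE
```

Running this over the same molecule written in different formats, or with different formatting tweaks, would give a simple pass/fail comparison. Note that a failed sanitisation can leave the molecule partially modified.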

lohedges avatar Feb 24 '21 13:02 lohedges

Just thinking about this again: would it not be simpler to expose the OFF-based SMILES parser, then have a BSS-to-RDKit (and vice versa) converter that just passes the SMILES? I'm sure that the RDKit SMILES engine is more robust than the OFF one, but I doubt we'd ever have to deal with the funky molecules that RDKit has to.

JenkeScheen avatar Jul 12 '21 11:07 JenkeScheen

I don't think this solves the problem since we are not directly parsing SMILES into a molecular structure, rather using OFF to generate a "molecule" behind the scenes using whatever engine it prefers (presumably RDKit by default) then writing this to file (PDB) and reading back in. There is still no way of going directly from BSS/Sire to (say) RDKit without first going through an intermediate file format.

lohedges avatar Jul 12 '21 12:07 lohedges

Closing as this is now mostly implemented via Sire.
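
For readers landing here later, a round trip might look roughly like the sketch below. It assumes a recent BioSimSpace release that provides the `BioSimSpace.Convert` sub-package with `toRDKit` and `toBioSimSpace`; check the documentation of your installed version, since availability and behaviour may differ. `round_trip_via_rdkit` is a hypothetical helper name.

```python
def round_trip_via_rdkit(bss_mol):
    """Convert a BioSimSpace molecule to RDKit and back, with no intermediate files.

    Returns (rdkit_mol, bss_mol_again).
    """
    # Assumption: BioSimSpace.Convert is available in the installed release.
    import BioSimSpace as BSS

    rdkit_mol = BSS.Convert.toRDKit(bss_mol)
    bss_mol_again = BSS.Convert.toBioSimSpace(rdkit_mol)
    return rdkit_mol, bss_mol_again
```

The RDKit object in between can be passed to any RDKit routine, e.g. for similarity calculations or drawing.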

lohedges avatar Feb 27 '23 10:02 lohedges