BioSimSpace How to preserve information that isn't stored in coordinate/topology files between nodes

On the latest devel version, BioSimSpace-2019.1.0+297.gd5cc5b7, the attached BSS script fails on this pair of input files with the following error message:

 ~/biosimspace.app/bin/python custom_prepareFEP.py --input1 input/THROM\:throm_lig6a/* --input2 input/THROM\:throm_lig6b/* --output 6a_6b

Traceback (most recent call last):
  File "/export/users/julien/biosimspace.app/lib/python3.7/site-packages/BioSimSpace-2019.1.0+297.gd5cc5b7-py3.7.egg/BioSimSpace/FreeEnergy/_binding.py", line 117, in __init__
UserWarning: Exception 'SireBase::missing_property' thrown by the thread 'master'.
None of the contained forcefields have a property called water_model. Available properties are [ fileformat, space, time ].

This could be because BSS.Solvent was not used in this script to solvate system1, since the system was already solvated when it was loaded.

Inserting

from Sire import Base as _SireBase
system1._sire_object.setProperty("water_model", _SireBase.wrap("tip3p"))

just before instantiating freerng solves the issue.

I have uploaded a zip file with script and inputs to reproduce the error.

This points to an issue with the BSS.FreeEnergy API. The constructor expects a merged molecule, but if the system was obtained by loading already-solvated prm7/rst7 files (rather than by using solvate within the script), it will be missing properties required for the initialisation of the free energy object.

This means we cannot currently modularise the binding free energy pipeline by chaining different BSS nodes.

One solution would be to preserve more setup information in the system object (perhaps by saving pickles?), or to be cleverer at guessing missing information (though it could be difficult to guess exactly which solvent model was used), or a combination of the two.

custFEP.zip

Operating System: Linux
Installation method: latest BSS devel binary

Aug 13 '19 07:08 jmichel80

Thanks, @jmichel80. You are correct, we currently need to know the water topology in order to re-solvate the system for the free leg of a binding free energy simulation. I have code in Sire to detect water molecules, so it should be possible to guess the closest topology and use that. (Perhaps warning the user.)

However, the main cause of this limitation is a design flaw in SOMD, i.e. that it requires an AMBER water topology naming convention. This means that we need to know the water topology ahead of time in order to convert the water molecules in the system to the correct format required by SOMD. (For any free energy simulation, not just binding.) There is code in Sire to do this, but it doesn't auto-detect the topology.

I personally don't see the need for this restriction in SOMD, since the code just seems to use the names in order to detect the waters. There is now code in Sire that can search for water molecules, e.g. system.search("water"), so it should be an easy fix. It would also have been just as easy to check the atoms by element type rather than by name in the first place. I'll add a Sire issue for this to make sure it gets fixed.
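To illustrate the point about checking elements rather than names, here is a minimal, library-free sketch. It is not Sire's implementation; the function name `is_water` and the convention of marking virtual sites as "Xx" or an empty string are assumptions made for this example only.

```python
from collections import Counter


def is_water(elements):
    """Return True if the real atoms of a molecule are exactly two H and one O.

    `elements` is an iterable of element symbols, one per atom. Virtual
    sites (such as the extra points of 4- and 5-site water models) are
    assumed to be labelled "Xx" or "" and are ignored.
    """
    counts = Counter(e for e in elements if e not in ("", "Xx"))
    return counts == Counter({"H": 2, "O": 1})


# Atom names vary between naming conventions, but the elements do not.
print(is_water(["O", "H", "H"]))
print(is_water(["O", "H", "H", "Xx"]))
print(is_water(["C", "H", "H", "H", "H"]))
```

A check of this kind does not depend on any particular water naming convention, which is the restriction discussed above.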

Aug 13 '19 08:08 lohedges

I agree this points to a weakness in the current SOMD code, but there is still a general issue around the modularity of the BSS components. I do wonder what we can do to avoid losing information about a system when BSS nodes complete. In principle we could save a complete representation of the actual object rather than write output files, but that raises questions about how information is passed between nodes.

Dr. Julien Michel, Senior Lecturer Room 263, School of Chemistry University of Edinburgh David Brewster road Edinburgh, EH9 3FJ United Kingdom phone: +44 (0)131 650 4797 http://www.julienmichel.net/

The University of Edinburgh is a charitable body, registered in Scotland, with registration number SC005336.

Aug 13 '19 10:08 jmichel80

In free_energy.py, I currently guess the water model topology if the information is missing. For now, I'll add the same logic to the binding free energy code and warn the user, so that they are aware if the topology has changed.
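As a rough illustration of this kind of guess-and-warn fallback (not the actual code in free_energy.py), the sketch below picks a model name from the number of sites in a water molecule. The function name, the mapping, and the choice of "tip3p" for three-site waters are assumptions for this example; a site count alone cannot distinguish three-site models such as TIP3P and SPC/E.

```python
import warnings

# Assumed mapping from the number of sites per water molecule to a model name.
# Three-site models (TIP3P, SPC, SPC/E) are indistinguishable by count alone.
_SITES_TO_MODEL = {3: "tip3p", 4: "tip4p", 5: "tip5p"}


def guess_water_model(num_sites):
    """Guess a water model name from the site count and warn the user."""
    if num_sites not in _SITES_TO_MODEL:
        raise ValueError(f"Cannot guess a water model with {num_sites} sites.")
    model = _SITES_TO_MODEL[num_sites]
    warnings.warn(f"No 'water_model' property found; assuming '{model}'.")
    return model


print(guess_water_model(3))
```

The warning makes it visible to the user that the topology used for the free leg was inferred rather than read from the system.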

I think a perturbable molecule reader would be a good goal too. This should be easy enough for the SOMD input, i.e. just reading the pert file and the lambda = 0 rst7/prm7. (Other than the naming/ordering not corresponding to the original system, so this may differ between nodes!). GROMACS is a little harder, but should be doable too.

Let me know if there are other things that you find are breaking the modularity. Most of this seems to be a special case for the free energy perturbation set-up/simulation, since additional information is required that is not stored in the coordinate and topology files, e.g. the pert file itself and the name of the water topology. (We could always pass the water topology name to the FreeEnergy constructor to choose a water topology for the free leg.)
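One way a script could carry the water topology name through explicitly is as an extra command-line argument. The sketch below is hypothetical: the `--input1`, `--input2` and `--output` names mirror the command shown at the top of this issue, but how custom_prepareFEP.py actually parses its inputs is not shown here, and the `--water_model` option, its choices, and the file names are invented for illustration.

```python
import argparse


def build_parser():
    """Build a parser mirroring the command in this issue, plus one new option."""
    parser = argparse.ArgumentParser(description="Prepare an FEP calculation.")
    parser.add_argument("--input1", nargs="+", required=True)
    parser.add_argument("--input2", nargs="+", required=True)
    parser.add_argument("--output", required=True)
    # Hypothetical option: None means "not given", so the caller can fall
    # back to a stored property or a guess.
    parser.add_argument(
        "--water_model",
        choices=["spc", "spce", "tip3p", "tip4p", "tip5p"],
        default=None,
    )
    return parser


# Hypothetical file names, used only to exercise the parser.
args = build_parser().parse_args(
    ["--input1", "lig6a.prm7", "lig6a.rst7",
     "--input2", "lig6b.prm7", "lig6b.rst7",
     "--output", "6a_6b",
     "--water_model", "tip3p"]
)
print(args.water_model)
```

Leaving the default as None keeps the argument optional, so existing invocations would continue to work unchanged.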

Aug 13 '19 11:08 lohedges

I have pushed a workaround for your issue which should work in most cases. The only thing I don't account for is detecting SPC/E rather than TIP3P. You should no longer need the following in your script:

from Sire import Base as _SireBase
system1._sire_object.setProperty("water_model", _SireBase.wrap("tip3p") )

Aug 13 '19 11:08 lohedges

Hi @jmichel80, is this workaround satisfactory for your use case? Assuming you'd like to solvate in one BioSimSpace node, then set up an FEP simulation in another, BioSimSpace would correctly detect the water model that was used in the solvation stage, i.e. you wouldn't need to pass the name of the chosen water model through as an extra input argument.

As you suggest above, I think we should still have a discussion about what other information might need to be preserved between nodes and how we'd like to go about handling this. Ideally the node output should be human readable, so it might be easiest to have some kind of simple metadata file that's passed through. This could contain basic system properties such as the name of the force field used to parameterise molecules and the name of the water model, i.e. things that could be inferred from the information in the topology file but aren't explicitly included there.

Sep 06 '19 08:09 lohedges

That sounds like a reasonable workaround (I haven't tested it in detail yet).

For preserving information, what about an optional YAML-type file that can be passed around to help nodes work out what is in the input?

Sep 06 '19 11:09 jmichel80

Yes, it could just be an additional record in the YAML file that is already created by a node.
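For illustration only, a record of this kind could be as simple as a flat set of "key: value" lines, which is a valid subset of YAML for plain string values. The sketch below is not BioSimSpace's node output format: the file name, the keys `forcefield` and `water_model`, and their values are assumptions, and the hand-rolled reader and writer stand in for a proper YAML library so that the example has no dependencies.

```python
import os
import tempfile


def write_metadata(path, metadata):
    """Write a flat dict of strings as 'key: value' lines."""
    with open(path, "w") as handle:
        for key, value in sorted(metadata.items()):
            handle.write(f"{key}: {value}\n")


def read_metadata(path):
    """Read 'key: value' lines back into a dict, skipping blanks and comments."""
    metadata = {}
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if not line or line.startswith("#"):
                continue
            key, _, value = line.partition(":")
            metadata[key.strip()] = value.strip()
    return metadata


# One node writes the record next to its output files; the next reads it.
with tempfile.TemporaryDirectory() as tmp:
    record = os.path.join(tmp, "metadata.yaml")  # hypothetical file name
    write_metadata(record, {"forcefield": "ff14SB", "water_model": "tip3p"})
    loaded = read_metadata(record)

print(loaded["water_model"])
```

A downstream node could then prefer the recorded water model and fall back to guessing only when the record is absent.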

Oct 16 '19 14:10 lohedges