BioSimSpace icon indicating copy to clipboard operation
BioSimSpace copied to clipboard

[TUTORIAL] Steered MD

Open lohedges opened this issue 4 years ago • 73 comments

This is a thread to discuss the creation of a tutorial showing how to implement steered molecular dynamics in BioSimSpace.

lohedges avatar Mar 02 '21 16:03 lohedges

Adding current diagram summarising use case:

steered_md-white

jmichel80 avatar Mar 03 '21 21:03 jmichel80

I have started testing GROMACS+PLUMED for steered MD here.

I looked at BioSimSpace.Metadynamics.CollectiveVariable and the only available ones at the moment are distance, funnel and torsion. So for now I prepared PLUMED input and structure reference files as I did before, and it was simple to add PLUMED used usage to BioSimSpace.Protcol.Production through using BioSimSpace.Process.Gromacs.setArgs().

I have also looked at creating a BioSimSpace.Protcol.Metadynamics with a dummy BioSimSpace.Metadynamics.CollectiveVariable.Distance and simply modify the plumed.dat file to change to RMSD, but it uses METAD whereas I've been using MOVINGRESTRAINT. An alternative would be to simply replace that file completely, and that would save the need to extend the command line arguments of the protocol. That however doesn't change the need for an appropriately numbered and weighted reference structure which I don't think BSS prepares at the moment.

AdeleHardie avatar Mar 04 '21 13:03 AdeleHardie

Following today's discussion, I'm linking the functions I use to prepare input PLUMED files for steered MD. renumber_pdb() and rmsd_reference() here are used to prepare the reference PDB structure, and steered_md_plumed_input() here prepares plumed.dat.

The CV itself (RMSD here) and the MOVINGRESTRAINT in PLUMED are separate so it should be easy to extend this to other CVs.

AdeleHardie avatar Mar 18 '21 15:03 AdeleHardie

Thanks, @AdeleLip, that's really useful. I'll see what's possible to re-implement within BioSimSpace.

FYI: I've just pushed an update that adds a .getFrame method to all BioSimSpace.Process classes. This allows you to extract individual trajectory frames from a process, rather than first having to get the entire trajectory, or pass the full paths to the trajectory and top files to BioSimSpace.Trajectory.getFrame. The indices are validated against the expected range and the method returns None if a corresponding frame isn't found. Note that the SOMD trajectory writer is a bit funky so there is sometimes a mismatch between the number of frames written in the header and what is actually in the file. GROMACS also is sometimes off by one too, i.e. there is one extra frame in the file.

lohedges avatar Mar 18 '21 16:03 lohedges

Yes, most functionality for doing this already exists in BioSimSpace so I'll cook up an example tomorrow. Do you have example input files for the system and target (RMSD reference) that you could share? (Perhaps they are already in the repository, but I missed them.)

Would the system and target normally be separate files to begin with, or could the target be extracted from the system with the coordinates replaced with the those of the desired target conformation? I'm just trying to figure out the logical workflow within BioSimSpace, i.e. would the user load two PDB files to start with, one for the system, one for the target, or could we actually create the target in BioSimSpace? (Apologies for any incorrect notation.)

lohedges avatar Mar 18 '21 17:03 lohedges

Yes I've adjusted the trajectory analysis notebook to use getFrame() and it works great for extracting snapshots.

Here are some example input files. The system files are just results of equilibration, and the target is a crystal structure (with a bit of modification since I use it to prepare other systems).

The way I prepare RMSD reference is to simply take the PDB file and replace the atom indices with the corresponding atom indices from the system itself. I wrote that code very early on (before I used BSS), but it simply loops over all of the residues of the system and the target protein, and within each residue finds atoms with the same names, then changes target protein indices with the ones from the system. This assumes that the proteins have the same sequence start (e.g. in my case the target has to have an ACE cap as well) but can handle things like hydrogens missing from the reference PDB (so you can get residues with indices like [6, 8, 10] where the hydrogens were just ignored).

One of the things that it doesn't handle (which I think it really should) is to check that the sequences match, e.g. that the residues with the same index have the same name. Because of the nature of my system, the renumbering gets cut off when it reaches the NME cap, which in the target is a GLY, so only the N atom matches and gets carried over. It isn't used for RMSD calculations and doesn't affect the alignment for this particular system but this is a patchy solution.

It could also be possible to replace the system coordinates with those extracted from the PDB, and then save the BioSimSpace molecule as a PDB for plumed to use. I get the indices directly from the whole system PDB, which can be useful especially if the protein isn't the first molecule, but I remember you mentioning you have ways to deal with that.

AdeleHardie avatar Mar 18 '21 18:03 AdeleHardie

Thanks for the detailed explanation. For the example files that you provided there are residues in the target PDB that aren't found in the system. Using the Sire/BioSimSpace search functionality I find 284 out of the 301 residues. Is this to be expected? If so, can I safely ignore any unmatched residues and only update the indices of the ones that I find.

With regards to the PDB file required by PLUMED: Do you use the target PDB with the indices updated (and the beta and occupancy columns tweaked)? Would it also work if I wrote a PDB for the corresponding molecule in the system after updating the coordinates? (Perhaps only writing the residues that match.)

lohedges avatar Mar 19 '21 10:03 lohedges

In this case the target protein (WPD loop closed) sequence is a bit longer than the system (WPD loop open) since the C terminus of the WPD loop open is disordered and we are modelling this protein truncated. It is safe to ignore any unmatched residues, since PLUMED doesn't care about continuity, just the atom index. In some of their RMSD examples they only pass the few atoms they want to use for the calculation (we need the others for system alignment, but it illustrates the point).

I use the target PDB with the indices updated and then the beta and occupancy columns tweaked. All PLUMED needs is that the indices are the same between the target and the system, and the two columns let it know which indices to use for alignment and which for RMSD calculation. So if it's easier in BSS to find the matching residues/atoms and update coordinates in some copy of the system that should work perfectly.

AdeleHardie avatar Mar 19 '21 11:03 AdeleHardie

I have now added some more tutorial background in the setup-sMD notebook. Some of it is subject to change depending on what you decide to implement in BSS, but it might also give more background on how I prepare the simulation at the moment.

AdeleHardie avatar Mar 19 '21 14:03 AdeleHardie

Okay, here is an archive containing a hokey script that uses Sire/BioSimSpace to create a PDB file for PLUMED. Just run test.py in the extracted directory, which should create a test.pdb output (which is already included, in case you don't want to run it.) Could you check this file and see if it all looks okay?

The logic is as follows:

  • Load the system and the target.
  • Match residues in the target to those in the system by name, and work out the unique molecule to which they match, i.e. all target residues must match to a single molecule of interest.
  • Remove duplicate matches (there will be residues with the same name) by looping through the residues in the target and taking the first match from the system until we've exhausted all matches. If the residue was already matched, we take the next, and so on. As mentioned above, we don't end up matching all residues from the target, since the molecule in the system is a truncated version. The molecule in the system has 284 residues and the target has 301. I match 277 of the target residues.
  • Create a mapping between the residue indices in the system to those in the target.
  • Loop over residues in the system. If it is in the map, then loop through atoms and match by name against the first match in the target, copy across coordinates and set occupancy and beta factor. If it's not, then use normal coords and set occupancy / beta to the opposite value.

The code would be simplified massively if we could be sure that the target PDB file contained all residues from the matching molecule in the system, since we could use the ResIdxAtomNameMatcher from Sire.Mol. This would directly return a mapping between the atoms in the system that match those in the target. Here I guess the target PDB might contain a subset of the molecule, so we might be missing residues, or there could be gaps in the indexing. Let me know if this isn't the case and I'll simplify the code. (The current implementation could be causing by taking the first match it finds.)

I've not extensively tested this, so let me know if there is something obviously wrong with the logic. For example, I'm not sure if I'm correctly setting the occupancy and beta factors for atoms that aren't matched, or whether I'm including too many atoms in the PDB. (It contains the entire molecule from the system.) It should also work for cases where the matched molecule isn't the first in the system (which it is in this example). The PDB writer already uses the atom number, which accumulates across all molecules, rather than the index, which counts from 0 for each molecule.

It should be easy for me to revise this to get something that works a little more robustly.

Cheers.

lohedges avatar Mar 19 '21 16:03 lohedges

Note that if we don't need additional info in the PDB file, e.g. MODEL and CONECT records, then we can use Sire.IO.PDB2 directly and strip these from the output.

lohedges avatar Mar 19 '21 16:03 lohedges

I think the problem at the moment is the beta and occupancy columns - with how they are now, PLUMED will use the atoms that do not exist in the target structure to align the system and the all the atoms that matched to calculate the RMSD.

I can see you are copying the molecule and updating coordinates. In that context the logic once you've created the atom mapping I think should be:

  1. Loop over each atom in the copy molecule
  2. If it does not exist in target, delete it
  3. If it does exist in the target but does not fall in the 'RMSD residue' range, set the occupancy column to 1.00 and beta to 0.00
  4. If it does exist in the target and does fall in the 'RMSD residue' range, set the occupancy column to 0.00 and beta to 1.00

At the moment the alignment will be carried out to the first frame of the system's H atoms and some left over residues, which wouldn't work to correctly align to the target structure, and the RMSD would be calculated to the entire protein, which also wouldn't make steering possible. I'm attaching example renumbered and modified target files to illustrate this (apologies, I should have included them before).

About simplifying the code, in this case it would be easy to make the target PDB contain only what it contained in the system, since I could just also truncate it and leave out the N terminus cap, that would only require deleting some lines. But in the cases where it is swapped around, i.e. the system has the loop closed (and 301 residues) and the target has the loop open (and 284 residues) that would involve adding those residues to the target structure, and I'm not sure if the additional residues would throw off alignments. This is what I meant by saying RMSD is annoying to work with...

I see the script does a lot of things to determine which molecule the target structure is for - could we just ask the users to specify this? From my own usage of the scripts, I wouldn't mind that at all. Then there could be two assumptions to make:

  1. All the residues in the target PDB appear in the system
  2. All the residues indices in the target PDB are the same as in the system (i.e. residue 20 in target is the same as residue 20 in system, even though some atoms may be missing)

I feel like those are reasonable expectations and would only need some modification when preparing target structure. This would be able to handle mismatched sequence length (assuming they are numbered accordingly) and also gaps in the system (full residues and one off atoms). The only issue for the user would be to make sure the residues are numbered correctly if there is an offset (e.g. sequence start is longer in the system or the target) or if there are any gaps. What do you think?

AdeleHardie avatar Mar 19 '21 17:03 AdeleHardie

Ah brilliant, thanks for the clarification, I thought I was missing something obvious ;-) I had also assumed that all atoms in the target PDB that appear in the system were used for the RMSD, rather than using a subset of the atoms using the "RMSD residue range". (Looking again at your code, this is obvious.)

I think both of the assumptions that you suggest are perfectly sensible and would massively simplify the problem. If we are clear with the expectations for the target PDB file, then it will likely save a lot of headaches. The only reason for trying to detect the target molecule was in case the system had been re-ordered since it was originally loaded. I am imagining a situation where the user might load a single molecule, then solvate it, do some other operations, delete things, recombine things, etc. If we make the assumption that all residues in the target PDB are in the system, then this would simplify the search too. (I imagine we could just take the molecule with the closest number of residues, which would likely work in the majority of cases.) We could provide an option to pass the index, which always takes precedence. (For example, this is what I do with the funnel metadyamics code when detecting the ligand.)

With regards to the "RMSD residue range": Is this always contiguous, or are there cases were you might want multiple ranges, or the ability to pass specific indices rather than a range, or a combination of indices and ranges? This would complicate the atom matching a little, but it could be broken up into stages for each contiguous chunk.

I'll rework things on Monday.

lohedges avatar Mar 19 '21 18:03 lohedges

In my experience the RMSD residues have always been continuous, but I'm not sure if that would always be the case. Most other libraries I have used let you give specific indices and/or have some sort of selection language. I think anywhere where the topology is easily accessible, passing indices can give a lot of freedom and makes your work easier. That would also allow for modifications such as using backbone or heavy atoms only, which is common in RMSD calculations.

AdeleHardie avatar Mar 22 '21 14:03 AdeleHardie

Sorry for the slow update on this. I realised the steered-MD is generic to any collective variable supported by PLUMED, with targeted MD (using RMSD as the target) being a special case. I'm just working on making things general so that we can run steered-MD simulations with any of the collective variables that we currently support (besides the funnel).

Just looking at your example files above (from plumed_files.zip) I see that all of the occupancy values are 1.00, but the beta values are either 0.00 or some other floating point number, i.e. not just 0.00 or 1.00. For example:

HETATM    6  O   ACE A   1      74.956  14.219  22.683  1.00  0.00           O
HETATM    5  C   ACE A   1      74.425  13.962  21.610  1.00  0.00           C
HETATM    2  CH3 ACE A   1      75.217  13.607  20.377  1.00  0.00           C
ATOM      7  N   MET A   2      73.102  13.935  21.373  1.00  0.00           N
ATOM      9  CA  MET A   2      72.154  14.306  22.433  1.00  0.00           C
ATOM     22  C   MET A   2      71.075  15.176  21.826  1.00  0.00           C
ATOM     23  O   MET A   2      71.031  15.395  20.626  1.00  0.00           O
ATOM     11  CB  MET A   2      71.579  13.006  23.035  1.00  0.00           C
ATOM     14  CG  MET A   2      72.692  12.144  23.654  1.00  0.00           C
ATOM     17  SD  MET A   2      71.976  10.681  24.447  1.00  0.00           S
ATOM     18  CE  MET A   2      72.318  11.053  26.186  1.00  0.00           C
ATOM     24  N   GLU A   3      70.209  15.654  22.735  1.00 55.62           N
ATOM     26  CA  GLU A   3      69.400  16.881  22.481  1.00 55.73           C
ATOM     37  C   GLU A   3      68.743  16.815  21.109  1.00 54.13           C
ATOM     38  O   GLU A   3      69.379  16.458  20.115  1.00 53.13           O
ATOM     28  CB  GLU A   3      70.285  18.127  22.559  1.00 57.90           C
ATOM     31  CG  GLU A   3      69.546  19.445  22.375  1.00 61.02           C
ATOM     34  CD  GLU A   3      68.487  19.673  23.436  1.00 63.99           C
ATOM     35  OE1 GLU A   3      68.782  19.448  24.628  1.00 65.95           O
ATOM     36  OE2 GLU A   3      67.361  20.084  23.080  1.00 65.20           O
ATOM     39  N   MET A   4      67.462  17.156  21.064  1.00 51.17           N
ATOM     41  CA  MET A   4      66.717  17.149  19.818  1.00 48.94           C
ATOM     54  C   MET A   4      67.270  18.240  18.907  1.00 47.90           C
ATOM     55  O   MET A   4      67.182  18.145  17.684  1.00 45.73           O
...

Has something gone wrong here, or am I missing something? (I know that the two columns just represent weights, so can be any number, but your code and the description above suggests that they will always be 0 or 1.)

lohedges avatar Mar 31 '21 09:03 lohedges

You are looking at renumbered.pdb which only has adjustments made to the atom index. The actual file used by PLUMED (should be in the PLUMED example file) is reference.pdb, where the beta and occupancy columns are all 0.00 or 1.00. The numbers in renumbered.pdb are whatever the original pdb had there. I have this in two steps because I was playing around with which residues to use for RMSD and so I just renumbered it once.

AdeleHardie avatar Mar 31 '21 10:03 AdeleHardie

Ah, perfect. I actually thought I had opened the reference file but my terminal autocomplete must have filled in renumbered instead when I hit tab. Bah :facepalm:

lohedges avatar Mar 31 '21 10:03 lohedges

I've pushed some changes to the feature_steered_md branch. Here I've implemented an RMSD collective variable that can be used for metadynamics simulations as well as steered MD, although I've yet to write the driver code for steered MD. You can pass in a system and reference molecule to the RMSD constructor and it will do the work of generating the reference PDB file for you, e.g.

import BioSimSpace as BSS

# Load a (boring) example system.
system = BSS.IO.readMolecules("amber/ala/*")

# Extract a molecule and translate it for fun.
reference = system[0]
reference.translate(3*[BSS.Units.Length.angstrom])

# Create a RMSD collective variable. Since we don't pass an index for the reference molecule the code
# chooses the molecule in the system with the closest number of residues. Here we've not specified any
# indices for the atoms involved in the RMSD calculation, so all are used. If the user passes a list of
# 'rmsd_indices' then any matching atoms that aren't in this list are used for alignment. We require that
# all of the atoms in the reference are present in the system and that the ordering of residues is the
# same.
rmsd = BSS.Metadynamics.CollectiveVariable.RMSD(system, reference)

# Print the list of PDB strings. These are written to 'reference.pdb' in the working directory when setting up
# a process for metadynamics or steered MD.
print(rmsd.getReferencePDB())

I'm now working on implementing the required moving restraint code for PLUMED. For now I'll probably limit this to a subset of the collective variables that we currently support. (Probably distance and RMSD.)

lohedges avatar Apr 12 '21 13:04 lohedges

Let me know if the constraints are too stringent, e.g. if we want to be able to handle reference molecules that have atoms that aren't in the system. (This is the case for your original example.) It would just be nice to have a way of knowing that we've obtained a good match without the user needing to look at the PDB file, especially for a large molecule. Perhaps we could add a flag to allow a mismatch in the atom number if the user knows that this is the case, e.g. when modelling a truncated version of a molecule.

lohedges avatar Apr 12 '21 14:04 lohedges

I'm also not sure whether it's best to specify the atoms that are involved in the RMSD calculation by index, or to use the residue index as you have done. The atom index seems more flexible, but possibly more work for the user. The rmsd_indices list is in refers to the reference molecule, so the user doesn't need to work out the actual index within the matching molecule in the system.

lohedges avatar Apr 12 '21 14:04 lohedges

Also, do we need any special handling for the case were all atoms in the reference are used for the RMSD, e.g. should both the occupancy and beta values be set to 1? Your examples always match a subset of the atoms, where we have 0/1 for the RMSD atoms, and 1/0 for the others. (The example in the PLUMED docs just uses 1 for both sets of weights.)

lohedges avatar Apr 12 '21 14:04 lohedges

About specifying atoms vs whole residues - I think atom indices are the way to go. In my case I use all heavy atoms and the reference PDB does not have Hs so it's not as apparent, but I believe it's common to not use all atoms in a residue to calculate RMSD (e.g. only backbone atoms).

To handle the cases where all atoms are used for RMSD (and possibly allow more flexibility) the user could pass two lists, one with occupancy and one with beta values? It is possible users may want to both use atoms for alignment and for RMSD calculation, such as when they're using a large chunk of the protein for RMSD and the leftover atoms don't give good alignment. I'm not sure how common that would be though, so there could of course be an option to use 'all' indices that handles this specific case. I doubt there would be use for whole protein RMSD during steered MD, but it might be useful for post run analysis (if it's possible to use CVs for that, I haven't looked closely?)

I don't think having to have all reference atoms in the system is that constraining or difficult, but I guess allowing 'unmatched' atoms to be ignored would give it more flexibility?

AdeleHardie avatar Apr 12 '21 15:04 AdeleHardie

I've now created a BioSimSpace.Protocol.Steering class that can be used to set up a schedule for performing steered MD. I just need to convert this to the correct PLUMED file to implement the moving restraint, as well as making sure that I don't break anything related to our existing PLUMED metadynamics functionality.

lohedges avatar Apr 15 '21 15:04 lohedges

For simplicity I've decided to leave the RMSD CV as is for now. It should work for the purposes of the steered MD (assuming that inputs are edited so that the reference doesn't contain any atoms that aren't in the system). I'll think about making it more flexible after the workshop, i.e. allowing the user to customise the RMSD and alignment weights on a per-atom basis.

lohedges avatar Apr 15 '21 15:04 lohedges

With regards to output from PLUMED when running steered MD: It looks like you can just print exactly the same information as you would when running metadynamics, i.e. the time series values of the collective variables and the associated bias. If so, I should be able to re-use the existing code for getting this information dynamically from a running process.

lohedges avatar Apr 15 '21 15:04 lohedges

I've now got steered MD working with GROMACS in the feature_steered_md branch. I'll test it with your example input (I'll tweak the reference file to only include the atoms in the system) then post an example that you can try locally once I've merged.

lohedges avatar Apr 16 '21 10:04 lohedges

Does the RMSD CV require that the reference have the same number of residues and non H atoms as the system molecule it is being aligned to? When I run the following:

import BioSimSpace as BSS
system = BSS.IO.readMolecules(['data/system.prm7', 'data/system.rst7'])
reference = BSS.IO.readMolecules('data/reference.pdb').getMolecule(0)

#get loop indices for rmsd
rmsd_indices = []
for residue in reference.getResidues():
    if 178<=residue.index()<=184:
        for atom in residue.getAtoms():
            rmsd_indices.append(atom.index())

rmsd_cv = BSS.Metadynamics.CollectiveVariable.RMSD(system, reference, 0, rmsd_indices)

I get the following error: Exception 'SireMol::missing_residue' thrown by the thread 'master'. There is no residue with the number "284" in the layout "{bcad4c83-4b71-4557-b057-5afa1ba6a277}".

The difference between the reference PDB and the protein in the system is that reference PDB does not have H atoms and has 283 residues while the system has 284. However, when I add a line with a dummy atom to make up a 284th residue, and rerun the same code, I get: IncompatibleError: Didn't match all of the atoms in the reference molecule: Found 2307, expected 2308.

Then I added another dummy atom to the 284th residue of the reference, but get the following: IncompatibleError: Didn't match all of the atoms in the reference molecule: Found 2307, expected 2309.

Changing one of the atom names in residue 284 of reference PDB to match one of the atom names in residue 284 of the system changes the error to: IncompatibleError: Didn't match all of the atoms in the reference molecule: Found 2308, expected 2309.

So it looks like at the moment it is expected that the molecules match up perfectly. I thought the CV would be able to handle differences in molecules as long as all the atoms in the reference were present in the system (which was the case in the first scenario). Could you clarify what the requirements are for the reference/system match for the CV?

(for reference, I'm using a version of BSS cloned from the feature_steered_md branch around noon today)

AdeleHardie avatar Apr 16 '21 12:04 AdeleHardie

The expectation is:

  1. The reference can be matched to a single molecule in the system. (If the user passes an index, then this is clearly unambiguous.)
  2. All atoms in the reference occur within the matching molecule in the system and have the same name.
  3. Residues in the reference have the same numbering as those in the matching molecule in the system.

The above handles cases where the molecule has more atoms than the reference, but not cases where the reference does. If we need to support that, then I can remove the check. I guess we can raise a warning, since there might be situations where this indicates that something has gone wrong, or the reference / system is incorrect.

As a quick test, can you remove lines 164-166 of _rmsd.py and see if the output looks sane?

Are these the same example files as previously, if so, I can test here too.

lohedges avatar Apr 16 '21 13:04 lohedges

Atoms are matched by residue number and name, so the ordering within a residue can be different, or atoms missing from within a residue, i.e. heavy atoms only.

lohedges avatar Apr 16 '21 13:04 lohedges

The system is the same, but the target.pdb file has only residues 1-283 (simple deleted the rest), which are the ones present in the system.

This makes me think that something goes wrong when the residue number does not match between the two. I can see why I'm getting the IncompatibleError when I add atoms that aren't present in the system, but then the first case (where reference only had 283 residues) should have worked?

AdeleHardie avatar Apr 16 '21 13:04 AdeleHardie