somd-freenrg crashes when ligand atoms are completely non-interacting
**Describe the bug**
I've recently been trying to implement multiple distance restraints for absolute binding free energy calculations in BioSimSpace and SOMD (modified BSS, modified sire, and an example of how to run a complete calculation). The idea is that the ligand's intermolecular interactions are removed with many distance restraints active, and there is then a final stage where all but one of the restraints are released. The correction for releasing the ligand to the standard-state volume can then be calculated easily.
During the final release stage, where the ligand is completely non-interacting, I've been getting crashes. The config file I was using with my modified version of sire was:
save coordinates = True
ncycles = 50
nmoves = 25000
ncycles_per_snap = 4
buffered coordinates frequency = 500
timestep = 4 * femtosecond
reaction field dielectric = 78.4
cutoff type = cutoffperiodic
cutoff distance = 12 * angstrom
barostat = True
pressure = 1.00000 atm
thermostat = True
temperature = 300.00 kelvin
constraint = hbonds-notperturbed
energy frequency = 100
lambda array = (0.0, 0.125, 0.25, 0.375, 0.5, 1.0)
lambda_val = 0.0
perturbed residue number = 293
random seed = -1
gpu = 0
center solute = False
minimise = True
hydrogen mass repartitioning factor = 3
use distance restraints = True
use permanent distance restraints = True
distance restraints dictionary = {(1474, 4687): (4.432198728689354, 20.0, 0.5557321930560404), (1472, 4686): (3.164351608202915, 20.0, 0.6751681461642662), (1474, 4690): (4.700974930995326, 20.0, 0.6701990692583388), (1488, 4688): (6.7858681626072315, 20.0, 0.7488970818377805), (2444, 4677): (4.863475388033847, 20.0, 0.7671705708955612), (1472, 4691): (6.367026275053535, 20.0, 0.821252569114769), (2261, 4683): (6.773986513796716, 20.0, 0.8828616026118992), (1490, 4685): (7.720582022157998, 20.0, 0.9347068069890287), (286, 4674): (3.8127377709679453, 20.0, 0.9511268213071902), (374, 4680): (6.181106731821419, 20.0, 0.9637979205880889), (1521, 4692): (3.7598878700775633, 20.0, 0.9830902561404447), (1529, 4684): (7.87209938990775, 20.0, 1.0582870196044247), (374, 4693): (4.975946331072206, 20.0, 1.063773425343931), (346, 4676): (8.470528418491696, 20.0, 1.0492604125738865), (284, 4675): (4.725973194492003, 20.0, 1.1798376555744614), (390, 4682): (8.350568045090748, 20.0, 1.2118321370763052), (2436, 4679): (4.32826597759919, 20.0, 1.2565532487514752), (307, 4678): (7.55090821416431, 20.0, 1.3127014576568437), (1421, 4681): (8.80522670899781, 20.0, 1.3688291245090927), (259, 4673): (9.195983552051707, 20.0, 2.1342422686136686)}
permanent distance restraints dictionary = {(1474, 4689): (3.875183995573175, 20.0, 0.5185270232355128)}
turn on receptor-ligand restraints mode = True
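For concreteness, here is a rough sketch of how I read each (r0, k, D) entry above, and of the standard-state correction for releasing the single permanent restraint. The flat-bottomed-harmonic form is my assumption about the convention, not SOMD's actual code:

```python
# Rough sketch, not the exact SOMD implementation: each (r0, k, D) entry is
# read as a flat-bottomed harmonic - zero within D of r0, quadratic with
# force constant k (kcal/mol/A^2) outside it.
import numpy as np
from scipy.integrate import quad

kT = 0.0019872041 * 300.0  # kcal/mol at 300 K


def restraint_energy(r, r0, k, D):
    """Flat-bottomed harmonic restraint energy (kcal/mol) at distance r (A)."""
    excess = abs(r - r0) - D
    return k * excess**2 if excess > 0.0 else 0.0


# The single permanent restraint above, between atoms 1474 and 4689.
r0, k, D = 3.875183995573175, 20.0, 0.5185270232355128

# Effective volume accessible under the restraint, and the free energy of
# releasing the ligand into the standard-state volume V0.
V0 = 1660.0  # A^3 per molecule at 1 M
v_eff, _ = quad(
    lambda r: 4.0 * np.pi * r**2 * np.exp(-restraint_energy(r, r0, k, D) / kT),
    0.0, 50.0,
)
print(f"Release correction: {-kT * np.log(V0 / v_eff):.2f} kcal/mol")
```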
And the pert file just mapped dummy atoms to dummy atoms, e.g.:
version 1
molecule LIG
atom
name C
initial_type du
final_type du
initial_LJ 0.00000 0.00000
final_LJ 0.00000 0.00000
initial_charge 0.00000
final_charge 0.00000
endatom
atom
name C1
initial_type du
final_type du
initial_LJ 0.00000 0.00000
final_LJ 0.00000 0.00000
initial_charge 0.00000
final_charge 0.00000
endatom
...
I still observed crashes at random values of lambda when I dropped the timestep to 1 fs and when I switched to Langevin dynamics.
To check whether this was due to my changes to sire or to the restraints, I ran some sets of calculations with Sire 2023.4.0 (a073701d68cd61f9be56e360be3b3ee3f812bbba/GitHub/2023-08-07T17:45:41+01:00 | UNCLEAN), still with the ligand non-interacting but with no restraints. Please see the input files. For each test I ran four repeats of each window. I got two crashes at lambda = 0 (NaN), and all simulations at lambda = 1 resulted in Segmentation fault (core dumped) from somd-freenrg -C somd.cfg -p CUDA -m somd.pert -c somd.rst7 -t somd.prm7.
Have I chosen some configuration parameters poorly, or should this not be happening?
Also, when I watch the trajectories back, the ligand makes large jumps after every 50th frame. There are 600 frames, and 600 / 50 = 12 while ncycles / ncycles_per_snap = 50 / 4 = 12.5, so I assume this is an artifact of how the trajectories are written?
**To Reproduce**
Please see the input files. Run run.sh with sire 2023.4.0. (I observed two crashes in four repeats of the stage.)
**Expected behavior**
Stable MD.
**(please complete the following information):**
- OS: Ubuntu 22.04
- Version of Python: 3.10.12
- Version of sire: 2023.4.0 (a073701d68cd61f9be56e360be3b3ee3f812bbba/GitHub/2023-08-07T17:45:41+01:00 | UNCLEAN)
- I confirm that I have checked this bug still exists in the latest released version of sire: [no] (As I would rather avoid rerunning too many simulations unless there have been very recent changes which are likely to affect this)
Thank you.
hi @fjclark - do you also observe crashes with MIF inputs from the MDR paper repo (that presumably did not cause runtime errors with earlier versions of sire used for the MDR paper) ?
Can confirm that I see the same segmentation fault at lambda = 1. Will see if I can trace it.
Just to note that I don't see a mapping from dummy ambertype to dummy ambertype in your example pert files, but this doesn't seem to be the source of the problem:
version 1
molecule LIG
atom
name C
initial_type C3
final_type du
initial_LJ 0.00000 0.00000
final_LJ 0.00000 0.00000
initial_charge 0.00000
final_charge 0.00000
endatom
atom
name C1
initial_type C3
final_type du
initial_LJ 0.00000 0.00000
final_LJ 0.00000 0.00000
initial_charge 0.00000
final_charge 0.00000
endatom
...
> hi @fjclark - do you also observe crashes with MIF inputs from the MDR paper repo (that presumably did not cause runtime errors with earlier versions of sire used for the MDR paper)?
Hi @jmichel80, I'll check now (yes, I did not observe these crashes previously).
For lambda = 1, I can confirm that the segfault occurs on this line of OpenMMMD.py. I'll add some debugging statements to the C++ code to see where it's blowing up. (Obviously this bit works at lambda = 0, so NaNs there might be completely unrelated.)
> Just to note that I don't see a mapping from dummy ambertype to dummy ambertype in your example pert files, but this doesn't seem to be the source of the problem: [...]
Sorry, I should have explained - this is because the standard version of sire doesn't allow atoms to start and end as dummy atoms, so I've had to modify the pert file. For my modified version of sire, this is allowed (https://github.com/OpenBioSim/sire/commit/45135642ec94caa2b611971a4d948bda2b15cd42) and the pert file is as shown in my original post.
Also, the pert files are missing an endmolecule record at the end, but again this doesn't seem to affect the segfault. Looking at the output of the parser, it seems to get to the initial_charge of the final atom and then crash:
...
"name O"
"initial_type O3"
"final_type du"
"initial_LJ 3.03981 0.21021"
"final_LJ 0.00000 0.00000"
"initial_charge -0.00000"
"final_charge -0.00000"
"endatom"
"atom"
"name O1"
"initial_type O3"
"final_type du"
"initial_LJ 0.00000 0.00000"
"final_LJ 0.00000 0.00000"
"initial_charge0.00000"
[1] 110167 segmentation fault (core dumped) somd-freenrg -C somd.cfg -t somd.prm7 -c somd.rst7 -m somd.pert -p CUDA
Ah, I see - there is no space in initial_charge0.00000 for the final record. Printing the line and the words (the tokens produced when the line is split) makes it clear what's going wrong. Not sure why I didn't spot this above :man_facepalming:
("atom")
"name O"
("name", "O")
"initial_type O3"
("initial_type", "O3")
"final_type du"
("final_type", "du")
"initial_LJ 3.03981 0.21021"
("initial_LJ", "3.03981", "0.21021")
"final_LJ 0.00000 0.00000"
("final_LJ", "0.00000", "0.00000")
"initial_charge -0.00000"
("initial_charge", "-0.00000")
"final_charge -0.00000"
("final_charge", "-0.00000")
"endatom"
("endatom")
"atom"
("atom")
"name O1"
("name", "O1")
"initial_type O3"
("initial_type", "O3")
"final_type du"
("final_type", "du")
"initial_LJ 0.00000 0.00000"
("initial_LJ", "0.00000", "0.00000")
"final_LJ 0.00000 0.00000"
("final_LJ", "0.00000", "0.00000")
"initial_charge0.00000"
("initial_charge0.00000")
[1] 114032 segmentation fault (core dumped) somd-freenrg -C somd.cfg -t somd.prm7 -c somd.rst7 -m somd.pert -p CUDA
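To illustrate the failure mode with a hypothetical Python analogue (the real parser is C++ inside sire): splitting the malformed line on whitespace yields a single token, so unchecked access to the second token runs off the end of the token array - an IndexError in Python, a segfault in C++.

```python
# Hypothetical Python analogue of the C++ parser's failure mode.
line = "initial_charge0.00000"   # the malformed record, missing its space
words = line.split()             # -> ["initial_charge0.00000"], one token only
key, value = words[0], words[1]  # IndexError here; out-of-bounds read in C++
```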
After fixing this, the lambda = 1 window runs. Not sure if it will NaN later, though.
Sorry about this - it looks like I messed up editing a few of the pert files, so that initial LJ terms remained for O. The pert file does seem to be correct for lambda = 0, though, which is where the NaNs occur.
No problem. I'll look at that one to see if I can get it to fail.
No crashes yet, but I can confirm that I see the ligand "jump" after every 50 frames. It almost looks like everything is slightly rotated within the box. It's visible for the protein and ions too, but harder to tell when waters are included. Not sure if this is some buffering error, i.e. a larger jump in time between frames 49-50, 99-100, ..., than between any other pair of frames, or whether something else is going on.
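One quick way to quantify the jumps (a sketch assuming MDAnalysis is available; the topology/trajectory file names are placeholders for the SOMD outputs):

```python
import numpy as np
import MDAnalysis as mda

# Placeholder file names for the SOMD topology and trajectory.
u = mda.Universe("somd.prm7", "traj000000001.dcd", topology_format="PRMTOP")
lig = u.select_atoms("resname LIG")

# Ligand centre of mass at every frame.
com = np.array([lig.center_of_mass() for ts in u.trajectory])

# Frame-to-frame displacement; a buffering artifact should show up as
# spikes at frames 49 -> 50, 99 -> 100, and so on.
jumps = np.linalg.norm(np.diff(com, axis=0), axis=1)
for i in np.argsort(jumps)[::-1][:12]:
    print(f"frame {i} -> {i + 1}: {jumps[i]:.2f} A")
```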
It seems like this is due to the use of OFF vs GAFF. When I reparameterise the ligand with GAFF2, keeping the initial coordinates identical, I haven't yet observed any crashes. I think the original ligand force field is OFF2 (set up by a collaborator - I have asked for confirmation). Tests run:
- Applying intermolecular restraints to the decoupled ligand (requires my modified version of Sire - happy to supply inputs if needed), using `constraint=hbonds-notperturbed`. OFF system: 2/6 simulations crash with NaN. GAFF2 system: 0/18 simulations crash.
- Setting `constraint=allbonds`. OFF system: immediately crashes in all cases. GAFF2 system: runs as expected in all cases.
The input systems are here.
This is another example of the issue discussed here, where `constraint=allbonds` causes OFF systems to crash.
Thanks.
Is this still an open issue? I'm doing some spring cleaning ;-)
I'm afraid this is still an open issue - using GAFF instead of any OFF seems to improve things substantially, but with more tests I still get the occasional crash with a fully decoupled ligand. I would also prefer to use OFF2 with `constraint=allbonds`, but this always causes crashes.
Would it be worth trying this in somd2? If you don't see the crash then we can debug where in somd things are going wrong. And if you still see the crash then this is definitely something we need to fix in somd2 :-)
Yes, definitely. As discussed, I'll have a shot in the next few weeks.
I've tried some similar tests with SOMD2:
Overall Set-Up
- Systems: Protein-ligand complexes where the ligand is parameterised with GAFF2 or OFF2, taken from the comment above.
- All input/scripts/setup: gh_issue_test_off_gaff_stability.tar.gz
The scripts are almost the same as those used in the somd2 ABFE issue. The only differences are:
- Most examples don't use any restraints, so can be run with standard SOMD2. The tests with restraints require my modified SOMD2 branch.
- I only used two windows, as I was looking for instabilities at the end state and wanted a sanity check that things were still stable at lambda = 0.
Software
- BioSimSpace: 2024.1.0.dev
- Sire: 2024.1.0.dev
- SOMD2: my modified SOMD2 branch, which has only very minor changes from SOMD2 main (merged in this morning before installing)
- Python: 3.11.8
- OS: Ubuntu 22.04
Tests
GAFF2
Sanity check with many windows
Carried out once to check that we got a free energy change of roughly the expected size:
barostat_frequency: 25
charge_scale_factor: 0.2
checkpoint_frequency: 100 ps
com_reset_frequency: 10
constraint: h_bonds
coulomb_power: 0.0
cutoff: "12 \xC5"
cutoff_type: rf
dynamic_constraints: true
energy_frequency: 10 ps
equilibration_time: '0 '
equilibration_timestep: 0.001 ps
frame_frequency: 20 ps
h_mass_factor: 1.5
include_constrained_energies: false
integrator: langevin_middle
lambda_schedule: "LambdaSchedule(\n morph: (-\u03BB + 1) * initial + \u03BB * final\n\
\ restraint: 1\n)"
log_file: somd2.log
log_level: info
max_gpus: 1
max_threads: 20
minimise: true
num_lambda: 34
output_directory: output
overwrite: false
pert_file: null
**perturbable_constraint: bonds_not_heavy_perturbed**
platform: cuda
pressure: 1 atm
restart: false
**restraints: null**
run_parallel: true
**runtime: 100 ps**
save_trajectories: true
save_velocities: false
shift_delta: "2 \xC5"
somd1_compatibility: false
swap_end_states: false
temperature: 298 K
**timestep: 0.004 ps**
- No crashes observed
- Free energy change of roughly the expected magnitude (~340 kcal/mol).
"Standard" Settings
10 repeats with two windows each:
barostat_frequency: 25
charge_scale_factor: 0.2
checkpoint_frequency: 100 ps
com_reset_frequency: 10
constraint: h_bonds
coulomb_power: 0.0
cutoff: "12 \xC5"
cutoff_type: rf
dynamic_constraints: true
energy_frequency: 10 ps
equilibration_time: '0 '
equilibration_timestep: 0.001 ps
frame_frequency: 20 ps
h_mass_factor: 1.5
include_constrained_energies: false
integrator: langevin_middle
lambda_schedule: "LambdaSchedule(\n morph: (-\u03BB + 1) * initial + final * \u03BB\
\n restraint: 1\n)"
log_file: somd2.log
log_level: info
max_gpus: 1
max_threads: 20
minimise: true
num_lambda: 2
output_directory: output_1
overwrite: false
pert_file: null
**perturbable_constraint: bonds_not_heavy_perturbed**
platform: cuda
pressure: 1 atm
restart: false
**restraints: null**
run_parallel: true
**runtime: 100 ps**
save_trajectories: true
save_velocities: false
shift_delta: "2 \xC5"
somd1_compatibility: false
swap_end_states: false
temperature: 298 K
**timestep: 0.004 ps**
- NaN crash observed in 1/10 runs
- Crash observed at lambda = 1, so likely the same issue as discussed in the somd2 ABFE issue, where the molecule is not properly coupled to the heat bath (the issue was slightly improved by using a Brownian integrator)
OFF2
"Standard" Settings
10 repeats with two windows each, same config as for GAFF2.
- NaN crashes observed for all runs for all lambda windows during dynamics
With timestep of 0.5 fs
Config the same as "standard", but with a 0.5 fs timestep.
- NaN crashes for all runs for all lambda windows during dynamics
With timestep of 0.5 fs and perturbable_constraint=none
Config the same as "standard", but with a 0.5 fs timestep and constraints removed via `perturbable_constraint=none`.
- NaN crash observed in 3/10 runs
- Crashes only observed at lambda = 1
Overall Points
- Issues are broadly the same as SOMD1
- GAFF2 simulations are generally much more stable than OFF2 simulations
- GAFF2
  - Fairly stable with `bonds_not_heavy_perturbed`. A single crash observed when the ligand was fully decoupled, possibly with the same cause as that observed for benzene in the SOMD2 issue.
- OFF2
  - Very unstable with `bonds_not_heavy_perturbed` - all dynamics crash with NaN at lambda = 0 and 1. My understanding is that `bonds_not_heavy_perturbed` constrains heavy bonds, so this seems to be the same issue as when we use `constraint=allbonds` with SOMD1.
  - Crashes at lambda = 0 are removed by using `perturbable_constraint=none`, but several crashes at lambda = 1 persist, so OFF2 still seems less stable than GAFF2.
Conclusion
Constraints on heavy bonds cause the OFF2 system to blow up at all values of lambda, but can be used on the GAFF2 system. Without constraints on heavy bonds, both systems seem stable when the ligand is fully interacting, but show crashes when the ligand is fully decoupled (and unrestrained).
Could you try with `h_bonds_not_perturbed`? I found that `bonds_not_perturbed` would end up constraining everything for the tyk2 set, and I also couldn't run anything. (I think this was possibly due to minimising with constraints for a non-equilibrated system.)
Also, it might be worth checking which constraints are applied for the different force fields, and whether things are better if you use the `somd1_compatibility` mode. (Once a dynamics object is created, you can use `d.get_constraints()` to print which bonds are constrained.)
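Something along these lines should print the constrained pairs (a sketch: the file names are placeholders, the perturbable setup is omitted, and the constraint keyword spellings follow the somd2 configs above):

```python
import sire as sr

# Placeholder input files; perturbation setup omitted for brevity.
mols = sr.load("somd.prm7", "somd.rst7")

d = mols.dynamics(
    constraint="h_bonds",
    perturbable_constraint="h_bonds_not_heavy_perturbed",
)

# Print every constrained atom pair so the force fields can be compared.
for constraint in d.get_constraints():
    print(constraint)
```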
hi @fjclark - you could try using a Nose-Hoover integrator and adding a sub-thermostat to control the temperature of the solute separately from the rest of the system; see http://docs.openmm.org/development/api-python/generated/openmm.openmm.NoseHooverIntegrator.html
see also https://github.com/openmm/openmm/issues/2910
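For reference, a minimal sketch of the suggestion at the OpenMM level - the particle indices are illustrative (taken to match the ligand atom numbers in the restraints dictionary above), and wiring this into SOMD would need changes on the sire side:

```python
import openmm
from openmm import unit

# Global Nose-Hoover chain thermostatting the whole system.
integrator = openmm.NoseHooverIntegrator(
    298.0 * unit.kelvin, 1.0 / unit.picosecond, 4.0 * unit.femtosecond
)

# Illustrative ligand particle indices (matching the atom numbers used in
# the restraints dictionary above).
ligand_atoms = list(range(4673, 4694))

# Sub-thermostat so the otherwise-decoupled ligand is coupled to its own
# heat bath, independently of the rest of the system.
integrator.addSubsystemThermostat(
    ligand_atoms,                                  # particles to thermostat
    [],                                            # no thermostatted pairs
    298.0 * unit.kelvin, 1.0 / unit.picosecond,    # absolute motion
    298.0 * unit.kelvin, 1.0 / unit.picosecond,    # relative (pair) motion
)
```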
Thanks for the suggestions.
h_bonds constraints
@lohedges:
> Could you try with `h_bonds_not_perturbed`?
OFF2 h_bonds_not_heavy_perturbed
Same as "standard" protocol other than perturbable_constraint=h_bonds_not_heavy_perturbed and 20 repeats.
- This removes the instabilities at lambda != 1. NaNs at lambda =1 observed in 2/20 runs.
GAFF2 h_bonds_not_perturbed
- No NaNs observed over 20 repeats. So possibly more stable than with `bonds_not_heavy_perturbed` for GAFF2, but we can't say for sure, as we only saw one crash in that case.
OFF2 with SOMD1 Compatibility Mode
- Everything blows up during every dynamics attempt as before. As expected given that this happened with SOMD1.
OFF2 with SOMD1 Compatibility Mode and h_bonds_not_perturbed
- 0/10 repeats crash.
Constraints Applied
There are 33 bonds in the ligand: 11 to H and 22 involving only heavy atoms.
- OFF2/GAFF2 with `bonds_not_heavy_perturbed`: all 33 bonds constrained
- OFF2/GAFF2 with `h_bonds_not_heavy_perturbed`: only the 11 bonds to H constrained
So, as with SOMD1, constraints between heavy atoms are unusable with OFF2 but don't seem to make the GAFF2 simulations any less stable.
I'm going to bump this to 2024.2.0 as we've made a lot of progress, but more is needed before we put in all the fixes for 2024.1.0
I am bumping this to 2024.3.0. Could you check whether the latest fixes in 2024.2.0.dev have fixed the instabilities? They have fixed instabilities in other systems.
I retested the OFF2 simulations above (10 repeats, "standard" settings) with sire 2024.2.0.dev, but unfortunately all 10 simulations fail with NaN, as before.
Just to check: are you still using the somd1 compatibility mode? All recent fixes are applied in that layer.
Thanks - I wasn't. However, they all still crash when I turn on somd1 compatibility mode. Full config:
barostat_frequency: 25
charge_scale_factor: 0.2
checkpoint_frequency: 100 ps
com_reset_frequency: 10
constraint: h_bonds
coulomb_power: 1.0
cutoff: "12 \xC5"
cutoff_type: rf
dynamic_constraints: true
energy_frequency: 10 ps
equilibration_constraints: false
equilibration_time: '0 '
equilibration_timestep: 0.001 ps
frame_frequency: 20 ps
h_mass_factor: 1.5
include_constrained_energies: false
integrator: langevin_middle
lambda_energy: null
lambda_schedule: "LambdaSchedule(\n morph: initial * (-\u03BB + 1) + final * \u03BB\
\n restraint: 1\n)"
lambda_values: null
log_file: somd2.log
log_level: info
max_gpus: 1
max_threads: 20
minimise: true
num_lambda: 2
output_directory: output_1
overwrite: false
pert_file: null
perturbable_constraint: h_bonds_not_heavy_perturbed
platform: cuda
pressure: 1 atm
restart: false
restraints:
- !!python/object/apply:sire.legacy.MM._MM.BoreschRestraints
state: !!python/tuple
- sire_pickle_data: AAABSnheuxoUfZOBgYGxY+J8v1knHwQA2UxIbEaQnI1h6Tp0NUA+MxADAeshKH0YSh9B0g9SA1UnFAiRFwqC0sEO/Axx8QbP/yOpZwHZb//lZn3StnW19t+XrDgYcrINSZ4VZJ4DS0n7n53Kix3Ya3YqL5pTYf9t0ZzyWdmBDiIQk5HUs4HMg4giAJI8O8J96KoYGGzs2XpAoqs0n50G+R/keIYihlSGYoYSIJ3IkMmQx1ACUwUAAs1Nbw==
sire_pickle_version: 1
run_parallel: true
runtime: 100 ps
save_energy_components: false
save_trajectories: true
save_velocities: false
shift_delta: "1 \xC5"
somd1_compatibility: true
swap_end_states: false
temperature: 298 K
timestep: 0.004 ps
write_config: true
No problem. Thanks for confirming.