Error fix: Specify "standard_names=False" when loading pdb into mdtraj
Description
This change fixes an error where, for certain host PDBs, pAPRika can lose CONECT bond information during the release_apply_parameters windows of APR calculations. The error is given here (some line numbers will differ):
Traceback (most recent call last):
File [...]/openff-evaluator/openff/evaluator/workflow/protocols.py", line 1208, in _execute_protocol
protocol.execute(directory, available_resources)
File "[...]/openff-evaluator/openff/evaluator/workflow/protocols.py", line 695, in execute
self._execute(directory, available_resources)
File "[...]/openff-evaluator/openff/evaluator/protocols/paprika/forcefield.py", line 93, in _execute
build_protocol.execute(directory, available_resources)
File "[...]/openff-evaluator/openff/evaluator/workflow/protocols.py", line 695, in execute
self._execute(directory, available_resources)
File "[...]/openff-evaluator/openff/evaluator/protocols/forcefield.py", line 683, in _execute
topology = Topology.from_openmm(
^^^^^^^^^^^^^^^^^^^^^
File "[...]/openff/utilities/utilities.py", line 80, in wrapper
return function(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^
File "[...]/openff-toolkit/openff/toolkit/topology/topology.py", line 1495, in from_openmm
raise ValueError(msg)
ValueError: No match found for molecule C. This would be a very unusual molecule to try and parameterize, and it is likely that the data source it was read from does not contain connectivity information. If this molecule is coming from PDB, please ensure that the file contains CONECT records. The PDB format documentation (https://www.wwpdb.org/documentation/file-format-content/format33/sect10.html) states "CONECT records are mandatory for HET groups (excluding water) and for other bonds not specified in the standard residue connectivity table."
Todos
- [x] Speciy that MDTraj should not change naming when loading PDB files in
protocols/paprika/coordinates.py, to prevent an error.
Note
The file protocols/paprika/coordinates.py contains one other instance where an MDTraj trajectory is created using mdtraj.load_pdb(), located inside _atom_indices_by_role(), but I haven't encountered any errors related to that instance.
Codecov Report
Merging #552 (343f3be) into main (c4fbf16) will decrease coverage by
2.60%. Report is 2 commits behind head on main. The diff coverage is100.00%.
Additional details and impacted files
pAPRika can lose CONECT bond information during the release_apply_parameters windows of APR calculations
Could you elaborate on this a little on what's going on here (what triggers it, how often it happens, etc.)? I'm a little apprehensive about changing the default behavior of one of the main PDB loading pathways
@mattwthompson I can appreciate your apprehension. I haven't encountered any unintended consequences of making the change this PR suggests, but that doesn't mean there are provably none. This issue has only started to occur for me with PDBs describing host-guest systems unique to the "steroids" branch of the TAPROOM dataset (meaning not found in the main branch). @jeff231li mentioned the error given seemed familiar, so he may have encountered it before, but it seems to be an overall rare issue. Here is a little script showing two example PDBs, one that does not have this issue and one that does. The PDB that does have this issue, "hse_x02", can be corrected when standard_names=False is specified in mdtraj.load_pdb() (line 18).
import mdtraj
import requests
urls = [
"https://raw.githubusercontent.com/slochower/host-guest-benchmarks/steroids/taproom/systems/bcd/hex/b-hex-p.pdb",
"https://raw.githubusercontent.com/slochower/host-guest-benchmarks/steroids/taproom/systems/hse/x02/b-x02-p.pdb",
]
system_names = ["bcd_hex", "hse_x02"]
def has_CONECT_issue(url, system_name):
raw_pdb = requests.get(url)
with open(f"{system_name}.pdb", "wb") as f:
f.write(raw_pdb.content)
mdtraj_system = mdtraj.load_pdb(f"{system_name}.pdb") # Problem line
# Here the mdtraj trajectories are sliced to only contain the atoms of the host molecule, identified by a list of indices.
# The list of indices determined by pAPRika is 0..146 for bcd and 0..209 for hse
if system_name == "bcd_hex":
mdtraj_system_sliced = mdtraj_system.atom_slice(range(0, 147))
elif system_name == "hse_x02":
mdtraj_system_sliced = mdtraj_system.atom_slice(range(0, 210))
mdtraj_system_sliced.save(f"{system_name}-mdtraj.pdb")
with open(f"{system_name}-mdtraj.pdb", "r") as f:
for line in f.readlines():
if "CONECT" in line:
return False
return True
for i, url in enumerate(urls):
print(
f"System {system_names[i]} has the CONECT issue: {has_CONECT_issue(url, system_names[i])}\n"
)
I ran this script with mdtraj 1.9.9 (py311h90fe790_1). Looking at the PDB files produced by running the script will allow you to directly see the difference between raw and mdtraj-processed CONECT records. I think the script highlights the issue sufficiently, but if it's unclear please let me know.
Here is a list of the major differences between the bcd_hex PDB file and the hse_x02 PDB file:
-
hse_x02has noCRYSTline at the top -
hse_x02has noTERline after the host -
hse_x02has hydrogens after heavy atoms, instead of in-between heavy atoms -
hse_x02has 1 residue for the entire complex, with two chains (one for host, one for guest) -
bcd_hexhas no specified residues, with 1 chain per host monomer and 1 chain for the guest (7+1=8 total)
My guess to why the default mdtraj loading works for bcd systems and not hse systems is that it's due to the 4th and 5th difference, but I haven't checked this rigorously. The thing that makes me think this should be less of a "fix your inputs PDBs" solution and more of a "fix the way PDBs are handled" solution is that, to my understanding, the hse PDB is valid and therefore ought to be accepted.
Sorry, I'm still a little confused - MDTraj can load those files in just fine, but gets confused when writing the CONECT records of the HSE host?
Any of the issues you see in the HSE file seem like data problems to me - why, for example, should one have a TER separating the host and guest and the others not?
Does pAPRika do alchemical transformations that muck with the bond graph?
I'm iffy on this as-is since these seem like input data issues, but if there was a try-except (try loading it as-is, and if that fails fall back to using this flag) I'd be more amenable to it.