Bug in hydrogen filtering in the distance geometry module
Hi there!
I found a small bug that causes the PoseBusters.bust function to break; I think it's an edge case that occurs with unusual (e.g., de novo designed) molecules.
Attached is an SDF of a molecule that is valid (in that RDKit can sanitize it) but, at least on my machine, reliably breaks PoseBusters. Here is a snippet to reproduce:
import posebusters as pb
from rdkit import Chem
bad_mol = Chem.MolFromMolFile('badmol.sdf')
buster = pb.PoseBusters(config='mol', max_workers=0)
pb_result = buster.bust([bad_mol], None, None)
This produces the following:
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
Cell In[5], line 2
1 buster = pb.PoseBusters(config='mol', max_workers=0)
----> 2 pb_result = buster.bust([bad_mol], None, None)
File ~/mambaforge/envs/mol-fm/lib/python3.10/site-packages/posebusters/posebusters.py:132, in PoseBusters.bust(self, mol_pred, mol_true, mol_cond, full_report)
130 self.file_paths = pd.DataFrame([[mol_pred, mol_true, mol_cond] for mol_pred in mol_pred_list], columns=columns)
131 generator = self._run()
--> 132 results = self._collect_in_table(generator, full_report=full_report)
133 return results
File ~/mambaforge/envs/mol-fm/lib/python3.10/site-packages/posebusters/posebusters.py:295, in PoseBusters._collect_in_table(self, results_gen, full_report)
292 def _collect_in_table(self, results_gen, full_report) -> pd.DataFrame:
293 """Collect generator results in a pandas dataframe."""
--> 295 df = pd.concat([self._make_table({k: v}, self.config, full_report=full_report) for k, v in results_gen])
296 df.index.names = ["file", "molecule", "position"]
297 df.columns = [c.lower().replace(" ", "_") for c in df.columns]
File ~/mambaforge/envs/mol-fm/lib/python3.10/site-packages/posebusters/posebusters.py:295, in <listcomp>(.0)
292 def _collect_in_table(self, results_gen, full_report) -> pd.DataFrame:
293 """Collect generator results in a pandas dataframe."""
--> 295 df = pd.concat([self._make_table({k: v}, self.config, full_report=full_report) for k, v in results_gen])
296 df.index.names = ["file", "molecule", "position"]
297 df.columns = [c.lower().replace(" ", "_") for c in df.columns]
File ~/mambaforge/envs/mol-fm/lib/python3.10/site-packages/posebusters/posebusters.py:160, in PoseBusters._run(self)
158 chunk_size = self.config.get("chunk_size", 100)
159 if max_workers is not None and max_workers <= 0:
--> 160 yield from self._run_single_thread()
161 elif chunk_size is None:
162 yield from self._run_parallel_over_files(max_workers=max_workers)
File ~/mambaforge/envs/mol-fm/lib/python3.10/site-packages/posebusters/posebusters.py:168, in PoseBusters._run_single_thread(self)
166 def _run_single_thread(self) -> Generator[ResultTuple, None, None]:
167 for _, paths in self.file_paths.iterrows():
--> 168 yield from self._run_multiple_poses(paths)
File ~/mambaforge/envs/mol-fm/lib/python3.10/site-packages/posebusters/posebusters.py:243, in PoseBusters._run_multiple_poses(self, paths, indices)
240 mol_args["mol_pred"] = mol_pred
242 key: ResultKey = (str(paths["mol_pred"]), self._get_name(mol_pred), i)
--> 243 results: ResultList = self._run_one_pose(mol_args)
245 yield key, results
File ~/mambaforge/envs/mol-fm/lib/python3.10/site-packages/posebusters/posebusters.py:262, in PoseBusters._run_one_pose(self, molecules)
260 module_output: dict[str, Any] = {"results": {}}
261 else:
--> 262 module_output = func(**args_needed)
264 # save to object
265 results.extend([(name, k, v) for k, v in module_output["results"].items()])
File ~/mambaforge/envs/mol-fm/lib/python3.10/site-packages/posebusters/modules/distance_geometry.py:199, in check_geometry(mol_pred, threshold_bad_bond_length, threshold_clash, threshold_bad_angle, bound_matrix_params, ignore_hydrogens, sanitize, symmetrize_conjugated_terminal_groups)
196 df_12["distance"] = conf_distances[lower_triangle_idcs]
198 if ignore_hydrogens:
--> 199 df_12 = df_12.loc[~df_12["has_hydrogen"], :]
201 # calculate violations
202 df_bonds = _bond_check(df_12)
***
stack trace extends into the bowels of pandas
***
KeyError: "None of [Index([-1, -1, -1, -1, -1, -1, -1, -1, -1, -1,\n ...\n -1, -1, -1, -1, -1, -1, -1, -1, -1, -1],\n dtype='object', length=253)] are in the [index]"
It seems we index into df_12 using the inversion of its has_hydrogen column. When that column is created, its dtype is not set explicitly, so pandas has to infer it. In this edge case pandas gives the column an object dtype rather than bool, so ~ is applied element-wise to Python bool objects as an integer bitwise NOT (~False is -1) instead of a logical negation. The result is a column of negative integers, which .loc then treats as row labels rather than as a boolean mask, hence the KeyError.
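To illustrate the mechanism described above in isolation, here is a minimal sketch that does not need the attached SDF. It assumes a toy frame whose has_hydrogen column is forced to object dtype; the column values and the distance numbers are made up for illustration and are not taken from PoseBusters.

```python
import pandas as pd

# Toy stand-in for df_12 (values are illustrative, not from PoseBusters).
# dtype=object is forced here to mimic the edge case where pandas does not
# infer a bool column.
df_12 = pd.DataFrame(
    {
        "has_hydrogen": pd.Series([False, False, True], dtype=object),
        "distance": [1.52, 1.39, 1.09],
    }
)

# On an object column, ~ is applied to each Python bool as an integer:
# ~False == -1 and ~True == -2, so the result is not a boolean mask.
inverted = ~df_12["has_hydrogen"]
print(inverted.tolist())

# .loc sees a list of integer labels that are not in the index.
try:
    df_12.loc[inverted, :]
except KeyError as err:
    print("KeyError:", err)
```

On recent Python versions the inversion may also emit a DeprecationWarning about applying ~ to bool.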
A simple fix is to avoid using the inversion operator here.
if ignore_hydrogens:
    df_12 = df_12.loc[df_12["has_hydrogen"]==False, :]
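For comparison, another possible approach is to cast the column to bool before inverting it. The sketch below uses a hypothetical helper name (drop_hydrogen_pairs) and made-up data, and assumes the column only ever holds True/False values; it is not the code in distance_geometry.py.

```python
import pandas as pd


def drop_hydrogen_pairs(df_12: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: keep only rows whose has_hydrogen flag is False."""
    # Casting to bool first makes ~ a logical negation regardless of how
    # pandas inferred the column's dtype.
    keep = ~df_12["has_hydrogen"].astype(bool)
    return df_12.loc[keep, :]


# Toy frame with an object-dtype flag column (illustrative values only).
df_12 = pd.DataFrame(
    {
        "has_hydrogen": pd.Series([False, True, False], dtype=object),
        "distance": [1.52, 1.09, 1.39],
    }
)

filtered_cast = drop_hydrogen_pairs(df_12)
# The comparison-based fix proposed above selects the same rows.
filtered_eq = df_12.loc[df_12["has_hydrogen"] == False, :]
print(filtered_cast.equals(filtered_eq))
```

Either variant yields a proper boolean mask, so the choice is mostly a matter of style.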
I can submit a PR shortly.