mdanalysis icon indicating copy to clipboard operation
mdanalysis copied to clipboard

potential performance improvements for H-bond analysis 'between'

Open orbeckst opened this issue 2 years ago • 12 comments

Is your feature request related to a problem?

In the review of PR #4092 , @richardjgowers observed that there's an opportunity for improving the performance for the "between" keyword of HydrogenBondAnalysis https://github.com/MDAnalysis/mdanalysis/pull/4092/#discussion_r1152956537 , namely to filter atomgroups for "between" before capped_distances() is called to calculate distances.

Describe the solution you'd like

Follow up on the suggestion and implement the suggested improvement and benchmark the old code vs the updated one.

Describe alternatives you've considered

Do nothing — that's fine, the update is not mission-critical.

Additional context

Discussion https://github.com/MDAnalysis/mdanalysis/pull/4092/#discussion_r1152956537 around https://github.com/MDAnalysis/mdanalysis/blob/3ebd5ecfa097f296c493d5b75fdab280fe1510c8/package/MDAnalysis/analysis/hydrogenbonds/hbond_analysis.py#L735

  • @richardjgowers : [...] would it make more sense to filter before the call to capped_distances? https://github.com/MDAnalysis/mdanalysis/blob/3ebd5ecfa097f296c493d5b75fdab280fe1510c8/package/MDAnalysis/analysis/hydrogenbonds/hbond_analysis.py#L712

  • @p-j-smith : good point - that would definitely be better, but it's probably not a straightforward change and I imagine there are lots of ways to do it. One way would be to iterate over the pairs of atom selections passed to between and calculate donor-acceptor distances for each pair. So e.g. if between=[["protein", "SOL"], ["protein", "protein"]], we would calculate:

    • protein donor to SOL acceptor distances
    • protein acceptor to SOL donor distances
    • protein donor to protein acceptor distances

    and concatenate them.

orbeckst avatar Apr 27 '23 16:04 orbeckst

Hello,

I'd be happy to try and work on a PR for this if it is okay? I'm not a GSOC person though, just keen to start contributing to some open source software.

Assuming yes, my understanding of the inefficiency is:

Inside the function: _single_frame(self)

https://github.com/MDAnalysis/mdanalysis/blob/e46c5ef66ee1dab6228de8d859e69658d4ae59c9/package/MDAnalysis/analysis/hydrogenbonds/hbond_analysis.py#L742

If a user has provided a selection to the input between, then this is where the filtering occurs.

However, before this filtering, some calculations are run on atoms that will later be filtered.

https://github.com/MDAnalysis/mdanalysis/blob/e46c5ef66ee1dab6228de8d859e69658d4ae59c9/package/MDAnalysis/analysis/hydrogenbonds/hbond_analysis.py#L719

The solution that I think you would like to see is:

  • Move the filtering with the self.between_ags to an earlier step in the process and before capped_distances. I think it could be moved to the function _prepare(self) but I'm not sure yet. Taking care that (as written above) the hydrogen bonding analysis parameter between can be a nested list.
  • Perform benchmarking to compare the speed. For the benchmarking I was thinking to test the speed of the code before and after using MDanalysis test files. For both the before and after system I would test using all atoms (to confirm the code does not take any longer), and with a limited selection, to make sure the code actually ends up being faster.

I have had a look at your documentation on how to contribute, and its very clear so thank you for that.

In summary, is it okay for me to start working on this and if so have I've understood the issue correctly? If so, I'll start working on a solution and a PR.

Thanks! Rory

RMCrean avatar May 11 '23 15:05 RMCrean

@RMCrean you're more than welcome to work on the issue, outside of GSOC. The tag just means that we think it's suitable for potential GSOC candidates.

I think your understanding of the problem is generally correct but @p-j-smith is the expert here.

I don't think that code in _prepare() will always solve the problem because that method is only called once at the start. However, it is possible that contents of atomgroups change between steps (when using UpdatingAtomGroup) and so one may have to evaluate at every step.

Please put up a PR! (EDIT: and please ensure that the PR mentions this issue so that the PR gets auto-linked with the issue)

orbeckst avatar May 12 '23 18:05 orbeckst

Great, thanks. I'll start working on this later today then!

Thanks for the point about _prepare() and UpdatingAtomGroup, I'll keep it in mind when working on the problem.

RMCrean avatar May 15 '23 10:05 RMCrean

Hi again, I've started working on this now and have made some progress and feel confident I understand how the calculation works now but I am not ready for a PR yet. I thought I could give a brief update though.

In order to test if the new implementation will work correctly and how it compares timing wise to the current implementation I've made a copy of the hbond_analysis.py file (which won't be edited or committed).

I've then setup a small test trajectory that has both protein and water (the first 10 frames from MDAnalysisData.datasets.fetch_ifabp_water() ).

Then I run though a number of different types of "between" commands using either the old implementation or the new implementation (once made) of hbond_analysis.py and analyse how long it takes on average to perform the calculation. I will also check the old and new forms of the calculation give the same results by comparing the number of hydrogen bonds found as well as checking the hbonds.results.hbonds are identical for the same between command used.

These are results for the original version of hbond_analysis.py: image

It seems at least promising that everything roughly takes the same amount of time (~11 seconds) regardless of the type of between selection used.

Does this sound like a reasonable way to do a comparison?

(Also happy to share code used etc.. to do this part even if of course it won't be part of the PR)

RMCrean avatar May 17 '23 13:05 RMCrean

Sounds like a decent start. I suggest you just put up a PR with what you have. You can mark it as draft if you like but it's generally a lot easier to discuss when code is visible.

For doing performance comparison, I would create two virtual environments (conda create -n mda_stable python=3.11 mdanalysis mdanalysisdata and conda create -n mda_dev python=3.11 ...). In the first one, you install the latest release so you have the standard version. In the second env, you install all pre-requisites and then install your development version of MDAnalysis with pip install -e package/ (i.e., a developer installation). You can then run your benchmarks in the two different environments mda_stable and mda_dev.

You can paste the code that you use for comparison into the comments of the PR.

We also have ASV benchmarking code in benchmarks/benchmarks that are run nightly. These benchmarks are published at https://www.mdanalysis.org/benchmarks/ and it would be really great to have some for hydrogen bond analysis.

orbeckst avatar May 17 '23 21:05 orbeckst

I want to work on this if no one is working right now. @orbeckst

Sumit112192 avatar Jan 05 '24 10:01 Sumit112192

So, here is what I have found. I created a file check.py with the code

import MDAnalysis as mda
from MDAnalysis.tests.datafiles import PSF, DCD, GRO, XTC

from MDAnalysis.analysis.hydrogenbonds import HydrogenBondAnalysis as HBA


u = mda.Universe(PSF, DCD)

hbonds = HBA(universe=u)
hbonds.run()

I ran cProfile to check which part was relatively taking the most cumulative time. Here is the result image

The _get_dh_pairs function takes the most time in _single_frame. Also, in _get_dh_pairs, the line number 637

donors = sum(h.bonded_atoms[0] for h in hydrogens) if hydrogens \                 
        else AtomGroup([], self.u)

takes the most time, enumerating all possible donors in the found hydrogens.

To leverage the between keyword, we could set self.donors_sel to the value passed in the between keyword, which would avoid the enumeration and might make the program run fast.

Please let me know if my understanding is correct so I can work on it further. @orbeckst

Sumit112192 avatar Jan 05 '24 11:01 Sumit112192

Without directly looking at the code I don't know if this will solve the problem. Try it out and open a PR. Check that your changes don't break the tests. Benchmark. Once we can look at code, it's a lot easier to follow your logic.

orbeckst avatar Jan 05 '24 20:01 orbeckst