mdanalysis icon indicating copy to clipboard operation
mdanalysis copied to clipboard

Fix: modernise HBA to use AG as objects internally instead of selection strings - code only PR (Issue #3933)

Open lunamorrow opened this issue 1 year ago • 37 comments

Fixes #3933

Changes made in this Pull Request:

  • Updated guess_acceptors, guess_hydrogens, guess_donators, group_categories and get_dh_pairs to remove reliance of HydrogenBondAnalysis on selection strings being passed internally

PR Checklist

  • [x] Tests?
  • [x] Docs?
  • [x] CHANGELOG updated?
  • [x] Issue raised/referenced?

Developers certificate of origin


📚 Documentation preview 📚: https://mdanalysis--4533.org.readthedocs.build/en/4533/

lunamorrow avatar Mar 27 '24 08:03 lunamorrow

Hello @lunamorrow! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

Line 111:80: E501 line too long (111 > 79 characters) Line 113:80: E501 line too long (113 > 79 characters) Line 114:80: E501 line too long (113 > 79 characters) Line 445:52: W291 trailing whitespace Line 500:58: W291 trailing whitespace Line 520:74: W291 trailing whitespace Line 543:73: W291 trailing whitespace Line 568:9: E265 block comment should start with '# ' Line 569:70: W291 trailing whitespace Line 634:68: W291 trailing whitespace Line 635:80: E501 line too long (80 > 79 characters) Line 644:68: W291 trailing whitespace Line 645:80: E501 line too long (80 > 79 characters) Line 709:49: E128 continuation line under-indented for visual indent Line 709:80: E501 line too long (80 > 79 characters)

Comment last updated at 2024-04-26 01:03:36 UTC

pep8speaks avatar Mar 27 '24 08:03 pep8speaks

Linter Bot Results:

Hi @lunamorrow! Thanks for making this PR. We linted your code and found the following:

Some issues were found with the formatting of your code.

Code Location Outcome
main package ⚠️ Possible failure
testsuite ⚠️ Possible failure

Please have a look at the darker-main-code and darker-test-code steps here for more details: https://github.com/MDAnalysis/mdanalysis/actions/runs/8841528655/job/24278766039


Please note: The black linter is purely informational, you can safely ignore these outcomes if there are no flake8 failures!

github-actions[bot] avatar Mar 27 '24 08:03 github-actions[bot]

Just leaving some comments here for whoever is assigned to check this PR:

  1. I will be submitting a second PR to depreciate the API once I receive any feedback on this one. I want to make sure I have made the appropriate changes here, and any adjustments necessary, before I make that second PR.
  2. A block of code is commented out in _get_dh_pairs as I was unsure whether the check on line 627 had to be repeated (also in guess_donors). In #3933 I have asked about this double up, but decided to proceed with what looked like the best, most robust option. I am also unsure if there is a reason that self.donors_sel is not set in _prepare the same way that self.hydrogens_sel and self.acceptors_sel are. I have added this functionality, which should make the check in _get_dh_pairs redundant.
  3. I have run the current test cases for test_hydrogenbonds_analysis.py and have 13 failed, 26 passed, 14 warnings and 5 errors. All of the errors are because This Universe does not contain charge information, which I do not think is due to my changes? Of the failures, 6 are from Can't perform '__eq__' between objects: 'AtomGroup' and 'str', so these will require for me to change the test cases and update this PR yes? The remaining failures are due to hydrogens_ag not being a recognised attribute (I am unsure how to fix this?), 2x Assertion Error (Also unsure on best way to address this) or double ups with the charge information errors.

lunamorrow avatar Mar 27 '24 09:03 lunamorrow

Update: That big block of code that I commented out was necessary. Turns out that its removal was causing 2 failures. I am adding it back in now, and attempting to tackle those 2 assertion errors.

Question: I have added a commit to my branch, and I suspect I can then push updated code into this PR (i.e., as I fix things that were erroring and adjust/add tests)? Is this correct?

lunamorrow avatar Mar 27 '24 09:03 lunamorrow

@yuxuanzhuang can you please look after this PR or assign to someone else, please?

orbeckst avatar Mar 29 '24 03:03 orbeckst

I also don't have experience handling depreciation before, so I might need help from @hmacdope :)

yuxuanzhuang avatar Mar 29 '24 06:03 yuxuanzhuang

Question: I have added a commit to my branch, and I suspect I can then push updated code into this PR (i.e., as I fix things that were erroring and adjust/add tests)? Is this correct?

Yes, that's correct! After you've added a commit to your branch, you can push the updated code to the same branch that is associated with this PR. Any new commits pushed to the branch will automatically appear in the PR (as you might already saw).

yuxuanzhuang avatar Mar 29 '24 06:03 yuxuanzhuang

Thanks @yuxuanzhuang I am working through your comments now and will aim to push the updated branch shortly!

Quick question about this comment:

You need to make sure acceptors_ag, hydrogens_ag, etc. always exist.

else:
    self.acceptors_ag = self.u.select_atoms(self.acceptors_sel)

I need to avoid using select_atoms (using guess_acceptors instead), but could you please elaborate on what you mean? I have defined them in the initialiser as None, so that they are always there, but I cannot figure out how your code piece suggestion integrates.

Also, do I need to worry about the "This Universe does not contain charge information" errors? I do not think they have anything to do with my changes? I am worried they could be "hiding" other possible errors further down that aren't being reached before this error is thrown though...

lunamorrow avatar Mar 29 '24 06:03 lunamorrow

Regarding depreciation, current provided code examples will break e.g.

u = MDAnalysis.Universe(PSF, DCD)

hbonds = HBA(universe=u)
hbonds.hydrogens_sel = hbonds.guess_hydrogens("protein") # this will not be selection string.
hbonds.acceptors_sel = hbonds.guess_acceptors("protein")
hbonds.run()

Should we consider accepting AtomGroup for hbonds.hydrogens_sel and other inputs as well?

yuxuanzhuang avatar Mar 29 '24 06:03 yuxuanzhuang

I need to avoid using select_atoms (using guess_acceptors instead), but could you please elaborate on what you mean? I have defined them in the initialiser as None, so that they are always there, but I cannot figure out how your code piece suggestion integrates.

The logic behind guess_acceptors and the provision of an implicit acceptors_sel IMO is that with the latter, you are certain of the atoms you want to use for calculating hydrogen bonds (so you only need to select_atoms. In guess_acceptors, the Universe needs to have enough information, such as charges and masses, to make guesses; however, it's not always the case.

yuxuanzhuang avatar Mar 29 '24 06:03 yuxuanzhuang

I also don't have experience handling depreciation before, so I might need help from @hmacdope :)

Of course, all good! If you can just help me get this PR sorted and ready to go I'll be so grateful! The depreciation can be sorted later :)

I need to avoid using select_atoms (using guess_acceptors instead), but could you please elaborate on what you mean? I have defined them in the initialiser as None, so that they are always there, but I cannot figure out how your code piece suggestion integrates.

The logic behind guess_acceptors and the provision of an implicit acceptors_sel IMO is that with the latter, you are certain of the atoms you want to use for calculating hydrogen bonds (so you only need to select_atoms. In guess_acceptors, the Universe needs to have enough information, such as charges and masses, to make guesses; however, it's not always the case.

Oh ok yes, I understand what you mean now! Yeah I have been a bit unsure on how to tackle this in the "best" way. The guess_acceptors works the same as it used to, but I could add a layer of function that will consider the user input acceptors_sel if it is not None, instead of guessing based on Atom properties? Then change this line in _prepare (see below; add an else block) and pass a user defined self.acceptors_sel (if it exists) to define the Atoms in self.acceptors_ag? I am thinking I can set a selection string as an optional arg for guess_acceptors (and repeat this process for guess_hydrogens and guess_donors too)? Does this could like a good option?

        if self.acceptors_sel is None:
            self.acceptors_ag = self.guess_acceptors()
            self.acceptors_sel = self._group_categories(self.acceptors_ag) #NOTE: do I need this?

lunamorrow avatar Mar 29 '24 07:03 lunamorrow

Regarding depreciation, current provided code examples will break e.g.

u = MDAnalysis.Universe(PSF, DCD)

hbonds = HBA(universe=u)
hbonds.hydrogens_sel = hbonds.guess_hydrogens("protein") # this will not be selection string.
hbonds.acceptors_sel = hbonds.guess_acceptors("protein")
hbonds.run()

Should we consider accepting AtomGroup for hbonds.hydrogens_sel and other inputs as well?

Yes I agree. I am thinking a good work-around would be for users to define hbonds.hydrogens_sel, hbonds.donors_sel or hbonds.acceptors_sel when they initialise the object? Otherwise they could manually manipulate it before calling run() and it should all flow well (once I make the adjustments to guess_acceptors, guess_hydrogens and guess_donors as per above). I would like to keep hydrogens_sel for selection strings and hydrogens_ag for the AtomGroups separated? I think this will provide the best functionality/robustness overall?

lunamorrow avatar Mar 29 '24 07:03 lunamorrow

@yuxuanzhuang I have made some changes, as I proposed above, which I will push shortly. I have pretty much eliminated the errors/failures bar these two.

MDAnalysis.exceptions.NoDataError: This Universe does not contain charge information

I suspect the charge one is due to your comment? Is there anything I can do to fix this?

In guess_acceptors, the Universe needs to have enough information, such as charges and masses, to make guesses; however, it's not always the case.

AttributeError: AtomGroup has no attribute replace.

I cannot find a call to replace in the tests throwing that error. Where can I find the root cause of this?

lunamorrow avatar Mar 29 '24 07:03 lunamorrow

I think there may be differences in the results when using guess_donors compared to _get_dh_pairs, because guess_donors includes a check for charge limitations. Note guess_donors is not used at all in the original implementation.

import MDAnalysis
from MDAnalysis.analysis.hydrogenbonds.hbond_analysis import HydrogenBondAnalysis as HBA

u = MDAnalysis.Universe(PSF, DCD)

hbonds = HBA(universe=u)
hbonds.run()
print(hbonds.results.hbonds.shape)
> (10858, 6) # old results
> (4237, 6) # new results
hbonds._donors.n_atoms
> 375 # old results
> 74 # new results

will end up with a different result from the current implementation.

To facilitate testing, you can copy old hbond_analysis.py as hbond_analysis_old.py and from MDAnalysis.analysis.hydrogenbonds.hbond_analysis_old import HydrogenBondAnalysis as HBA_old.

I think this is currently not captured by the tests.

Apologies that I don't really have enough time during the Easter to make thorough suggestions on what should be done about it right now.

However, I would suggest to make sure you

  1. when doing select_atoms, also add updating=update_selections to make sure it will be an UpdatingAtomGroup
  2. make sure the things you parse around are not a string but an AtomGroup. AttributeError: AtomGroup has no attribute replace. this error is linked to putting an AtomGroup into select_atoms.
  3. you can do selections with group syntax e.g. donors_ag = hydrogens_ag.residues.atoms.select_atoms( f"({select}) and around {self.d_h_cutoff} group hydrogens_ag", hydrogens_ag=hydrogens_ag, updating=self.update_selections ).

yuxuanzhuang avatar Mar 30 '24 20:03 yuxuanzhuang

Thanks @yuxuanzhuang, and I completely understand. I appreciate you taking time out of your Easter weekend to give me a hand. Unfortunately I am really struggling with this and I think I might need to take it back to the start and re-implement changes again from scratch.

I have spent a good few hours reading over and tweaking code (including adjusting for your suggested changes 1 and 2), and it's still not working nicely with the tests. I have a number of tests that are getting empty AtomGroups, when the selection provided should return something, which is causing a plethora of problems.

I'm getting a bit worried about finalising this fix before GSoC applications close, but I will keep this PR updated with any more questions as I go.

lunamorrow avatar Apr 01 '24 11:04 lunamorrow

I'm getting a bit worried about finalising this fix before GSoC applications close, but I will keep this PR updated with any more questions as I go.

Don't stress about the timeline, there is no requirement that your PR is merged before the application deadline, only that we have had a chance to work with you and go back and forth a bit. :)

hmacdope avatar Apr 01 '24 11:04 hmacdope

I also don't have experience handling depreciation before, so I might need help from @hmacdope :)

We can move this to a seperate PR when the time comes. :)

hmacdope avatar Apr 01 '24 11:04 hmacdope

Ok great, that is such a relief to hear @hmacdope. I have been panicking a bit! I'll focus on my written application and keep chugging away at this fix while I wait to hear if I get accepted :)

We can move this to a seperate PR when the time comes. :)

Perfect, I'll touch base with you before starting that to get some more clear direction!

lunamorrow avatar Apr 01 '24 11:04 lunamorrow

Yes I agree. I am thinking a good work-around would be for users to define hbonds.hydrogens_sel, hbonds.donors_sel or hbonds.acceptors_sel when they initialise the object?

Much preferred to setting them by hand.

hmacdope avatar Apr 01 '24 11:04 hmacdope

Side note while testing: slicing UpdatingAtomGroup will result in a static AtomGroup. -> need a workaround to return an UpdatingAtomGroup in e.g. guess_donors because donors_ag will be sliced by donors_ag[donors_ag.charges < max_charge]

u.select_atoms('protein', updating=True).__class__
> MDAnalysis.core.groups.UpdatingAtomGroup
u.select_atoms('protein', updating=True)[2:].__class__
> MDAnalysis.core.groups.AtomGroup

yuxuanzhuang avatar Apr 03 '24 15:04 yuxuanzhuang

If you write this as as selection instead of slicing, can you maintain the updating group

donors_ag.select_atoms(f"prop charge < {max_charge}", updating=True)

orbeckst avatar Apr 03 '24 17:04 orbeckst

Side note while testing: slicing UpdatingAtomGroup will result in a static AtomGroup. -> need a workaround to return an UpdatingAtomGroup in e.g. guess_donors because donors_ag will be sliced by donors_ag[donors_ag.charges < max_charge]

If you write this as as selection instead of slicing, can you maintain the updating group

donors_ag.select_atoms(f"prop charge < {max_charge}", updating=True)

This is great to know, thanks @orbeckst and @yuxuanzhuang! I'll make that change to guess_donors, which should help with some of the testing issues. I just checked guess_acceptors and guess_hydrogens, and the same is happening with the AtomGroup selections in these methods, so I will workshop some solutions for them. I'll get these changes done and a few more, then I'll push a new version and share any testing errors/failures for it that I cannot solve.

Just one question: will the use of select_atoms in this solution be appropriate in the method, since it will still be returning/passing an AtomGroup?

lunamorrow avatar Apr 04 '24 00:04 lunamorrow

Hi @yuxuanzhuang @orbeckst @hmacdope. I'm jumping back on this PR as I have just finished off a bunch of university assessment. A couple weeks away from it was really beneficial for me to gain a fresh perspective and make fixes/changes relatively quickly. I have just committed updated versions of hbond_analysis and test_hydrogenbonds_analysis and I am pretty confident that I have implemented the changes as we discussed and the tests are running nicely. The code still needs some "cleaning" as I have left comments and some random prints here and there for while I was making changes, and some of the docstrings need a bit of TLC too. If everything looks good I will clean these up and commit a fresh version to be merged with version 3. Thanks! 😊

lunamorrow avatar Apr 16 '24 03:04 lunamorrow

Codecov Report

All modified and coverable lines are covered by tests :white_check_mark:

Project coverage is 93.23%. Comparing base (804b607) to head (39e77c9).

Additional details and impacted files
@@             Coverage Diff             @@
##           develop    #4533      +/-   ##
===========================================
- Coverage    93.64%   93.23%   -0.42%     
===========================================
  Files          168       12     -156     
  Lines        21254     1079   -20175     
  Branches      3918        0    -3918     
===========================================
- Hits         19904     1006   -18898     
+ Misses         892       73     -819     
+ Partials       458        0     -458     

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.

codecov[bot] avatar Apr 24 '24 22:04 codecov[bot]

Thanks @yuxuanzhuang and @orbeckst.

Currently, one can use:

hbonds = HydrogenBondAnalysis(universe=u)
hbonds.hydrogens_sel = hbonds.guess_hydrogens("resname ARG HIS LYS")
hbonds.acceptors_sel = hbonds.guess_acceptors("resname ARG HIS LYS")

However, this will fail with the current implementation because hbonds.acceptors_sel requires a string, whereas guess_acceptors returns an AtomGroup.

While it's acceptable to have a breaking change in the API, it would be beneficial to ensure that most of the existing code remains functional.

I understand what you mean. Just to make sure we are on the same page, I believe once these changes are implemented that I should be doing a new PR per @hmacdope's guidance that will issue a depreciation warning for some of these features and how the Class is currently used. If I ensure that the API/docstring examples are updated appropriately will that be ok? I cannot see a clean way to implement this change without making all the "guess" methods return AtomGroups instead of selection string.

There are some other examples which will lose functionality. Like being able to guess hydrogens or acceptors from a certain group, like a protein (see below). Since my implementation only uses guess_hydrogens, guess_acceptors and guess_donors if their respective selection string is None. Would it be suitable to make the resulting self._hydrogens, self._acceptors and self._donors mutable through a new method before run is called? Or I could implement a check for if any of self._hydrogens, self._acceptors and self._donors are empty AtomGroups, the respective guess_hydrogens, guess_acceptors or guess_donors could be called with the selection string passed in. Its a bit of a clunky fix though... Any suggestions on this would be really appreciated!

  import MDAnalysis
  from MDAnalysis.analysis.hydrogenbonds.hbond_analysis import HydrogenBondAnalysis as HBA

  u = MDAnalysis.Universe(psf, trajectory)

  hbonds = HBA(universe=u)
  hbonds.hydrogens_sel = hbonds.guess_hydrogens("protein")
  hbonds.acceptors_sel = hbonds.guess_acceptors("protein")
  hbonds.run()

I have gone through the rest of your comments and change requests, and made those changes to the code. I will clean up the comments and docstrings now, then push a new version for your review. Thank you :)

lunamorrow avatar Apr 25 '24 03:04 lunamorrow

Just a heads up, it's often necessary to rebase your code and resolve any conflicts with the latest develop branch. Here’s a helpful guide on how to do that: Rebasing Your Code.

Last time, I performed the rebase directly on GitHub, which might have caused some issues for you (apologies!) You could git pull the modification, if there's any on GitHub, before you push.

yuxuanzhuang avatar Apr 25 '24 20:04 yuxuanzhuang

Thanks @yuxuanzhuang. I realised I had forgotten to rebase before pushing the previous code, when I saw you had done the rebase for me. I made sure to do so for the most recent one. That's ok, I was a bit crazy on my end for the most recent push but I think I have it all sorted now! I'll just need to push a new branch to this PR somehow for further changes.

lunamorrow avatar Apr 26 '24 00:04 lunamorrow

Ok finally rebased and fixed up now. Everything should be sorted :)

lunamorrow avatar Apr 26 '24 01:04 lunamorrow

Thanks for your quick feedback @yuxuanzhuang! :)

See the inline comment and the code below that gets different results after the fix.

That is really weird as far as the results differing between the old and updated HBA. I will have a dig around on my end this week and fix it up.

Given all of your tests pass, it might be worth adding it to your test as well and making sure it also passes.

Will do!

It's worth noting that it runs so so so much faster!!!

hbonds = HBA(universe=u)
%timeit -r 2 -n 5 hbonds.run()
> old 2.85 s ± 8.15 ms per loop (mean ± std. dev. of 2 runs, 5 loops each)
> new 168 ms ± 6.62 ms per loop (mean ± std. dev. of 2 runs, 5 loops each)

Wow that is huge, and will have a really big impact for users performing HBA on large systems! That is so amazing to see the impact that this one change has had.

I'll wait to hear back from the core devs as far as the loss of some functionality.

lunamorrow avatar Apr 29 '24 08:04 lunamorrow

You're asking for other @MDAnalysis/coredevs to jump in and comment "as far as the loss of some functionality.". Can you please summarize the changes so we don't have to read everything in the PR?

  • What was changed?
  • What is behaving differently now (API, results, ...)?

Thanks.

orbeckst avatar Apr 29 '24 16:04 orbeckst