Enable `MDAnalysis.analysis.align.AverageStructure` parallelization
Fixes #4659 (attempt)
Changes made in this Pull Request:
- added `backends` and `aggregators` to `AlignTraj` and `AverageStructure` in `analysis.align` (see the hook sketch below)
- added the `client_AlignTraj` and `client_AverageStructure` fixtures in `conftest.py`
- added `client_AlignTraj` and `client_AverageStructure` to `run()` in `test_align.py`
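For context, this is roughly what "backends and aggregators" look like at the class level. The hook names follow the MDAnalysis 2.8-era `AnalysisBase` parallelization docs; the toy class below is only an illustration, not the code in this PR.

```python
# Toy example of the class-level parallelization hooks; hook and aggregator
# names follow the MDAnalysis 2.8 parallelization docs, NOT this PR's diff.
import numpy as np
from MDAnalysis.analysis.base import AnalysisBase
from MDAnalysis.analysis.results import ResultsGroup


class MeanPositions(AnalysisBase):
    """Per-atom mean coordinates, written so split-apply-combine is valid."""

    # declares that the algorithm itself is safe to parallelize over frames
    _analysis_algorithm_is_parallelizable = True

    @classmethod
    def get_supported_backends(cls):
        # backends a user may request via .run(backend=...)
        return ('serial', 'multiprocessing', 'dask')

    def __init__(self, atomgroup, **kwargs):
        super().__init__(atomgroup.universe.trajectory, **kwargs)
        self._ag = atomgroup

    def _prepare(self):
        # per-worker accumulator for the frame slice handled by that worker
        self.results.positions = np.zeros((len(self._ag), 3))

    def _single_frame(self):
        self.results.positions += self._ag.positions

    def _get_aggregator(self):
        # the "aggregator": how per-worker partial results are merged
        return ResultsGroup(lookup={'positions': ResultsGroup.ndarray_sum})

    def _conclude(self):
        # runs once on the merged results; n_frames spans all analysed frames
        self.results.positions /= self.n_frames
```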
Currently `AlignTraj` only accepts serial and dask, with multiprocessing leading to the pytests taking forever. An additional error that appears is the following:
`OSError: File opened in mode: self.mode. Reading only allow in mode "r"`
For `AverageStructure` the failure that appears is the following:
`AttributeError: 'numpy.ndarray' object has no attribute 'load_new'`
This leads me to believe that `AverageStructure` cannot be parallelized, but I would need additional opinions on it and on `AlignTraj` :)
PR Checklist
- [x] Tests?
- [x] Docs?
- [x] CHANGELOG updated?
- [x] Issue raised/referenced?
Developers' certificate of origin
- [x] I certify that this contribution is covered by the LGPLv2.1+ license as defined in our LICENSE and adheres to the Developer Certificate of Origin.
📚 Documentation preview 📚: https://mdanalysis--4738.org.readthedocs.build/en/4738/
Hello @talagayev! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:
- In the file `testsuite/MDAnalysisTests/analysis/test_align.py`:
  - Line 310:80: E501 line too long (86 > 79 characters)
  - Line 327:80: E501 line too long (87 > 79 characters)
  - Line 333:80: E501 line too long (97 > 79 characters)
  - Line 357:80: E501 line too long (85 > 79 characters)
  - Line 377:80: E501 line too long (91 > 79 characters)
Comment last updated at 2025-01-11 21:40:18 UTC
Codecov Report
:white_check_mark: All modified and coverable lines are covered by tests.
:white_check_mark: Project coverage is 92.73%. Comparing base (b62e2f7) to head (fbb85ac).
:warning: Report is 2 commits behind head on develop.
Additional details and impacted files
@@ Coverage Diff @@
## develop #4738 +/- ##
===========================================
+ Coverage 92.72% 92.73% +0.01%
===========================================
Files 180 180
Lines 22472 22491 +19
Branches 3188 3191 +3
===========================================
+ Hits 20837 20858 +21
+ Misses 1177 1176 -1
+ Partials 458 457 -1
:umbrella: View full report in Codecov by Sentry.
@marinegor I wanted to ping you in this PR as well.
Here I basically tried different ways to see whether it is possible to parallelize the `AverageStructure` and `AlignTraj` classes.
I was able to implement the parallelization for `AverageStructure`; the first attempt did not work because the universe could not be read, so this error appeared:
`AttributeError: 'numpy.ndarray' object has no attribute 'load_new'`
I added `_first` to make the class parallelizable. This works well with the tests, except for the case when `in_memory=True`, where the process takes very long, which I assume is connected to memory issues; so in the current case I revert back to serial when `in_memory=True`.
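For reference, the tests exercise the new backends roughly like this. This is a simplified sketch of the `client_AverageStructure` plumbing listed in the PR description: the real fixtures in the testsuite's `conftest.py` are generated from a shared helper, so treat the parameter dicts and the assertion below as illustrative assumptions.

```python
# Simplified sketch of the backend-parametrized test plumbing; parameters
# and the test body are illustrative, not the exact code in test_align.py.
import pytest
import MDAnalysis as mda
from MDAnalysis.analysis import align
from MDAnalysisTests.datafiles import PSF, DCD


@pytest.fixture(params=[
    {'backend': 'serial'},
    {'backend': 'multiprocessing', 'n_workers': 2},
])
def client_AverageStructure(request):
    # each parametrization yields a dict of .run() keyword arguments
    return request.param


def test_average_structure_backends(client_AverageStructure):
    u = mda.Universe(PSF, DCD)
    avg = align.AverageStructure(u).run(**client_AverageStructure)
    # the averaged coordinates should cover every atom, whatever the backend
    assert avg.results.positions.shape == (u.atoms.n_atoms, 3)
```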
As for `AlignTraj`, since it transforms coordinates and writes out structures, more parts of the code would need to be rewritten to make it parallelizable, and I did not find an easy solution that would not require bigger modifications of the class. So the question would be whether we keep it non-parallelizable, or whether I should try to modify the code to make it parallelizable, which would require bigger modifications?
@talagayev
except for the case when `in_memory=True`, where the process takes very long, which I assume is connected to memory issues
I think parallelization should not actually work with `in_memory` cases (@yuxuanzhuang please correct me if I'm wrong, afaik you've been working on this). Hence I'd explicitly raise an exception if the user asks for parallel execution and provides `in_memory` as well.
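A minimal sketch of such a guard, assuming a hypothetical standalone helper; the function name and where it would be called from are not part of this PR's actual diff.

```python
# Hypothetical guard (not the exact code in this PR): refuse parallel
# backends when the trajectory has been transferred to memory.
from MDAnalysis.coordinates.memory import MemoryReader


def _check_parallel_vs_in_memory(universe, backend):
    """Raise if a parallel backend is combined with an in-memory trajectory."""
    if backend not in (None, 'serial') and isinstance(universe.trajectory,
                                                      MemoryReader):
        raise ValueError(
            "in_memory=True is not compatible with parallel backends; "
            "run with backend='serial' instead."
        )
```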
So the question would be whether we keep it non-parallelizable, or whether I should try to modify the code to make it parallelizable, which would require bigger modifications?
If you're running out of ideas, I'd suggest making this PR for `AverageStructure` only and creating an appropriate issue for `AlignTraj`, describing your attempts so far.
Also, I imagine there are problems with serialization of `self._writer`, no? Perhaps we can chat on Discord about it (I'm @marinegor there)?
Yes, that makes sense. I think the two analysis-related classes that currently use `in_memory` are `AverageStructure` and `AlignTraj`.
Yes, that would be good. I can then rename the PR to cover only `AverageStructure` for now, add the missing parts to the PR (documentation + CHANGELOG), create an issue, and write to you on Discord so that we can brainstorm how to adjust the code to make it parallelizable. Yes, `self._writer` is one of the difficulties. I guess for `AlignTraj` the code could be adjusted to give it the reference, but the writing during parallelization is the tricky part; maybe with temporary data that is then merged, or maybe by doing only the calculations in parallel and keeping the writing in `_conclude`, i.e. keeping that part serial and only making the calculations parallel.
@talagayev ok, will be waiting for your message. I also assigned myself a reviewer here, so just re-request review when you think you're done!
Added the documentation and CHANGELOG entry, and adjusted the code to raise a ValueError. The PR would be ready to be re-reviewed :)
@talagayev all looking good! I initially commented on one extra line but just realized I can remove it myself.
Also, may I ask you to create a separate issue for `AlignTraj` parallelization? Perhaps you could describe the issues you encountered there, and suggest the direction in which one should move to actually enable it.
Perfect :)
Yes, I can create an issue and write down some ideas in there.
/azp run
Azure Pipelines successfully started running 1 pipeline(s).
@talagayev and regarding `AlignTraj` -- I'd just be bold and say that it's impossible to parallelize it with the current split-and-combine technique, since that would require the ability to write an arbitrary frame with `self._writer`, and not only sequential writing. I don't think any writers allow that.
So even in this PR, set available_backends to only `serial` explicitly, and note that it's impossible to parallelize. And as for the next issue, I'd target adding an option to save the aligned trajectory to `self.results` and writing it out in `_conclude`.
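To make that direction concrete, here is a hedged sketch of such a deferred-write analysis: the superposition happens per frame with no file I/O, and the trajectory is written sequentially in `_conclude` after the per-worker results are aggregated. The class, its hooks, and the `flatten_sequence` aggregator are assumptions based on the 2.8-era parallelization docs, not MDAnalysis API or code from this PR.

```python
# Hypothetical deferred-write variant of trajectory alignment; this is an
# illustration of the proposed direction, not AlignTraj itself.
import MDAnalysis as mda
from MDAnalysis.analysis import align
from MDAnalysis.analysis.base import AnalysisBase
from MDAnalysis.analysis.results import ResultsGroup


class AlignAndWriteLater(AnalysisBase):
    """Superpose each frame in _single_frame (no file I/O), then write the
    aligned trajectory sequentially in _conclude after aggregation."""

    _analysis_algorithm_is_parallelizable = True

    @classmethod
    def get_supported_backends(cls):
        return ('serial', 'multiprocessing', 'dask')

    def __init__(self, mobile, reference, filename, select='all', **kwargs):
        super().__init__(mobile.trajectory, **kwargs)
        self._mobile = mobile            # mobile Universe
        self._reference = reference      # reference Universe or AtomGroup
        self._select = select
        self.filename = filename

    def _prepare(self):
        # aligned coordinates are collected per frame instead of written out
        self.results.positions = []

    def _get_aggregator(self):
        # concatenate the per-worker frame lists in trajectory order
        return ResultsGroup(lookup={'positions': ResultsGroup.flatten_sequence})

    def _single_frame(self):
        align.alignto(self._mobile, self._reference, select=self._select)
        self.results.positions.append(self._mobile.atoms.positions.copy())

    def _conclude(self):
        # sequential write-out keeps the Writer out of the parallel section
        with mda.Writer(self.filename, self._mobile.atoms.n_atoms) as writer:
            for positions in self.results.positions:
                self._mobile.atoms.positions = positions
                writer.write(self._mobile.atoms)
```

The obvious trade-off is that all aligned coordinates are held in memory until `_conclude`, which may be unacceptable for large trajectories; that seems like the main point to discuss in the new issue.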