`MDAnalysis.analysis.diffusionmap` parallelization
Fixes #4679 (attempt).
Changes made in this Pull Request:
- added backends and aggregators to `DistanceMatrix` in `analysis.diffusionmap`
- added the `client_DistanceMatrix` fixture in `conftest.py` (a rough sketch follows this list)
- used `client_DistanceMatrix` in `run()` in `test_diffusionmap.py`
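Roughly, the fixture and its use could look like the following (a sketch only, not the exact test-suite code; the backend list and the dict-of-kwargs parametrization are assumptions):

```python
import pytest
import MDAnalysis as mda
from MDAnalysis.analysis import diffusionmap
from MDAnalysisTests.datafiles import PSF, DCD


# hypothetical fixture: each param is a dict of keyword arguments
# that the test forwards to AnalysisBase.run()
@pytest.fixture(params=[
    {"backend": "serial"},
    {"backend": "multiprocessing", "n_workers": 2},
])
def client_DistanceMatrix(request):
    return request.param


def test_distance_matrix_backends(client_DistanceMatrix):
    u = mda.Universe(PSF, DCD)
    dm = diffusionmap.DistanceMatrix(u, select="backbone")
    dm.run(**client_DistanceMatrix)
    # the matrix must cover the full trajectory, independent of the backend
    assert dm.results.dist_matrix.shape == (u.trajectory.n_frames,
                                            u.trajectory.n_frames)
```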
Here is the problem:
From what I can see, currently
`self.results.dist_matrix = np.zeros((self.n_frames, self.n_frames))`
is problematic: with parallelization, `self.n_frames` is split across the workers, so each worker only allocates a smaller block, and aggregating with `ndarray_sum` or `ndarray_mean` leads to:
E AssertionError:
E Arrays are not almost equal to 4 decimals
E (shapes (2,), (4,) mismatch)
E ACTUAL: array([1., 1.])
E DESIRED: array([1, 1, 1, 1])
while using `ndarray_vstack` or `ndarray_hstack` leads to the following error:
`numpy.linalg.LinAlgError: Last 2 dimensions of the array must be square`
so I am not sure whether this can be adjusted somehow; I ran into a similar case while trying to parallelize `analysis.msd`.
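For reference, the aggregator wiring behind these errors is roughly the following (a sketch based on the parallel-analysis API introduced in MDAnalysis 2.8; the exact lookup key and backend list are assumptions):

```python
from MDAnalysis.analysis.base import AnalysisBase
from MDAnalysis.analysis.results import ResultsGroup


class DistanceMatrix(AnalysisBase):
    @classmethod
    def get_supported_backends(cls):
        return ("serial", "multiprocessing", "dask")

    def _get_aggregator(self):
        # each worker's _prepare only allocated a block for its own chunk
        # of frames, so summing/averaging the blocks keeps the wrong shape,
        # and vstack/hstack produces a non-square array
        return ResultsGroup(lookup={
            "dist_matrix": ResultsGroup.ndarray_sum,  # or ndarray_vstack / ndarray_hstack
        })
```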
PR Checklist
- [x] Tests?
- [ ] Docs?
- [ ] CHANGELOG updated?
- [x] Issue raised/referenced?
Developers certificate of origin
- [x] I certify that this contribution is covered by the LGPLv2.1+ license as defined in our LICENSE and adheres to the Developer Certificate of Origin.
📚 Documentation preview 📚: https://mdanalysis--4745.org.readthedocs.build/en/4745/
Hello @talagayev! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:
There are currently no PEP 8 issues detected in this Pull Request. Cheers! :beers:
hey @talagayev the msd PR seems to pass all the tests (#4896) -- do you think it's something you could use here as well?
I skimmed through the code and I think it overcomes your `self.n_frames` problem by setting up an array of shape `(..., self.n_frames)` before running the analysis, and later you can use that as a replacement for `self.n_frames`.
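Roughly, that pattern could look like this (a hypothetical sketch, not the actual #4896 code; the attribute names and the `ndarray_vstack` aggregation are assumptions):

```python
import numpy as np
from MDAnalysis.analysis.base import AnalysisBase
from MDAnalysis.analysis.results import ResultsGroup


class SomeAnalysis(AnalysisBase):
    def _prepare(self):
        # inside a worker, self.n_frames is only that worker's share of
        # frames, so each worker allocates a block sized for its own chunk
        self.results.timeseries = np.zeros((self.n_frames, 3))

    def _single_frame(self):
        # fill exactly one row per frame
        self.results.timeseries[self._frame_index] = self._ts.positions.mean(axis=0)

    def _get_aggregator(self):
        # stacking the per-worker blocks restores the full-length array
        return ResultsGroup(lookup={"timeseries": ResultsGroup.ndarray_vstack})

    def _conclude(self):
        # use the stacked array's length instead of self.n_frames,
        # which may only refer to one worker's chunk
        n_frames = self.results.timeseries.shape[0]
```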
Hey @marinegor,
Yes, I think that could be a solution for this PR as well; I will try to implement it in the coming days. Sadly I didn't have much time over the last couple of months, so I didn't look into the PR.
I also have a local draft for #4659 that would add parallelization for part of MDAnalysis.analysis.align. That PR is currently closed; is it fine if I ping you there in a couple of days, once I reopen it and push the changes?
Sure, no problem, ping me when you feel necessary!
Codecov Report
:x: Patch coverage is 96.15385% with 1 line in your changes missing coverage. Please review.
:white_check_mark: Project coverage is 92.72%. Comparing base (3189d48) to head (730cf94).
:warning: Report is 4 commits behind head on develop.
| Files with missing lines | Patch % | Lines |
|---|---|---|
| package/MDAnalysis/analysis/diffusionmap.py | 96.15% | 0 Missing and 1 partial :warning: |
Additional details and impacted files
Coverage diff:

| | develop (base) | #4745 (head) | +/- |
|---|---|---|---|
| Coverage | 92.72% | 92.72% | |
| Files | 180 | 180 | |
| Lines | 22458 | 22472 | +14 |
| Branches | 3186 | 3188 | +2 |
| Hits | 20824 | 20837 | +13 |
| Misses | 1177 | 1177 | |
| Partials | 457 | 458 | +1 |
@marinegor I tried several different approaches over the last couple of days, and this is the one that passes all pytests. For this option I had to move parts of the code from `_single_frame` to `_conclude`, since from what I understood the main difficulty in making `DistanceMatrix` parallelizable is that the calculation needs the information from all frames at once. I had to use AI in some cases to figure out that this was the main obstacle to parallelizing it.
I also looked at whether #4892 would help with the global information, but from what I can see it would still not provide the complete information that was available in the serial version (which is why the computation used to live in `_single_frame`). With parallelization the frames are split up and the workers don't have the information from the other workers, so here the calculation is done in `_conclude`. Because of that, the speed-up from parallelization should not be significant for this analysis compared to some other functions.
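To sketch the restructuring (a simplified illustration only; the per-frame coordinate storage, the attribute names, and the `rms.rmsd` metric shown here are assumptions rather than the exact PR code):

```python
import numpy as np
from MDAnalysis.analysis.base import AnalysisBase
from MDAnalysis.analysis import rms


class DistanceMatrixSketch(AnalysisBase):
    """Collect coordinates per frame, build the matrix only in _conclude."""

    def __init__(self, universe, select="all", **kwargs):
        super().__init__(universe.universe.trajectory, **kwargs)
        self._atoms = universe.select_atoms(select)

    def _prepare(self):
        # per-worker list of coordinate snapshots; lists are cheap to merge
        # during aggregation (e.g. with ResultsGroup.flatten_sequence)
        self.results.positions = []

    def _single_frame(self):
        # only store what is needed later, no pairwise work per frame
        self.results.positions.append(self._atoms.positions.copy())

    def _conclude(self):
        # after aggregation all frames are available, so the full
        # (n_frames, n_frames) matrix can be built regardless of how
        # the frames were distributed across workers
        coords = np.asarray(self.results.positions)
        n = len(coords)
        self.results.dist_matrix = np.zeros((n, n))
        for i in range(n):
            for j in range(i + 1, n):
                d = rms.rmsd(coords[i], coords[j], superposition=True)
                self.results.dist_matrix[i, j] = d
                self.results.dist_matrix[j, i] = d
```

The downside, as noted above, is that all the pairwise work happens serially in `_conclude`, so the parallel backends mostly speed up the coordinate collection.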