mdanalysis Make the `AnalysisBase` compatible with live simulation streams

Is your feature request related to a problem?

Related to #4827: for MDAnalysis to function with live simulation streams, the AnalysisBase must be modified to account for a few quirks of live-streamed trajectories:

Their length/end frame isn't known (so len(u.trajectory) and u.trajectory.n_frames will fail)
They can't be sliced in any way except u.trajectory[:] and u.trajectory[::n] (where n is positive) so they are inherently incompatible with parallelization schemes

Describe the solution you'd like

Like with parallelization, analysis classes that inherit from AnalysisBase should "opt-in" to stream usage through a class property similar to the parallelizable property (like streamable or similar). To opt-in, inherited classes must also avoid attempting to access the trajectory's length or slice it. The rms.RMSF class is already such a class.

The AnalysisBase must also have a _streamed_run method that never attempts to slice a trajectory or access its length. The imdclient MDAKit ships such a method already (and monkey patches it in): https://github.com/Becksteinlab/imdclient/blob/main/imdclient/streamanalysis.py

Describe alternatives you've considered

Don't allow live simulation streaming, or don't allow builtin analysis classes to be used with these streams

Dec 11 '24 21:12 ljwoods2

Can this issue be addressed independently from #4827 ?

Dec 18 '24 18:12 orbeckst

Can this issue be addressed independently from #4827 ?

Fixing this relies on being able to check if a reader inherits from StreamReaderBase (and failing if a stream reader tries to use an analysis class that isn't streamable) which would be a part of #4827- so good point, they are not entirely independent. The rest of the changes would be independent.

Dec 18 '24 18:12 ljwoods2

Generally, I advocate for including essential (and lightweight) components from IMDClient into MDAnalysis. Here are a few thoughts and comments:

Initialization of Results: I’m uncertain how you plan to handle the initialization of results for (almost all) analyses (e.g., example in RMS.py), considering you don’t know the n_frames beforehand.
Streamable Analyses: Some analyses may or may not be streamable under certain conditions. For example, in the case of RMSD, I assume that if the reference is an IMDReader and ref_frame is not zero, it would prevent the analysis from being streamable.
IMDReader Compatibility: Instead of modifying AnalysisBase to be streamable, would it be feasible to make IMDReader “AnalysisBase-compatible”? That is if necessary changes for IMDReader itself can be made to make iteration over IMDReader behaves similarly to other Readers?
Slicing and Stopping Iteration in IMDReader: Expanding on that, I wonder if disallowing the stop parameter for slicing IMDReader is strictly necessary. I may be missing some details, but it seems theoretically possible to stop IMDReader iteration after processing n frames. If IMDReader could support slicing (start=0, step, stop), it might naturally enable setting n_frames based on the slice.

Dec 20 '24 03:12 yuxuanzhuang

@yuxuanzhuang Thank you for the in-depth look!

Overall, I agree with you that we could adapt the reader to be more compatible.

As for the initialization of results, my thinking was that we would use python lists to hold intermediate results, and when iteration stops and _conclude is called, the list is converted to a numpy array and made available exactly as before.

Maybe we allow slicing, but raise a RuntimeError if the start frame isn't reached or the end frame isn't reached? What do we pass as the end frame if no stop is defined? In this case, we still would need the AnalysisBase modifications to handle an unknown-length stream- though this could be avoided if we require the stop kwarg for stream readers. In this case, should we still raise an error if that frame isn't reached? Or should we give people their intermediate results?

That might be the best path forward- requiring people to pass in an end frame, defaulting to 0 for the start frame, and failing out if the end frame isn't reached, perhaps with some kind of option to save intermediate results to csv or something. That would also allow us to avoid adding complexity with parsing logic for all the different analysis class' frame-related kwargs like ref_frame, and we could initialize results arrays with the required end frame arg as the length.

Edit: Say something like ref_frame is a frame in the middle of the trajectory- what do we do with the frames up until the ref_frame which isn't yet available (can't just jump directly to it)?

Dec 20 '24 20:12 ljwoods2

As for the initialization of results, my thinking was that we would use python lists to hold intermediate results, and when iteration stops and _conclude is called, the list is converted to a numpy array and made available exactly as before.

Many analyses also rely on array assignment within single_frame, so the list-appending approach wouldn’t work in these cases or unless you want to refactor the whole AnalysisBase. For example, see this usage in MSD analysis.

I don’t have a fully fleshed-out solution yet, but I wonder if it might be possible to implement a “block” analysis approach so it could be processed in chunks of a fixed number of frames, and then the results from each block could be concatenated during _stream_run at the end.

Dec 31 '24 04:12 yuxuanzhuang

Good idea.

I hope that the discussion on this issue leads to a feasible and general solution, similar to how the parallelization was refined through discussions. It‘s important for the future of MDAnalysis.

Dec 31 '24 08:12 orbeckst

I don’t have a fully fleshed-out solution yet, but I wonder if it might be possible to implement a “block” analysis approach so it could be processed in chunks of a fixed number of frames, and then the results from each block could be concatenated during _stream_run at the end.

This is a really clever idea IMO, if the trajectory could be processed in blocks guaranteeing that size == blocksize or size == mod(full_len, blocksize) for the end of the stream , then each can call a full AnalysisBase run and then concatenate the resulting results objects. Is this what you were thinking @yuxuanzhuang? It is similar to the way we approached parallelisation IIRC.

Jan 05 '25 20:01 hmacdope

I haven't had time to think this through fully yet, but one initial comment / thought based on @yuxuanzhuang's comment "Instead of modifying AnalysisBase to be streamable, would it be feasible to make IMDReader “AnalysisBase-compatible”?"

I'm not sure if it's in the same spirit, but I am concerned that we are following the path of "let's keep on adding things to AnalysisBase because we can".

It sounds to me like enabling analyses on streams is going to need a lot of analysis-specific work. Is this a fair assessment? If so, is it maybe worth consider if we need either a mixin or a separate subclass to make analyses stream compatible? (rather than adding more to the one base class)

Jan 06 '25 13:01 IAlibay

This is a really clever idea IMO, if the trajectory could be processed in blocks guaranteeing that size == blocksize or size == mod(full_len, blocksize) for the end of the stream , then each can call a full AnalysisBase run and then concatenate the resulting results objects. Is this what you were thinking @yuxuanzhuang? It is similar to the way we approached parallelisation IIRC.

Yes, exactly. With such, I think the existing _get_aggregator for parallel analysis should just be the right tool for concatenation so no extra function will be needed for that.

Jan 07 '25 20:01 yuxuanzhuang