Make the `AnalysisBase` compatible with live simulation streams
Is your feature request related to a problem?
Related to #4827: for MDAnalysis to function with live simulation streams, the AnalysisBase must be modified to account for a few quirks of live-streamed trajectories:
- Their length/end frame isn't known (so
len(u.trajectory)andu.trajectory.n_frameswill fail) - They can't be sliced in any way except
u.trajectory[:]andu.trajectory[::n](where n is positive) so they are inherently incompatible with parallelization schemes
Describe the solution you'd like
Like with parallelization, analysis classes that inherit from AnalysisBase should "opt-in" to stream usage through a class property similar to the parallelizable property (like streamable or similar). To opt-in, inherited classes must also avoid attempting to access the trajectory's length or slice it. The rms.RMSF class is already such a class.
The AnalysisBase must also have a _streamed_run method that never attempts to slice a trajectory or access its length. The imdclient MDAKit ships such a method already (and monkey patches it in): https://github.com/Becksteinlab/imdclient/blob/main/imdclient/streamanalysis.py
Describe alternatives you've considered
Don't allow live simulation streaming, or don't allow builtin analysis classes to be used with these streams
Can this issue be addressed independently from #4827 ?
Can this issue be addressed independently from #4827 ?
Fixing this relies on being able to check if a reader inherits from StreamReaderBase (and failing if a stream reader tries to use an analysis class that isn't streamable) which would be a part of #4827- so good point, they are not entirely independent. The rest of the changes would be independent.
Generally, I advocate for including essential (and lightweight) components from IMDClient into MDAnalysis. Here are a few thoughts and comments:
-
Initialization of Results: I’m uncertain how you plan to handle the initialization of results for (almost all) analyses (e.g., example in RMS.py), considering you don’t know the
n_framesbeforehand. -
Streamable Analyses: Some analyses may or may not be streamable under certain conditions. For example, in the case of
RMSD, I assume that if the reference is anIMDReaderandref_frameis not zero, it would prevent the analysis from being streamable. -
IMDReader Compatibility: Instead of modifying
AnalysisBaseto be streamable, would it be feasible to makeIMDReader“AnalysisBase-compatible”? That is if necessary changes forIMDReaderitself can be made to make iteration overIMDReaderbehaves similarly to other Readers? -
Slicing and Stopping Iteration in IMDReader: Expanding on that, I wonder if disallowing the
stopparameter for slicingIMDReaderis strictly necessary. I may be missing some details, but it seems theoretically possible to stopIMDReaderiteration after processing n frames. IfIMDReadercould support slicing (start=0, step, stop), it might naturally enable settingn_framesbased on the slice.
@yuxuanzhuang Thank you for the in-depth look!
Overall, I agree with you that we could adapt the reader to be more compatible.
As for the initialization of results, my thinking was that we would use python lists to hold intermediate results, and when iteration stops and _conclude is called, the list is converted to a numpy array and made available exactly as before.
Maybe we allow slicing, but raise a RuntimeError if the start frame isn't reached or the end frame isn't reached? What do we pass as the end frame if no stop is defined? In this case, we still would need the AnalysisBase modifications to handle an unknown-length stream- though this could be avoided if we require the stop kwarg for stream readers. In this case, should we still raise an error if that frame isn't reached? Or should we give people their intermediate results?
That might be the best path forward- requiring people to pass in an end frame, defaulting to 0 for the start frame, and failing out if the end frame isn't reached, perhaps with some kind of option to save intermediate results to csv or something. That would also allow us to avoid adding complexity with parsing logic for all the different analysis class' frame-related kwargs like ref_frame, and we could initialize results arrays with the required end frame arg as the length.
Edit: Say something like ref_frame is a frame in the middle of the trajectory- what do we do with the frames up until the ref_frame which isn't yet available (can't just jump directly to it)?
As for the initialization of results, my thinking was that we would use python lists to hold intermediate results, and when iteration stops and _conclude is called, the list is converted to a numpy array and made available exactly as before.
Many analyses also rely on array assignment within single_frame, so the list-appending approach wouldn’t work in these cases or unless you want to refactor the whole AnalysisBase. For example, see this usage in MSD analysis.
I don’t have a fully fleshed-out solution yet, but I wonder if it might be possible to implement a “block” analysis approach so it could be processed in chunks of a fixed number of frames, and then the results from each block could be concatenated during _stream_run at the end.
Good idea.
I hope that the discussion on this issue leads to a feasible and general solution, similar to how the parallelization was refined through discussions. It‘s important for the future of MDAnalysis.
I don’t have a fully fleshed-out solution yet, but I wonder if it might be possible to implement a “block” analysis approach so it could be processed in chunks of a fixed number of frames, and then the results from each block could be concatenated during
_stream_runat the end.
This is a really clever idea IMO, if the trajectory could be processed in blocks guaranteeing that size == blocksize or size == mod(full_len, blocksize) for the end of the stream , then each can call a full AnalysisBase run and then concatenate the resulting results objects. Is this what you were thinking @yuxuanzhuang? It is similar to the way we approached parallelisation IIRC.
I haven't had time to think this through fully yet, but one initial comment / thought based on @yuxuanzhuang's comment "Instead of modifying AnalysisBase to be streamable, would it be feasible to make IMDReader “AnalysisBase-compatible”?"
I'm not sure if it's in the same spirit, but I am concerned that we are following the path of "let's keep on adding things to AnalysisBase because we can".
It sounds to me like enabling analyses on streams is going to need a lot of analysis-specific work. Is this a fair assessment? If so, is it maybe worth consider if we need either a mixin or a separate subclass to make analyses stream compatible? (rather than adding more to the one base class)
This is a really clever idea IMO, if the trajectory could be processed in blocks guaranteeing that
size == blocksizeorsize == mod(full_len, blocksize)for the end of the stream , then each can call a full AnalysisBase run and then concatenate the resulting results objects. Is this what you were thinking @yuxuanzhuang? It is similar to the way we approached parallelisation IIRC.
Yes, exactly. With such, I think the existing _get_aggregator for parallel analysis should just be the right tool for concatenation so no extra function will be needed for that.