mdanalysis icon indicating copy to clipboard operation
mdanalysis copied to clipboard

`H5MDReader/Writer` undocumented limitations

Open edisj opened this issue 1 year ago • 5 comments

This is following up on the points brought up by @ljwoods2 in discussion https://github.com/MDAnalysis/mdanalysis/discussions/4559

The H5MD (https://www.nongnu.org/h5md/h5md.html#h5md-format-specification-version-work-version) format is very flexible in terms of which data the user decides to store from any molecular simulation engine. H5MDReader and H5MDWriter currently handle only a restricted subset of all possible h5md files, and as @ljwoods2 pointed out, this isn't documented/handled clearly.

The two undocumented limitations are:

  1. the /particles group (https://www.nongnu.org/h5md/h5md.html#particles-group) in an h5md file can have 1 or more subgroups that represent any arbitrary selection of atoms from a simulation, however in the current implementation of H5MDReader, it only reads in the first group found: https://github.com/MDAnalysis/mdanalysis/blob/73acc9bf9631c135dfd40322e121f379abfce198/package/MDAnalysis/coordinates/H5MD.py#L608-L611

  2. the ~/step and ~/time datasets (https://www.nongnu.org/h5md/h5md.html#explicit-step-and-time-storage) for any H5MD element can be unique to that element, e.g. one can write every integration timestep out for position data, but only write every other timestep out for velocity data. Currently, H5MDReader assumes all data in the input h5md file share a common step and time: https://github.com/MDAnalysis/mdanalysis/blob/73acc9bf9631c135dfd40322e121f379abfce198/package/MDAnalysis/coordinates/H5MD.py#L681-L693

Expected behavior

H5MDReader and H5MDWriter should catch these cases and raise an appropriate error

Actual behavior

For limitation 1: the reader just reads the positions, velocities, and forces of the first trajectory found and does not alert the user For limitation 2: the reader just reads the first step and time found and assumes the same value for other datasets

immediate solution

As per @orbeckst 's suggestion here https://github.com/MDAnalysis/mdanalysis/discussions/4559#discussioncomment-9038852, the an immediate solution is to fail explicitly in these cases and document the limitation.

future solution?

  1. Using an H5MD file to store an ensemble of trajectories actually sounds very useful. Could an extra keyword argument like which_trajectory or trajectory_name or something be used here to specify which group in /particles to load for the universe? An exception (or warning) can be thrown if the reader detects multiple trajectories and no argument passed, otherwise it would just load the first group in /particles if len(f['particles']) == 1 like before

  2. I think self.has_positions, self.has_velocities, etc. must be determined and set dynamically on a per-timestep basis if positions, velocities, and forces can be arbitrarily read for any timestep, unless the access pattern is obvious in cases where it's like "velocities on odd frames only". Would that mess with the performance of the buffer arrays in memory for a Timestep? From a quick glance it looks like it shouldn't be an issue because the Timestep will just throw a NoDataError but I'm not sure

edisj avatar Apr 08 '24 00:04 edisj

The TRR reader handles arbitrary occurrences of x, v, f in a Timestep. However, it's quite a bit slower that others (see Fig 9 in 10.25080/majora-1b6fd038-005). I don't know if the lower performance is (partially) due to checking x/v/f or due to other reasons.

orbeckst avatar Apr 09 '24 01:04 orbeckst

With GROMACS apparently committed to supporting H5MD, including the feature that x,v,f can be stored at separate, arbitrary steps, we should make sure that our implementation can do that properly.

orbeckst avatar May 30 '24 22:05 orbeckst

@orbeckst can I ask you about the multiple groups in /particles/ -

is there any case that makes sense for a mda.Universe to load multiple trajectories from multiple particle groups in /particles/? I can't think of how that would work... A universe should only contain a single trajectory, right?

If that's the case, then if a *.h5md file in the wild does contain multiple groups in /particles, the best way I can think of to handle which trajectory MDAnalysis chooses to load into ts.positions is by the user explicitly giving the group name as an argument.

The only other solutions I can think of is to fail right away, which I don't like because it's less flexible with the H5MD format, or to implicitly (and silently?) load only one of the groups, which I like even less...

edisj avatar May 31 '24 00:05 edisj

For right now, having a kwarg group for choosing the particle group such as group="<group 1>" makes most sense to me. group=None could be choosing the first one, as is currently done.

Later one could think about ways to combine groups to match the topology, although, given that these can be sampled at different times, this will likely be messy.

So I'd just start with giving the user the option to pick one of the groups.

orbeckst avatar May 31 '24 04:05 orbeckst

Just as a note: The sub kwarg of the XTCReader is describing somewhat related functionality, namely pulling a subsystem matching the topology out of a trajectory but sub requires a raw index array whereas group is a defined name (str).

orbeckst avatar May 31 '24 04:05 orbeckst