mdanalysis icon indicating copy to clipboard operation
mdanalysis copied to clipboard

Lazy index building for XTC and TRR tajectories

Open jbarnoud opened this issue 3 years ago • 4 comments

Is your feature request related to a problem?

Reading an XTC or TRR trajectory for the first time can be very slow, especially for large files. This is due to the building of the index that allows us to access random frames.

We experienced issues due to the indexing with very large files (#3789) or with read-only file systems where the index had to be rebuilt each time.

Describe the solution you'd like

In most cases, however, we do not randomly access random frames. Instead, we iterate over the trajectory in order. There we do not need the index, so the overhead due to the first index building is unnecessary. We could build the index as we go by saving the offset when we encounter a new frame as we read the file.

This could cause issues with the parallel reading of trajectories. It could also make the reading slower in some cases, which is already an issue (#3731).

It is useful/important to know how many frames the trajectory contains, though. @Tsjerk has a nice trick where he reads the file backwards to find the last frame.

Describe alternatives you've considered

  • Not change anything. The current method works in most cases. However, large files are becoming more and more common.
  • Not build the index. That would indeed save the overhead. It would also break some assumptions in the current code base and prevent users from accessing some features when the index is not built.

Additional context

jbarnoud avatar Aug 26 '22 06:08 jbarnoud

It is useful/important to know how many frames the trajectory contains, though. @Tsjerk has a nice trick where he reads the file backwards to find the last frame.

Do XTC and TRR contain reliable information about the total number of frames (and total time) in the last (readable) frame?

orbeckst avatar Nov 02 '23 19:11 orbeckst

I do not know how @Tsjerk's magic trick really works and how reliable it is. In my experience, it was fairly reliable, but I was only using it on very typical trajectories that had the same properties as the ones for which Tjerk wrote the script. I might still have the script somewhere in my archives somewhere?

jbarnoud avatar Nov 04 '23 11:11 jbarnoud

In an XTC file, the header contains the size of the coordinate section in bytes, and the header itself has a fixed size and starts with the magic number 1993. For indexing (and reversing) an XTC file, I use this to jump through the file with relative seek calls. This works well. I thought MDAnalysis had already adopted that trick. I have been playing around with guesstimating the positions of all frames from the first and last frame sizes. Whether this is reliable depends on the trajectory. If, e.g., PBC jumps are removed, then the drift in size of coordinates causes a drift in the size of the frames, making it hard to get proper guesstimates. For TRR the case is different. These have no compression, so sizes are fixed. However, a frame may contain coordinates, velocities, forces, in any possible combination. That means that frame sizes may be different throughout te file, and it is really hard to determine the frame positions in the file through a simple calculation. Again, this means that stepping through the TRR with fseeks is the most straightforward; the header contains all information required to determine the next frame position. For both XTC and TRR this thus means: find frame start, read header, determine position of next frame, fseek, repeat.

Tsjerk avatar Nov 04 '23 15:11 Tsjerk

Oh, reading through the original issue: 1. Finding the last frame is done by reading the first and saving the header tag (8 bytes: magic number and number of atoms), determine the size of a frame, then going to a position estimated 3/2 frames from the end, reading in everything up to the end and rfind the tag. 2. For parallel processing, the different nodes may be given filepointers spaced throughout the file, find the first frame/header and from there traverse up to the position where another worker started.

Tsjerk avatar Nov 04 '23 15:11 Tsjerk