PDBReader can stream from AWS S3 buckets with minimal modification
As per @hmacdope's request, here is how you can tweak the PDBReader with a few lines of code so that it reads from an AWS S3 bucket:
https://github.com/ljwoods2/mdanalysis/pull/2/files
This allows you to do something like this:
```python
import MDAnalysis as mda
from MDAnalysisTests.datafiles import PSF
import s3fs

s3_fs = s3fs.S3FileSystem(
    # anon must be False to allow authentication
    anon=False,
    # use profiles defined in a .aws/credentials file to store secret keys
    profile='sample_profile',
    client_kwargs=dict(
        region_name='us-west-1',
    ),
)

# PDB trajectory file is stored in an S3 bucket.
# Trajectory used is PDB_small from MDAnalysisTests.datafiles
file = s3fs.S3File(s3_fs, "zarrtraj-test-data/pdb_small.pdb")

u = mda.Universe(PSF, file, format="PDB")

for ts in u.trajectory:
    print(u.atoms)
```
This works because the PDBReader accepts file-like objects (and other formats potentially do as well; @orbeckst suggested the GRO format may support this too), and S3File objects implement that interface.
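To illustrate the duck typing involved, here is a minimal sketch: any object exposing the standard stream interface (`read()`, `readline()`, `seek()`) can stand in for a file on disk, which is exactly what S3File provides. The two-line PDB-style fragment below is illustrative, not taken from MDAnalysisTests:

```python
import io

# A short PDB-style fragment held entirely in memory. Any reader that
# only calls read()/readline()/seek() can consume this object exactly
# as it would an open file on disk.
pdb_text = (
    "ATOM      1  N   MET A   1      38.198  19.402  10.044  1.00  0.00\n"
    "END\n"
)

stream = io.StringIO(pdb_text)

# The stream satisfies the same interface an S3File provides:
first_line = stream.readline()
stream.seek(0)
whole_file = stream.read()

print(first_line.split()[0])  # ATOM
```

This is why no changes to the reader itself are needed; the S3-specific work is all in constructing the file-like object.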
For a large trajectory this would be extremely slow, but it could be sped up with caching.
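As a sketch of what such caching could look like, here is a read-through block cache: blocks are fetched from the slow "remote" file once and repeated reads are served from memory. The `BlockCachedReader` class is purely illustrative and not part of MDAnalysis or s3fs (s3fs/fsspec files also offer built-in caching strategies of their own):

```python
import io


class BlockCachedReader:
    """Illustrative read-through cache: fetch fixed-size blocks from a
    slow file-like object once, then serve repeated reads from memory."""

    def __init__(self, raw, block_size=1024):
        self.raw = raw
        self.block_size = block_size
        self.blocks = {}          # block index -> cached bytes
        self.pos = 0
        raw.seek(0, 2)            # seek to end to learn the total size
        self.size = raw.tell()

    def _block(self, idx):
        # Fetch a block from the underlying file only on first access.
        if idx not in self.blocks:
            self.raw.seek(idx * self.block_size)
            self.blocks[idx] = self.raw.read(self.block_size)
        return self.blocks[idx]

    def read(self, n=-1):
        if n < 0:
            n = self.size - self.pos
        out = bytearray()
        while n > 0 and self.pos < self.size:
            idx, off = divmod(self.pos, self.block_size)
            chunk = self._block(idx)[off:off + n]
            out += chunk
            self.pos += len(chunk)
            n -= len(chunk)
        return bytes(out)

    def seek(self, pos, whence=0):
        if whence == 0:
            self.pos = pos
        elif whence == 1:
            self.pos += pos
        else:
            self.pos = self.size + pos
        return self.pos

    def tell(self):
        return self.pos


# Simulate a "remote" file with an in-memory buffer.
remote = io.BytesIO(b"ATOM" * 1000)
cached = BlockCachedReader(remote, block_size=256)
data1 = cached.read(8)
cached.seek(0)
data2 = cached.read(8)   # second read is served from the cache
```

Iterating over a trajectory involves many small sequential reads, so batching them into larger cached blocks is where the speedup would come from.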
Related to https://github.com/MDAnalysis/mdanalysis/issues/4139
There's nothing in the code that needs changing, so this is more of a "let's document fun things one can do", perhaps for a "hacking around MDAnalysis" section.
Also note that this functionality would stop working if we were to switch to accelerated text-based readers based on a C++/Cython implementation.