PDBReader can stream from AWS S3 buckets with minimal modification
As per @hmacdope's request, here is how you can tweak the PDBReader with a few lines of code so that it reads from an AWS S3 bucket:
https://github.com/ljwoods2/mdanalysis/pull/2/files
This allows you to do something like this:
```python
import MDAnalysis as mda
from MDAnalysisTests.datafiles import PSF
import s3fs

s3_fs = s3fs.S3FileSystem(
    # anon must be False to allow authentication
    anon=False,
    # use profiles defined in a .aws/credentials file to store secret keys
    profile='sample_profile',
    client_kwargs=dict(
        region_name='us-west-1',
    ),
)

# PDB trajectory file is stored in an S3 bucket.
# Trajectory used is PDB_small from MDAnalysisTests.datafiles
file = s3fs.S3File(s3_fs, "zarrtraj-test-data/pdb_small.pdb")

u = mda.Universe(PSF, file, format="PDB")

for ts in u.trajectory:
    print(u.atoms)
```
This works because the PDBReader accepts file-like objects (and other formats potentially do as well; @orbeckst suggested the GRO format may support this too), and S3File objects implement that interface.
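To illustrate the duck typing involved, here is a minimal sketch: any object exposing the standard stream interface (`read()`, `readline()`, `seek()`) can stand in for a file on disk, which is exactly what S3File provides. The two-line PDB-style fragment below is illustrative, not taken from MDAnalysisTests:

```python
import io

# A short PDB-style fragment held entirely in memory. Any reader that
# only calls read()/readline()/seek() can consume this object exactly
# as it would an open file on disk.
pdb_text = (
    "ATOM      1  N   MET A   1      38.198  19.402  10.044  1.00  0.00\n"
    "END\n"
)

stream = io.StringIO(pdb_text)

# The stream satisfies the same interface an S3File provides:
first_line = stream.readline()
stream.seek(0)
whole_file = stream.read()

print(first_line.split()[0])  # ATOM
```

This is why no changes to the reader itself are needed; the S3-specific work is all in constructing the file-like object.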
For a large trajectory this would be extremely slow, but it could be sped up with caching.
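As a sketch of what such caching could look like, here is a read-through block cache: blocks are fetched from the slow "remote" file once and repeated reads are served from memory. The `BlockCachedReader` class is purely illustrative and not part of MDAnalysis or s3fs (s3fs/fsspec files also offer built-in caching strategies of their own):

```python
import io


class BlockCachedReader:
    """Illustrative read-through cache: fetch fixed-size blocks from a
    slow file-like object once, then serve repeated reads from memory."""

    def __init__(self, raw, block_size=1024):
        self.raw = raw
        self.block_size = block_size
        self.blocks = {}          # block index -> cached bytes
        self.pos = 0
        raw.seek(0, 2)            # seek to end to learn the total size
        self.size = raw.tell()

    def _block(self, idx):
        # Fetch a block from the underlying file only on first access.
        if idx not in self.blocks:
            self.raw.seek(idx * self.block_size)
            self.blocks[idx] = self.raw.read(self.block_size)
        return self.blocks[idx]

    def read(self, n=-1):
        if n < 0:
            n = self.size - self.pos
        out = bytearray()
        while n > 0 and self.pos < self.size:
            idx, off = divmod(self.pos, self.block_size)
            chunk = self._block(idx)[off:off + n]
            out += chunk
            self.pos += len(chunk)
            n -= len(chunk)
        return bytes(out)

    def seek(self, pos, whence=0):
        if whence == 0:
            self.pos = pos
        elif whence == 1:
            self.pos += pos
        else:
            self.pos = self.size + pos
        return self.pos

    def tell(self):
        return self.pos


# Simulate a "remote" file with an in-memory buffer.
remote = io.BytesIO(b"ATOM" * 1000)
cached = BlockCachedReader(remote, block_size=256)
data1 = cached.read(8)
cached.seek(0)
data2 = cached.read(8)   # second read is served from the cache
```

Iterating over a trajectory involves many small sequential reads, so batching them into larger cached blocks is where the speedup would come from.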
Related to https://github.com/MDAnalysis/mdanalysis/issues/4139
There's nothing in the code that needs changing, so this is more of a "let's document fun things one can do", perhaps for a "hacking around MDAnalysis" section.
Also note that this functionality would stop working if we were to switch to accelerated text-based readers based on a C++/Cython implementation.