
Extend Iris loading to accept S3 URLs

JoshuaWiggs opened this issue 10 months ago • 20 comments

✨ Feature Request

Add capability to the Iris loading and io modules to utilise an S3 URL to load a data cube into memory.

Proposed usage would be something like cube = iris.load("s3://some-bucket/some-object")

Motivation

This feature would allow us to make better use of our AWS cloud-based platforms by removing the need to copy data files from our object store to a mounted file system before working with them.

Additional context

This feature is required in order to allow us to remove the duplication of the S3 input and output data. It would allow us to keep a single instance of the input data in the S3 input bucket, used directly by our science workflows, and so reduce our FSx storage capacity, which is currently our greatest spend on AWS infrastructure.

This could be accomplished by adding an S3 loading method into Iris utilising the boto3 (https://boto3.amazonaws.com/v1/documentation/api/latest/index.html) Python library. This would add that library as an optional dependency, required only for this loading method.

The scope of this feature is limited to loading data from an S3 bucket. Additionally, this feature will assume that you have already correctly configured your environment to access the S3 bucket that is being targeted.

JoshuaWiggs avatar Mar 21 '25 11:03 JoshuaWiggs

From @SciTools/peloton : we should set up an internal meeting to discuss this and scope out the requirements.

ukmo-ccbunney avatar Mar 26 '25 10:03 ukmo-ccbunney

OK, I couldn't resist looking into this, because I thought there was a "quick win". On reviewing the features of netcdf4-python, I think it does after all provide what is needed: it can create an in-memory Dataset from a byte buffer.

I have made a POC here

In short, this shows that you can load a netCDF file into a memory buffer, make a netCDF4.Dataset from that, and load that directly into Iris. I presume that S3 loading can generate a bytes buffer object, just like a file read; I don't actually know too much about that bit.

pp-mo avatar Apr 30 '25 11:04 pp-mo

S3 loading can generate a bytes buffer object

Maybe relevant? https://stackoverflow.com/questions/36205481/read-file-content-from-s3-bucket-with-boto3

pp-mo avatar Apr 30 '25 11:04 pp-mo

Just demoed this at AVD "Surgery" day. Very exciting to see it working!

But then ... found that netCDF "Proxy" objects don't function on in-memory datasets, because they expect to re-open from a dataset path.
Temporarily worked around with iris.fileformats.netcdf.loader._LAZYVAR_MIN_BYTES = 1e10. This could probably be fixed properly by making our creation of the DataProxy objects aware of in-memory datasets. Or possibly it makes sense to just "not do lazy" in these cases, as the above setting enforces??

Since then, we have also discussed whether supporting S3 as a specific URL option is a "Good Idea"™:

  • ✔ utility for users : just replace path with url
  • ❌ means extra dependency (boto3?) and feature maintenance
  • ❌ most existing filetypes are plain unsuitable, since S3 doesn't allow partial access
    • .... and we don't (yet) support Zarr (directly), which is the obvious choice
    • ?? but could re-introduce a kind of laziness ==> scan one-by-one (multiple smaller files -- see below); extract metadata; don't store it all
  • probably need to support wildcards + multiples, as for regular loads
  • probably support other filetypes, e.g. PP --> ?? all, i.e. (somehow) independent of type ??

pp-mo avatar May 01 '25 11:05 pp-mo

Script used to demonstrate loading from an S3 URL

import boto3
import iris
import iris.fileformats.netcdf.loader
import netCDF4 as nc
import iris.quickplot as qplt
import matplotlib.pyplot as plt


def main():

    # Create an S3 client
    s3 = boto3.client('s3')

    # Specify the bucket name and file name
    bucket_name = 's3_bucket'
    file_name = 'file.nc'

    # Download the object from S3
    obj = s3.get_object(
        Bucket=bucket_name,
        Key=file_name
    )

    # Read the file content
    file_content = obj['Body'].read()

    ncds = nc.Dataset("in-memory.nc", memory=file_content)

    iris.fileformats.netcdf.loader._LAZYVAR_MIN_BYTES = 1e12
    cubes = iris.load(ncds)

    for cube in cubes:
        qplt.contourf(cube)
        plt.show()

if __name__ == "__main__":
    main()

JoshuaWiggs avatar May 01 '25 15:05 JoshuaWiggs

@cpelley pinging you for your interest

JoshuaWiggs avatar May 02 '25 12:05 JoshuaWiggs

Let's go for it 👍

We'd like to deliver this in Iris 3.14. To use our typical sprint sizing, we expect this to be a 'large' task.

Concerns we had

  • Building Boto3 knowledge into Iris, so that it becomes part of our maintenance burden?
    • Boto3 apparently has a reputation for stability, and uses SemVer.
    • All our code would be in a small corner of Iris, so wouldn't get in many people's way when developing.
  • Are we encouraging bad practices? I.e. NetCDF on S3 ONLY enables access to the whole file, never part of it.
    • The front-end API we write could extend to PP/GRIB (or Zarr in future), all of which make a lot more sense for this.
    • With NetCDF, there is no alternative that we would be discouraging.

Design

  • Should be an experimental feature.
    • We'll be relying on users to try it out.
    • Can achieve this by adding a run-time flag in iris.experimental, which can be checked in an if block during the standard loading process (see the sketch after this list). Previous example: https://github.com/SciTools/iris/blob/e4191fbe868b5dbdd7fabf22d938ee6111b0fb08/lib/iris/experimental/ugrid.py#L135-L136
  • The loading 'chain' should automatically detect S3 URL(s) in the URI(s) provided.
  • Design should allow for future S3 loading from other file formats.
    • PP and GRIB both use self-describing 2D slices ('fields'), so are ideal for downloading subsets of a larger dataset.
    • Zarr is designed with cloud in mind, although I don't know the details yet.
  • Should only support NetCDF for now - using the approach described in previous comments here.
  • Laziness should be disabled for S3 loading - we can't see an alternative when loading directly from memory.
  • Attempt to support wildcards in the URI(s), if possible.
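
For illustration, a minimal sketch of what such a run-time flag could look like, modelled on the PARSE_UGRID_ON_LOAD example linked above. All names here are hypothetical, not Iris API:

import threading
from contextlib import contextmanager


class _S3LoadFlag(threading.local):
    """Thread-safe opt-in flag for experimental S3 loading (hypothetical)."""

    def __init__(self):
        self._enabled = False

    def __bool__(self):
        return self._enabled

    @contextmanager
    def context(self):
        # Enable S3 loading only within a scoped block.
        self._enabled = True
        try:
            yield
        finally:
            self._enabled = False


LOAD_S3 = _S3LoadFlag()

# Hypothetical check inside the standard loading 'chain':
#     if LOAD_S3 and uri.startswith("s3://"):
#         ...divert to the S3 loading path...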

trexfeathers avatar Jun 24 '25 11:06 trexfeathers

This is excellent news, let me know if there is anything we can do to help you with this process.

I was going to reach out about the possibility of adding a save to S3 to complete the I/O offering from Iris. I haven't investigated the feasibility of this integration yet, but I'm guessing it would be best to raise as a separate issue?

JoshuaWiggs avatar Jun 24 '25 12:06 JoshuaWiggs

@JoshuaWiggs save to S3

Not thought of that case! I think it probably can be done diskless, somehow, using the same 'in-memory-dataset' option in netCDF4-python, as in the above example.

As it was for loading, Iris can already save directly to an open netCDF dataset (#5214), as this was likewise needed for ncdata integration.

However, the catch is, I think, that you must specify the netcdf saver function explicitly: because the target isn't a filepath with an extension, Iris can't choose it automatically. The ncdata "from_iris" function uses it here: in fact, as you can see, the necessary code to do this is perhaps unexpectedly "fussy".
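
To make that concrete, a rough, untested sketch of diskless saving. It assumes (per #5214) that iris.save accepts an open netCDF4.Dataset as its target, that the saver must be named explicitly ("nc") since there is no filename extension to sniff, and that netCDF4's in-memory write mode is available; the function and its arguments are illustrative only:

import boto3
import iris
import netCDF4 as nc


def save_cubes_to_s3(cubes, bucket_name, key):
    # Open a writable in-memory dataset; netCDF4 manages the buffer.
    ds = nc.Dataset("in-memory.nc", mode="w", memory=1024)
    # Name the saver explicitly - an open Dataset has no file extension.
    iris.save(cubes, ds, saver="nc")
    # For in-memory writes, close() returns a memoryview of the file bytes.
    buffer = ds.close()
    boto3.client("s3").put_object(Bucket=bucket_name, Key=key, Body=bytes(buffer))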


@JoshuaWiggs feasibility of this integration ... would be best to raise as a separate issue?

@trexfeathers do we want to consider this for inclusion here -- or is that unwanted scope creep at this point ??

pp-mo avatar Jun 24 '25 13:06 pp-mo

from @SciTools/peloton:

We think implementing saving is a good idea, but we do think it should be raised as a separate issue to avoid scope creep. Thanks @JoshuaWiggs

ESadek-MO avatar Jun 25 '25 09:06 ESadek-MO

New issue #6535 raised for saving.

JoshuaWiggs avatar Jun 26 '25 14:06 JoshuaWiggs

@pp-mo @ukmo-ccbunney have you looked at whether iris could use fsspec and provide access to working with remote files on many different protocols, not just s3 (boto)?

jamesp avatar Sep 22 '25 14:09 jamesp

@pp-mo @ukmo-ccbunney have you looked at whether iris could use fsspec and provide access to working with remote files on many different protocols, not just s3 (boto)?

We investigated it from an S3 perspective. As far as we could tell, fsspec still downloads the file in the background, so it offers only convenience, not any performance benefit, and would not deliver on @JoshuaWiggs' requirements.

Perhaps a case could be made for the convenience of many different protocols, as you say, but I'm generally against adding API if it just saves users a couple of steps - the cost-benefit may not be worth it given how long API sticks around for in Semantic Versioning. If it would make users' lives significantly easier then certainly worth considering.

trexfeathers avatar Sep 22 '25 14:09 trexfeathers

Ah OK, I was looking at this buffering and random access, which suggests it doesn't download the whole file, but obviously the devil is in the detail that you've no doubt already encountered.

jamesp avatar Sep 22 '25 15:09 jamesp

I think that still downloads part of the file (with the size defined by the block size). We want to be able to stream the whole object straight into a memory buffer and then work with it directly in RAM. The ultimate goal is to allow us to do entirely diskless operations on data stored in S3.

JoshuaWiggs avatar Sep 22 '25 15:09 JoshuaWiggs

I must be missing something about your specific use case, which sounds pretty specialised, but I'm pretty sure fsspec can read into an in-memory buffer without touching the filesystem. Whatever implementation you choose you will at some level need a block / chunk size to stream the data over the network.
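
For what it's worth, a minimal untested sketch of that, assuming the s3fs package is installed and credentials are already configured (bucket and file names are made up):

import fsspec
import iris
import netCDF4 as nc

# Stream the whole object into a bytes buffer - no filesystem involved.
with fsspec.open("s3://some-bucket/file.nc", mode="rb") as f:
    file_content = f.read()

# Then load exactly as in the earlier demo script.
cubes = iris.load(nc.Dataset("in-memory.nc", memory=file_content))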

jamesp avatar Sep 23 '25 08:09 jamesp

I'm pretty sure fsspec can read into an in-memory buffer without touching the filesystem

Sounds like it's worth another look!

trexfeathers avatar Sep 23 '25 08:09 trexfeathers

@trexfeathers As far as we could tell: fsspec still downloads the file in the background ... not any performance benefit

@jamesp I was looking at this buffering and random access which suggests it doesn't download the whole file

@JoshuaWiggs still downloads part of the file ... want to be able to stream the whole object straight into a memory buffer ... to do entirely diskless operations

@jamesp I'm pretty sure fsspec can read into an in-memory buffer without touching the filesystem

Following a discussion here with Tom Gale from the Australian BoM, I now believe that fsspec can do this, which I had not realised.
But I think, from a brief look, not boto3, which can only download a "whole" object. (N.B. I believe in this context 's3fs' essentially is fsspec for S3.) So I no longer think that "whole-object-only access" is a real problem.

But read on ...

pp-mo avatar Oct 01 '25 14:10 pp-mo

Alternative approach?

While investigating s3fs, I have now found s3fs-fuse. This allows you to mount an S3 bucket as a filesystem (at least on unix-like OSes). The potential benefits of this approach are pretty clear:

  • S3 support for any file-format -- since access is through a file-system interface
  • save as well as load -- likewise

I've cooked up a demonstration example how this could be supported in Iris : https://github.com/SciTools/iris/pull/6731

However, it has also now been pointed out that, whereas native Iris support has definite and possible drawbacks, if the use of s3fs-fuse were left to the user, that avoids most of the problems -- at least, they can be more readily solved in the context of specific usages. It's all pretty simple:

  • add "f3fs-fuse" to the Python environment
  • execute $ s3fs <s3-bucket-name> <mount-path>
  • run Python script, in which files are referenced by paths under <mount-path>
  • when finished, $ umount <mount-path>

What does anyone else think of re-framing the solution in this way?

pp-mo avatar Oct 01 '25 14:10 pp-mo

I need to understand a little more about how 'mounting' the S3 bucket actually happens. I can pop into surgery tomorrow to discuss, @pp-mo, if that would be helpful?

JoshuaWiggs avatar Oct 01 '25 14:10 JoshuaWiggs