
Cloud Data Access Within Providers

Dylan-Pugh opened this issue 4 years ago

I've been experimenting with using AWS S3 as a backing store within a custom provider (extending XarrayEDRProvider), but have been running into a few issues.

I've tried two different approaches so far (opening a NetCDF file stored in S3):

1. Retrieving the file with boto3 and saving to a tempfile

This approach works, but requires a few helper methods, and I can see it becoming a bit clunky going forward:

if provider_def['data'].startswith('s3://'):
    tmp = tempfile.NamedTemporaryFile(delete=False)
    self.fetch_s3_file(provider_def['data'], tmp.name)
    self.data = tmp.name

code for fetch_s3_file:

def fetch_s3_file(self, object_uri, out_file):
    bucket_name, key_name = self.split_s3_uri(object_uri)

    s3 = boto3.client('s3')

    try:
        s3.download_file(bucket_name, key_name, out_file)
    except Exception as err:
        LOGGER.warning(err)
        raise ProviderNoDataError(err)
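The `split_s3_uri` helper isn't shown in the issue; a minimal sketch of what it might look like, using only the standard library (the name and signature are assumptions, not the actual implementation):

```python
from urllib.parse import urlparse


def split_s3_uri(object_uri):
    """Split an s3:// URI into (bucket_name, key_name).

    Hypothetical helper; the real method in the custom provider
    is not shown in the issue.
    """
    parsed = urlparse(object_uri)
    if parsed.scheme != 's3':
        raise ValueError(f'Not an S3 URI: {object_uri}')
    # netloc is the bucket; path has a leading '/' to strip
    return parsed.netloc, parsed.path.lstrip('/')
```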

2. Using fsspec to handle S3 operations

This is my preferred method, but I've been running into some issues. The code below fails with pygeoapi.provider.base.ProviderConnectionError: I/O operation on closed file (presumably because the with block closes the file handle on exit, while xarray reads the data lazily and still needs the handle afterwards):

with fsspec.open(S3_URL) as f:
    ds = xr.open_dataset(f)

This code (using a work-around mentioned in this s3fs issue) does work, but is extremely slow:

self._data = open_func(fsspec.open(provider_def['data']).open())

Describe the solution you'd like

Ideally I'd like to get fsspec working, and then figure out the best way to integrate/abstract the solution, so we can use data from a number of different cloud services going forward.

Is anyone working on something similar?

I've reviewed these issues, which may provide additional context:

https://github.com/geopython/pygeoapi/issues/724

https://github.com/geopython/pygeoapi/issues/807

I'm also happy to provide more detail - thanks!

Dylan-Pugh avatar Nov 24 '21 20:11 Dylan-Pugh

Not that I have much to contribute atm, but I'm interested and supportive. I'd also like to extend Approach 2 to any S3-compatible provider. However, I can also see the merits of Approach 1 for the simplest providers (CSV, GeoJSON), which might be hosted at arbitrary HTTP URLs.

ksonda avatar Dec 09 '21 02:12 ksonda

@Dylan-Pugh, I am interested in doing exactly the same. Have you made any progress on your custom provider extending XarrayEDRProvider? Is there any interest or movement in integrating this capability into the main branch?

sjordan29 avatar Feb 17 '23 19:02 sjordan29

As per RFC4, this Issue has been inactive for 90 days. In order to manage maintenance burden, it will be automatically closed in 7 days.

github-actions[bot] avatar Mar 10 '24 21:03 github-actions[bot]

As per RFC4, this Issue has been closed due to there being no activity for more than 90 days.

github-actions[bot] avatar Mar 24 '24 03:03 github-actions[bot]