Cloud Data Access Within Providers
I've been experimenting with using AWS S3 as a backing store within a custom provider (extending XarrayEDRProvider), but have been running into a few issues.
I've tried two different approaches so far (opening a NetCDF file stored in S3):
1. Retrieving the file with boto3 and saving to a tempfile
This approach works, but requires a few helper methods, and I can see it becoming a bit clunky going forward:
    if provider_def['data'].startswith('s3://'):
        tmp = tempfile.NamedTemporaryFile(delete=False)
        self.fetch_s3_file(provider_def['data'], tmp.name)
        self.data = tmp.name
Code for fetch_s3_file:
    def fetch_s3_file(self, object_uri, out_file):
        bucket_name, key_name = self.split_s3_uri(object_uri)
        s3 = boto3.client('s3')
        try:
            s3.download_file(bucket_name, key_name, out_file)
        except Exception as err:
            LOGGER.warning(err)
            raise ProviderNoDataError(err)
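The split_s3_uri helper called above is not shown in this post. A minimal sketch of what it might look like, assuming URIs of the form s3://bucket/key and written here as a standalone function rather than a method (the body is an illustration, not the actual implementation):

```python
def split_s3_uri(object_uri):
    """Split 's3://bucket/key' into (bucket, key). Hypothetical helper."""
    prefix = 's3://'
    if not object_uri.startswith(prefix):
        raise ValueError(f'Not an S3 URI: {object_uri}')
    # Everything up to the first slash is the bucket; the rest is the object key.
    bucket_name, _, key_name = object_uri[len(prefix):].partition('/')
    if not bucket_name or not key_name:
        raise ValueError(f'S3 URI needs both a bucket and a key: {object_uri}')
    return bucket_name, key_name
```

Splitting on the first slash only keeps any further slashes (and characters such as ? or #) inside the key unchanged.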
2. Using fsspec to handle S3 operations
This is my preferred method, but I've been running into some issues. The following code fails with this error: pygeoapi.provider.base.ProviderConnectionError: I/O operation on closed file.
    with fsspec.open(S3_URL) as f:
        ds = xr.open_dataset(f)
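A likely cause (an assumption, not confirmed in this thread): xr.open_dataset reads lazily, so variable data is only fetched when it is first accessed, by which point the with block has already closed f. The stdlib-only sketch below reproduces the same failure mode with a hypothetical LazyReader class standing in for the lazily loaded dataset; it does not use xarray or fsspec.

```python
import io


class LazyReader:
    """Hypothetical stand-in for a lazily loaded dataset: keeps the handle, reads later."""

    def __init__(self, fileobj):
        self._fileobj = fileobj  # no bytes are read at construction time

    def load(self):
        self._fileobj.seek(0)
        return self._fileobj.read()


# Lazy pattern: the handle is closed before the first read happens.
with io.BytesIO(b'netcdf-bytes') as f:
    lazy = LazyReader(f)
try:
    lazy.load()
    lazy_error = None
except ValueError as err:
    lazy_error = str(err)  # the closed-file error surfaces here

# Eager pattern: read everything while the handle is still open.
with io.BytesIO(b'netcdf-bytes') as f:
    eager_bytes = LazyReader(f).load()
```

With xarray, the analogous eager step would be calling ds.load() inside the with block, at the cost of reading the whole dataset into memory.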
This code (using a work-around mentioned in this s3fs issue) does work, but is extremely slow:
    self._data = open_func(fsspec.open(provider_def['data']).open())
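One option that may help with the slowness, offered as an untested sketch: fsspec's simplecache:: URL chaining downloads the whole object to a local cache once, after which xarray reads from local disk instead of going back to S3 for each read. The names to_cached_url and open_cached_dataset and the cache_dir default are hypothetical; fsspec.open_local and xr.open_dataset are existing calls, and s3fs needs to be installed for s3:// URLs.

```python
def to_cached_url(url):
    """Prefix a remote URL with fsspec's whole-file local cache protocol."""
    if url.startswith('simplecache::'):
        return url
    return 'simplecache::' + url


def open_cached_dataset(url, cache_dir='/tmp/pygeoapi-cache', **open_kwargs):
    """Download url into a local cache, then open the local copy with xarray."""
    # Imported here so the URL helper above works without these packages installed.
    import fsspec
    import xarray as xr

    # open_local triggers the download and returns the path of the cached copy.
    local_path = fsspec.open_local(
        to_cached_url(url),
        simplecache={'cache_storage': cache_dir},
    )
    return xr.open_dataset(local_path, **open_kwargs)
```

Because the dataset is opened from a local path rather than a file object, there is no remote handle to keep open. The trade-off is disk usage and a full download up front, which is similar in effect to approach 1 but leaves the S3 handling to fsspec.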
Describe the solution you'd like
Ideally I'd like to get fsspec working, and then figure out the best way to integrate/abstract the solution, so we can use data from a number of different cloud services going forward.
Is anyone working on something similar?
I've reviewed these issues, which may provide additional context:
https://github.com/geopython/pygeoapi/issues/724
https://github.com/geopython/pygeoapi/issues/807
I'm also happy to provide more detail - thanks!
Not that I have much to contribute atm, but I'm interested and supportive. I would also like to extend Approach 2 to any S3-compatible provider. However, I can also see the merits of Approach 1 for the simplest providers (CSV, GeoJSON), whose data might be hosted at arbitrary HTTP URLs.
@Dylan-Pugh, I am interested in doing exactly the same. Have you made any progress on your custom provider extending XarrayEDRProvider? Is there any interest or movement in integrating this capability into the main branch?
As per RFC4, this Issue has been inactive for 90 days. In order to manage maintenance burden, it will be automatically closed in 7 days.
As per RFC4, this Issue has been closed due to there being no activity for more than 90 days.