intake-xarray

recursively combine items from catalog using xarray.auto_combine

Open rabernat opened this issue 7 years ago • 5 comments

Xarray has a function called auto_combine which takes several datasets and combines them into one, using a series of heuristics to figure out the best way to merge. This is used internally in open_mfdataset.
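For readers following along, here is a minimal sketch of the kind of combination described above. It uses xr.combine_by_coords rather than auto_combine, because later xarray releases deprecated auto_combine in favour of combine_by_coords and combine_nested. The make_piece helper and the "temperature" variable are made up for illustration.

```python
import numpy as np
import xarray as xr


def make_piece(times):
    """Build a tiny dataset with one variable along a 'time' dimension (illustrative only)."""
    return xr.Dataset(
        {"temperature": ("time", np.arange(len(times), dtype=float))},
        coords={"time": times},
    )


first = make_piece([0, 1, 2])
second = make_piece([3, 4, 5])

# The input order does not matter here: combine_by_coords orders the pieces
# by their dimension coordinate values before concatenating them.
combined = xr.combine_by_coords([second, first])
print(combined["time"].values)
```

The same call is what open_mfdataset relies on in recent xarray versions when combine="by_coords" is requested.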

It would be quite cool if I could take an intake catalog containing many xarray datasets in a hierarchy and open any point of this hierarchy into a single xarray dataset.
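As a rough sketch of what "open any point of this hierarchy" could mean, the snippet below walks a nested mapping and combines every leaf under a chosen node. Everything in it is an assumption for illustration: the nested dict stands in for a catalog, strings stand in for datasets, and iter_leaves and open_subtree are hypothetical names, not intake functions.

```python
from typing import Any, Callable, Iterator, Mapping, Sequence


def iter_leaves(node: Any) -> Iterator[Any]:
    """Yield every non-mapping value found under ``node``, depth first."""
    if isinstance(node, Mapping):
        for child in node.values():
            yield from iter_leaves(child)
    else:
        yield node


def open_subtree(tree: Mapping, path: Sequence[str], combine: Callable[[list], Any]) -> Any:
    """Descend along ``path`` into ``tree``, then combine all leaves below that point."""
    node: Any = tree
    for key in path:
        node = node[key]
    return combine(list(iter_leaves(node)))


# Hypothetical hierarchy: model -> experiment -> variable -> "dataset".
# Strings stand in for datasets so the sketch runs without xarray.
tree = {
    "modelA": {
        "historical": {"tas": "A-hist-tas", "pr": "A-hist-pr"},
        "future": {"tas": "A-fut-tas"},
    },
    "modelB": {"historical": {"tas": "B-hist-tas"}},
}

subset = open_subtree(tree, ["modelA"], combine=sorted)
print(subset)
```

With real datasets, the combine argument would be an xarray combining function instead of sorted.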

Sorry for the vague issue, but I just wanted to jot this idea down before I forgot it. Could be very useful in multiple contexts (e.g. https://github.com/NCAR/intake-siphon/pull/2).

rabernat, Jan 20 '19 16:01

I believe the AliasSource might be a good model to derive from for this, where you pass the original catalogue plus some parameters to the CombinedXarraySource (or whatever) and it instantiates the xarrays from the cat for each input parameter and then calls auto_combine.
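The shape of that idea can be sketched without any intake machinery. This is not the AliasSource API: CombinedSource, its arguments, and the load_month loader are all hypothetical, and a plain dict of callables stands in for the catalogue.

```python
from typing import Any, Callable, Iterable, Mapping


class CombinedSource:
    """Hypothetical source: load one catalogue entry per parameter set, then combine."""

    def __init__(
        self,
        catalog: Mapping[str, Callable[..., Any]],
        entry: str,
        parameter_sets: Iterable[dict],
        combine: Callable[[list], Any],
    ) -> None:
        self.catalog = catalog
        self.entry = entry
        self.parameter_sets = list(parameter_sets)
        self.combine = combine

    def read(self) -> Any:
        # Instantiate the entry once per parameter set, then hand all pieces to combine.
        loader = self.catalog[self.entry]
        pieces = [loader(**params) for params in self.parameter_sets]
        return self.combine(pieces)


def load_month(year: int, month: int) -> list:
    """Stand-in loader; a real one would return an xarray dataset."""
    return [f"{year}-{month:02d}"]


catalog = {"monthly": load_month}
source = CombinedSource(
    catalog,
    "monthly",
    parameter_sets=[{"year": 2019, "month": m} for m in (1, 2, 3)],
    combine=lambda pieces: [item for piece in pieces for item in piece],
)
result = source.read()
print(result)
```

Keeping combine as an argument leaves the choice of xarray combining function to the caller.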

martindurant, Jan 23 '19 15:01

@martindurant, I'm afraid I can't understand how to use AliasSource. Could you give an example?

rabernat, Feb 22 '19 21:02

I'll try to knock something up for you, @rabernat. However, were you expecting a source which could:

  • make a combined xarray out of several that have already been defined in a catalog; perhaps all the entries in a catalog matching some name pattern or condition, or
  • something that can automatically combine all sources, say, within a thredds server matching a path or other condition
  • something else?

martindurant, Feb 22 '19 21:02

Sorry this has slipped through the net. Is there still a need here? I see that in some intake-related repos, there are already ways to combine xarray-compatible datasets, but maybe we still want something in intake-xarray itself.

martindurant, Apr 28 '19 21:04

This is of interest to me also.

We have several datasets consisting of hourly output of N-dimensional gridded data in netCDF format (tens of thousands of files).

I have played with defining the urlpath with parameters, which works well for opening a single time point. Using urlpath with a glob pattern also works, ultimately calling open_mfdataset to get all the metadata.

I have looked through the code of intake-esm to try and get a better understanding of how that works too.

The metadata (aside from the time coordinate) is consistent across the files, so theoretically I should be able to read the first file and infer from the filenames the complete stack. But I am failing to understand how to implement this in Intake and would appreciate some pointers.
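One way to picture the "infer from the filenames" step is to build the time coordinate from the paths alone, without opening any file. The sketch below assumes a made-up naming scheme, <prefix>_YYYYMMDD_HH.nc; the pattern and the helper names are hypothetical and would need adapting to the real files.

```python
import re
from datetime import datetime
from typing import Iterable, List, Tuple

# Assumed (hypothetical) naming scheme: <prefix>_YYYYMMDD_HH.nc
FILENAME_PATTERN = re.compile(r"_(\d{8})_(\d{2})\.nc$")


def time_from_filename(path: str) -> datetime:
    """Parse the timestamp encoded in a file name."""
    match = FILENAME_PATTERN.search(path)
    if match is None:
        raise ValueError(f"cannot infer a timestamp from {path!r}")
    date_part, hour_part = match.groups()
    return datetime.strptime(date_part + hour_part, "%Y%m%d%H")


def build_time_index(paths: Iterable[str]) -> List[Tuple[datetime, str]]:
    """Return (timestamp, path) pairs sorted by time."""
    return sorted((time_from_filename(p), p) for p in paths)


def select_paths(index: List[Tuple[datetime, str]], start: datetime, end: datetime) -> List[str]:
    """Keep only the files whose timestamp lies in the closed interval [start, end]."""
    return [path for stamp, path in index if start <= stamp <= end]


paths = [
    "run/output_20190115_01.nc",
    "run/output_20190115_00.nc",
    "run/output_20190114_23.nc",
]
index = build_time_index(paths)
wanted = select_paths(index, datetime(2019, 1, 15, 0), datetime(2019, 1, 15, 23))
print(wanted)
```

With an index like this, a time slice only needs to touch the files that fall inside it, and the first file can supply the remaining metadata.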

From a user perspective, the functionality I am going for is:

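import intake

cat = intake.Catalog('catalog.yaml')
ds = cat['dataset'].to_xarray()  # desired API: one dataset for the whole file stack
# Crop by longitude, latitude, and time (ISO dates avoid day/month ambiguity)
cropped = ds.sel(lon=slice(110, 113), lat=slice(-33, -30), time=slice('2019-01-15', '2019-02-15'))
cropped.to_netcdf(...)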

Should I write a plugin using the DataSourceMixin? I'm trying to work out how the file paths are mapped into the metadata for the concatenated coordinate (time, in my case).

Thanks for any suggestions.

pbranson, Jul 17 '19 11:07