intake-xarray

`to_dask()` not lazy when `simplecache::` in urlpath

Open aaronspring opened this issue 5 years ago • 1 comments

When calling `to_dask()` with caching as in https://github.com/pangeo-data/pangeo-datastore/issues/113, `fsspec.open_local` first downloads the whole dataset and only then opens the data in xarray. The result is still chunked, but only after the time has been spent on downloading.

Is there a way to circumvent this in intake-xarray, or is this a consequence of fsspec caching that cannot be changed from intake-xarray?

It would be great to just call `to_dask()` without spending the time to download, and only cache when xarray runs compute.
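For reference, the eager download can be reproduced with fsspec alone, without intake or xarray. This is a minimal sketch under assumptions: a local temporary file stands in for the remote dataset, and the file size, the paths and the `cache_dir` name are made up for illustration.

```python
import os
import tempfile

import fsspec

# Assumption: a local temp file plays the role of the remote dataset.
src_dir = tempfile.mkdtemp()
cache_dir = tempfile.mkdtemp()  # hypothetical cache location
src = os.path.join(src_dir, "data.bin")
with open(src, "wb") as f:
    f.write(b"\0" * 1_000_000)

# open_local has to hand back a real path on disk, so simplecache
# copies the entire file into cache_dir here, before anything is read.
local_path = fsspec.open_local(
    f"simplecache::file://{src}",
    simplecache={"cache_storage": cache_dir},
)

# The cached copy already has the full size of the source file.
cached_size = os.path.getsize(local_path)
```

With a real remote URL in place of the `file://` path, that copy is the download I would like to defer until compute.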

aaronspring, Aug 03 '20 16:08

Whilst this may be possible, it would be tricky. Dask wants to open the file to assess the chunking; in theory, that could be done on the original file, with caching happening only when the data is actually loaded. There is a block-wise cacher in fsspec, which only downloads the parts of a file that are accessed, as they are accessed, but that only works with a library expecting to work with Python file-like objects (i.e., there's a reason to call `open_local`: the library wants a real local file). You could do something with FUSE, where the file looks real to the OS but uses block-wise chunking internally. I'm pretty sure this kind of thing has never been tried.
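To make the block-wise idea concrete, here is a toy sketch in plain Python. It is not fsspec's implementation (in fsspec the block-wise cacher is selected with the `blockcache::` protocol prefix); the `BlockCachedReader` class, the `fetch_range` callback and the block size are all hypothetical names and values chosen for illustration. The point is that only the blocks touched by a read are ever fetched, which is why the consumer has to go through a file-like object rather than a path.

```python
class BlockCachedReader:
    """Toy file-like reader that fetches fixed-size blocks on demand.

    Hypothetical illustration only; not how fsspec implements it.
    """

    def __init__(self, fetch_range, size, blocksize=8):
        self._fetch_range = fetch_range  # callable(start, end) -> bytes
        self._size = size
        self._blocksize = blocksize
        self._blocks = {}  # block index -> cached bytes
        self._pos = 0

    def seek(self, offset):
        # Clamp the position to the valid range of the "file".
        self._pos = max(0, min(offset, self._size))
        return self._pos

    def read(self, n):
        end = min(self._pos + n, self._size)
        out = bytearray()
        while self._pos < end:
            idx = self._pos // self._blocksize
            if idx not in self._blocks:
                # Cache miss: fetch just this one block from the source.
                start = idx * self._blocksize
                stop = min(start + self._blocksize, self._size)
                self._blocks[idx] = self._fetch_range(start, stop)
            offset = self._pos - idx * self._blocksize
            take = min(end - self._pos, self._blocksize - offset)
            out += self._blocks[idx][offset:offset + take]
            self._pos += take
        return bytes(out)


remote = bytes(range(32))  # stand-in for a remote object
fetched = []               # records which byte ranges were "downloaded"


def fetch_range(start, end):
    fetched.append((start, end))
    return remote[start:end]


reader = BlockCachedReader(fetch_range, size=len(remote), blocksize=8)
reader.seek(10)
chunk = reader.read(4)  # only the block covering bytes 8..15 is fetched
```

A library that insists on a local path never calls `read` on an object like this, so there is nowhere to hook in the lazy fetching.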

martindurant, Aug 04 '20 14:08