
OSError: [Errno 24] Too many open files

Open Eisbrenner opened this issue 5 years ago • 9 comments

Using Intake (intake-xarray) on a larger number of files, e.g. daily data for one or two decades, results in a too-many-files error.

Loading the same set of data with xarray.open_mfdataset works just fine.

Versions:

python 3.8.2
xarray 0.16.1
intake 0.6.0
intake-xarray 0.4.0

For me, the total number of files was: 9464

The Intake catalog I have looks something like this:

metadata:
  version: 1

plugins:
  source:
    - module: intake_xarray

sources:
  daily_mean:
    driver: netcdf
    args:
      urlpath: "{{ env(HOME) }}/path/to/data*.nc"
    xarray_kwargs:
      combine: by_coords
      parallel: True

then, using

intake.open_catalog(path_to_catalog)["daily_mean"].to_dask().chunk({"time": -1, "longitude": 10, "latitude":10})

throws an error of the form of

OSError: [Errno 24] Too many open files: 'path/to/catalog.yml'

loading the same data with

xr.open_mfdataset("~/path/to/data*.nc", combine="by_coords", parallel=True).chunk({"time": -1, "longitude": 10, "latitude":10})

works just fine.

Eisbrenner avatar Dec 02 '20 09:12 Eisbrenner

I don't immediately know the reason, but if this is running on macOS, the open-files limit is pretty low by default. You can do something like ulimit -n 40096.

martindurant avatar Dec 02 '20 16:12 martindurant
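The same limit can also be inspected and raised from within Python via the stdlib resource module (Unix only; a sketch — the 40096 value just mirrors the ulimit suggestion above, and an unprivileged process can only raise the soft limit up to the hard limit):

```python
import resource

# Current limits: (soft, hard); Errno 24 fires once the soft limit is hit.
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print(f"soft={soft} hard={hard}")

# Raise the soft limit for this process only (no shell/ulimit needed),
# capped at the hard limit.
target = 40096 if hard == resource.RLIM_INFINITY else min(hard, 40096)
resource.setrlimit(resource.RLIMIT_NOFILE, (target, hard))
```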

@Eisbrenner - could you please try running the same code on intake-xarray master? pip install git+https://github.com/intake/intake-xarray.git@master

scottyhq avatar Dec 02 '20 17:12 scottyhq

@martindurant I've seen the workaround via an increased ulimit; however, I think this behavior was changed directly in xarray and the limit should not be breached anymore. I thought it would be beneficial for intake to be aware of this too, so I figured I'd share it regardless.

@scottyhq I'll check the master branch tomorrow!

Eisbrenner avatar Dec 02 '20 23:12 Eisbrenner

With intake-xarray 0.4.0+23.g2f4bfb3 I still get the same error. Also, I might add that my ulimit is in fact below the file count.

Eisbrenner avatar Dec 03 '20 07:12 Eisbrenner
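The "ulimit below the file count" observation can be made concrete with a small check comparing the glob match count to the soft limit (a sketch using the placeholder pattern from the issue; `open_file_headroom` is a hypothetical helper, and other descriptors such as stdio, sockets, and shared libraries also count against the limit):

```python
import glob
import resource

def open_file_headroom(pattern):
    """Rough check: number of files matching *pattern* vs. the soft
    open-file limit. Treat this as an estimate, not an exact predictor,
    since the process already holds other descriptors."""
    n_files = len(glob.glob(pattern))
    soft, _hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    return n_files, soft

# With the placeholder pattern from the issue:
n, limit = open_file_headroom("/path/to/data*.nc")
print(f"{n} matching files vs soft limit {limit}")
```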

Is the error happening during a compute(), or while creating the xarray object?

martindurant avatar Dec 03 '20 18:12 martindurant

While creating the object; this (below) is all I'm doing. Maybe there is something in these few lines of code whose effect I'm not aware of. I'm still trying to get my head around some of this, for example Dask in general.

  1. the xarray.open_mfdataset variant
data = (
    xr.open_mfdataset(
        "/path/to/data/metoffice_foam1_amm7_NWS_SAL_dm*.nc",
        parallel=True,
        combine="by_coords",
    )
    .chunk({"time": -1, "longitude": -1, "latitude": -1})
    .rename({"so": "salinity"})
).sel(depth=0)
  2. the intake.open_catalog variant; note, the catalog is as shown in the initial post above.
data = (
    intake.open_catalog("/path/to/catalog/copernicus-reanalysis.yml")["daily_mean"]
    .to_dask()
    .chunk({"time": -1, "longitude": -1, "latitude": -1})
    .rename({"so": "salinity"})
).sel(depth=0)

The error occurs with the second (intake) variant.

Eisbrenner avatar Dec 04 '20 15:12 Eisbrenner

Can you please compare intake.open_catalog(path_to_catalog)["daily_mean"].to_dask() versus xr.open_mfdataset(...)? I assume they are not identical, if the first one completes at all.

martindurant avatar Dec 04 '20 15:12 martindurant

data = intake.open_catalog(path_to_catalog)["daily_mean"].to_dask()
# [...]
# [...].venv/lib/python3.8/site-packages/fsspec/implementations/local.py in _open(self)
# OSError: [Errno 24] Too many open files: '[...]'
data = xr.open_mfdataset(path_to_data, parallel=True, combine="by_coords")
type(data)
# xarray.core.dataset.Dataset

I'll quickly check what the output is with a small enough set of files.

EDIT:

data = intake.open_catalog(path_to_catalog)["test_daily_mean"].to_dask()
type(data)
# xarray.core.dataset.Dataset

here "test_daily_mean" is just a subset of the files, e.g. "/path/to/data/files_199*.nc" instead of "/path/to/data/files_*.nc"

Eisbrenner avatar Dec 04 '20 16:12 Eisbrenner

In NetCDFSource._open_dataset:

        if self._can_be_local:
            url = fsspec.open_local(self.urlpath, **self.storage_options)
        else:
            # https://github.com/intake/filesystem_spec/issues/476#issuecomment-732372918
            url = fsspec.open(self.urlpath, **self.storage_options).open()

and local files are held open. Perhaps it would make sense to explicitly check for URLs that are already local, pass them straight to xarray, and let it do the opening of things in that case.

martindurant avatar Dec 04 '20 16:12 martindurant
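A minimal version of that check could look like this (a sketch, not intake-xarray's actual code; `should_pass_through` is a hypothetical helper):

```python
def should_pass_through(urlpath: str) -> bool:
    """Heuristic: True when the URL has no protocol prefix (or uses
    'file://'), i.e. it is already a local path or glob that xarray can
    open itself, letting xarray's own file-handle management apply
    instead of fsspec holding every file open."""
    protocol, sep, _rest = urlpath.partition("://")
    return not sep or protocol == "file"

# Hypothetical use inside _open_dataset:
#     if should_pass_through(self.urlpath):
#         url = self.urlpath  # hand the path/glob straight to xarray
#     elif self._can_be_local:
#         url = fsspec.open_local(self.urlpath, **self.storage_options)
#     else:
#         url = fsspec.open(self.urlpath, **self.storage_options).open()
```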