OSError: [Errno 24] Too many open files
Using Intake (intake-xarray) on a large number of files, e.g. daily data for one or two decades, results in a too-many-open-files error.
Loading the same set of data with xarray.open_mfdataset works just fine.
Versions:
python 3.8.2
xarray 0.16.1
intake 0.6.0
intake-xarray 0.4.0
For me, the total number of files was: 9464
The Intake catalog I have looks something like this:
metadata:
  version: 1
plugins:
  source:
    - module: intake_xarray
sources:
  daily_mean:
    driver: netcdf
    args:
      urlpath: "{{ env(HOME) }}/path/to/data*.nc"
      xarray_kwargs:
        combine: by_coords
        parallel: True
Then, using
intake.open_catalog(path_to_catalog)["daily_mean"].to_dask().chunk({"time": -1, "longitude": 10, "latitude": 10})
throws an error of the form
OSError: [Errno 24] Too many open files: 'path/to/catalog.yml'
Loading the same data with
xr.open_mfdataset("~/path/to/data*.nc", combine="by_coords", parallel=True).chunk({"time": -1, "longitude": 10, "latitude":10})
works just fine.
I don't immediately know the reason, but if this is running on macOS, the open-files limit is pretty low by default. You can do something like ulimit -n 40096.
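For reference, the same limit can be inspected and raised from within a Python session too (a minimal sketch; the value 40096 just mirrors the ulimit example above):

import resource

# Query the current soft/hard limits on open file descriptors.
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print(soft, hard)

# Raise the soft limit for this process; raises ValueError if the
# requested value exceeds the hard limit.
resource.setrlimit(resource.RLIMIT_NOFILE, (40096, hard))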
@Eisbrenner - could you please try running the same code on intake-xarray master? pip install git+https://github.com/intake/intake-xarray.git@master
@martindurant I've seen the workaround of increasing ulimit; however, I think this behavior was changed directly in xarray, so the limit should not be breached anymore. I thought it would be beneficial for Intake to be aware of this too, so I'm sharing it regardless.
@scottyhq I'll check the master branch tomorrow!
With intake-xarray version 0.4.0+23.g2f4bfb3 I still get the same error. I might add that my ulimit is in fact below the file count.
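For what it's worth, the soft limit can be compared against the glob directly (a sketch; the pattern is a placeholder for the real data path):

import glob
import resource

# Soft limit on open file descriptors for this process.
soft, _ = resource.getrlimit(resource.RLIMIT_NOFILE)
# Number of files the catalog glob expands to (9464 in my case).
n_files = len(glob.glob("/path/to/data*.nc"))
print(f"soft limit: {soft}, matched files: {n_files}")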
Is the error happening during a compute(), or while creating the xarray object?
While creating the object; this (below) is all I'm doing. Maybe there is something in these few lines of code that does more than I'm aware of. I'm still trying to get my head around some of this, Dask in general for example.
- the xarray.open_mfdataset variant
data = (
    xr.open_mfdataset(
        "/path/to/data/metoffice_foam1_amm7_NWS_SAL_dm*.nc",
        parallel=True,
        combine="by_coords",
    )
    .chunk({"time": -1, "longitude": -1, "latitude": -1})
    .rename({"so": "salinity"})
).sel(depth=0)
- the intake.open_catalog variant; note that the catalog is the one shown in the initial post above.
data = (
    intake.open_catalog("/path/to/catalog/copernicus-reanalysis.yml")["daily_mean"]
    .to_dask()
    .chunk({"time": -1, "longitude": -1, "latitude": -1})
    .rename({"so": "salinity"})
).sel(depth=0)
The error occurs on this command.
Can you please compare intake.open_catalog(path_to_catalog)["daily_mean"].to_dask() versus xr.open_mfdataset(...)? I assume they are not identical, if the former completes at all.
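If both calls completed, one direct way to compare them would be xarray's testing helpers (a sketch; path_to_catalog and path_to_data are the placeholders used in this thread):

import intake
import xarray as xr

ds_intake = intake.open_catalog(path_to_catalog)["daily_mean"].to_dask()
ds_xr = xr.open_mfdataset(path_to_data, parallel=True, combine="by_coords")

# Raises an AssertionError if variables, coords, dims, or attrs differ.
xr.testing.assert_identical(ds_intake, ds_xr)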
data = intake.open_catalog(path_to_catalog)["daily_mean"].to_dask()
# [...]
# [...].venv/lib/python3.8/site-packages/fsspec/implementations/local.py in _open(self)
# OSError: [Errno 24] Too many open files: '[...]'
data = xr.open_mfdataset(path_to_data, parallel=True, combine="by_coords")
type(data)
# xarray.core.dataset.Dataset
I'll quickly check what the output is with a small enough set of files.
EDIT:
data = intake.open_catalog(path_to_catalog)["test_daily_mean"].to_dask()
type(data)
# xarray.core.dataset.Dataset
here "test_daily_mean" is just a subset of the files, e.g. "/path/to/data/files_199*.nc" instead of "/path/to/data/files_*.nc"
In NetCDFSource._open_dataset:
if self._can_be_local:
    url = fsspec.open_local(self.urlpath, **self.storage_options)
else:
    # https://github.com/intake/filesystem_spec/issues/476#issuecomment-732372918
    url = fsspec.open(self.urlpath, **self.storage_options).open()
and local files are held open. Perhaps it would make sense to explicitly check for URLs that are already local, pass them straight to xarray, and let it do the opening of things in that case.
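A minimal sketch of that idea (hypothetical, not an actual patch; the helper name resolve_urlpath is made up):

import fsspec
from fsspec.utils import infer_storage_options

def resolve_urlpath(urlpath, **storage_options):
    # Hypothetical helper: hand already-local paths straight to xarray
    # so it opens (and closes) the files itself.
    if infer_storage_options(urlpath)["protocol"] == "file":
        return urlpath
    # Remote URLs: keep the existing behavior of returning an open
    # file-like object via fsspec.
    return fsspec.open(urlpath, **storage_options).open()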