xarray 0.18.0 raises ValueError, not FileNotFoundError, when opening a non-existent file
What happened:
In a Python environment with xarray 0.18.0 and python-netcdf4 installed, I called xarray.open_dataset("nonexistent"). (The file "nonexistent" does not exist.) xarray threw a ValueError: cannot guess the engine, try passing one explicitly.
What you expected to happen:
I expected a FileNotFoundError error to be thrown, as in xarray 0.17.0.
Minimal Complete Verifiable Example:
import xarray as xr
xr.open_dataset("nonexistent")
Anything else we need to know?:
This is presumably related to Issue #5295, but is not fixed by PR #5296: ValueError is also thrown with the currently latest commit in master (9165c266).
This change in behaviour produced a hard-to-diagnose bug deep in xcube, where we were catching the FileNotFound exception to deal gracefully with a missing file, but the new ValueError was of course not caught -- and the error message did not make the cause obvious. Catching ValueError is a workaround, but not a great solution since it may also be thrown for files which do exist but don't have a recognizable data format. I suspect that other codebases may be similarly affected.
xarray 0.17.0 was capable of throwing a ValueError for a non-existent file, but only in the (rare?) case that neither netCDF4-python nor scipy was installed.
Environment:
Output of xr.show_versions()
INSTALLED VERSIONS ------------------ commit: None python: 3.9.4 | packaged by conda-forge | (default, May 10 2021, 22:13:33) [GCC 9.3.0] python-bits: 64 OS: Linux OS-release: 5.8.0-53-generic machine: x86_64 processor: x86_64 byteorder: little LC_ALL: None LANG: en_GB.UTF-8 LOCALE: en_GB.UTF-8 libhdf5: 1.10.6 libnetcdf: 4.8.0xarray: 0.18.0 pandas: 1.2.4 numpy: 1.20.2 scipy: None netCDF4: 1.5.6 pydap: None h5netcdf: None h5py: None Nio: None zarr: None cftime: 1.4.1 nc_time_axis: None PseudoNetCDF: None rasterio: None cfgrib: None iris: None bottleneck: None dask: None distributed: None matplotlib: None cartopy: None seaborn: None numbagg: None pint: None setuptools: 49.6.0.post20210108 pip: 21.1.1 conda: None pytest: None IPython: None sphinx: None
Thanks for the report - I added that to the list of todos for the 0.18.1 patch release.
I think that It's not a bug: filename_or_obj in open_dataset can be a file, file-like, bytes, URL. The accepted inputs depend on the engine. So it doesn't make sense to raise a FileNotFoundError if the engine is not defined by the user or not automatically detected by xarray.
@aurghs Point taken, but in that case the xarray.open_dataset documentation probably needs updating: currently it says ‘Strings and Path objects are interpreted as a path to a netCDF file or an OpenDAP URL and opened with python-netCDF4, unless the filename ends with .gz…’. In particular I find the behaviour with Path confusing: this type is unambiguously a filesystem path, but still results in a ValueError rather than FileNotFoundError. And in Python 3, str and bytes are distinct (though I suppose a str might still be a URL rather than a file).
In principle, though, I'd be OK with any selection of potential exceptions, as long as they're documented. The current practical problem for us (the xcube developers) is that the behaviour change from 0.17 to 0.18 means that we now have to audit our existing codebases for potential uncaught ValueErrors.
Ouch! I agree this is annoying. And yes, the open_dataset docs definitely need updating.
The general rule is that ValueError is too generic of an exception for it to be appropriate to catch in user code. So indeed, if you need to catch that error something has gone wrong :).
Given this use case, perhaps we should have a special exception type here, something like xarray.NoMatchingEngineError.
Actually, thinking about this a little more, you should get NoMatchingEngineError if you try to open a text file with Xarray.
But if you pass in a local filesystem path and try to open an entirely non-existent file, then FileNotFoundError would be appropriate.
In theory, one could write an open_dataset backend that interprets a local path in an arbitrary way, but that is not the intent of open_dataset, which is intended to handle things like files/directories, other URIs and file contents. If you want to interpret strings very differently you should write a new function for opening.
@shoyer and @pont-us my take is that engine=None is mostly and feature to help new users experiment with xarray. The logic for autodetecting the engine has always been extremely fragile and the result depends on what packages are installed on the system.
I think it is OK to add a dedicated xarray.EngineGuessingError, but I'd not try anything more complex and in general invite users to pass engine explicitly.
Examples of strings that identify a dataset when the right backend is installed, but are not paths and the FileNotFoundError would sound confusing:
-
/vsizip/my.zip/my.tifwithrioxarray -
https://myhost/myfile.ncwithnetcdf4and others
I think it is OK to add a dedicated
xarray.EngineGuessingError, but I'd not try anything more complex and in general invite users to passengineexplicitly.
I'm reluctantly inclined to agree. (But "reluctantly" only because this means that we may have a lot of xcube code to check and update now.) If the engine is explicitly specified, the question of which exceptions to expect becomes a lot easier to handle, since it's now just between the caller and the engine.
Probably the open_dataset documentation should actively discourage reliance on engine guessing for anything but interactive use, and point out that callers need to check the documentation for the selected engine to determine which exceptions (other than EngineGuessingError) might be thrown.
I just accidentally tried to open myfile.h5.h5 (I meant to open myfile.h5) and was confused about the ValueError.
- Should
.h5be added to the list of extensions that map to the NetCDF engine? - Could the error message for the
ValueError(or any of the alternate exception classes discussed above) include text suggesting that an incorrect file path (or other URL?) could be the culprit?