Interpolation behaviour inconsistent with numpy?
Hey all,
Running dataset.interp(time=dataset.time) fills a value with np.nan if one of its neighbours is np.nan, even when interpolation is not actually needed.
Here is some sample code to reproduce the issue:
import numpy as np
import xarray as xr

def test_crop_times_nan():
    ds = xr.Dataset(
        data_vars={
            "some_variable": (["x", "time"], np.array([[np.nan, 0, 1]]))
        },
        coords={
            "time": np.array([0, 1, 2])
        },
    )
    result = ds.interp(time=ds.time)
    # result["some_variable"].values == [[nan, nan, 1.0]]
    # whereas [[nan, 0, 1.0]] is EXPECTED
    xr.testing.assert_allclose(ds, result)
Please note that numpy does not have the same behavior:
>>> import numpy as np
>>> np.interp([0,1,2], xp=[0,1,2], fp=[np.nan,0,1])
array([nan, 0., 1.])
Is this intended behaviour for xarray?
If so, does this mean that I first have to check whether interpolation is needed before doing it (and use reindex instead of interp when it is not needed)?
(This will be kind of tricky if interpolation is needed for some values and not for others...)
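To make it concrete, here is an untested sketch of the split I have in mind, reusing ds from the snippet above (all names are just illustrative):
# Hypothetical sketch: select the targets already on the grid and only
# interpolate the remaining ones, then stitch the parts back together.
targets = np.array([0, 0.5, 2])
exact = np.isin(targets, ds.time.values)      # True where no interpolation is needed
matched = ds.sel(time=targets[exact])         # exact matches: plain selection
interped = ds.interp(time=targets[~exact])    # the rest: genuine interpolation
combined = xr.concat([matched, interped], dim="time").sortby("time")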
Thanks for your help ;)
Environment:
Output of xr.show_versions()
INSTALLED VERSIONS
------------------
commit: None
python: 3.8.5 (default, Jul 28 2020, 12:59:40) [GCC 9.3.0]
python-bits: 64
OS: Linux
OS-release: 5.8.0-7642-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: ('en_US', 'UTF-8')
libhdf5: 1.12.0
libnetcdf: 4.7.4

xarray: 0.18.2
pandas: 1.2.4
numpy: 1.19.4
scipy: 1.6.0
netCDF4: 1.5.6
pydap: None
h5netcdf: 0.8.1
h5py: 3.1.0
Nio: None
zarr: None
cftime: 1.3.0
nc_time_axis: None
PseudoNetCDF: None
rasterio: None
cfgrib: None
iris: None
bottleneck: None
dask: 2021.01.0
distributed: 2021.01.0
matplotlib: 3.4.2
cartopy: None
seaborn: None
numbagg: None
pint: None
setuptools: 57.4.0
pip: 20.2.4
conda: None
pytest: None
IPython: 7.19.0
sphinx: None
Thanks @mathisc. We use scipy by default, and that's what it does:
In [6]: from scipy.interpolate import interp1d
In [7]: f = interp1d([0, 1, 2], [np.nan, 0, 1])
In [8]: f([0, 1, 2])
Out[8]: array([nan, nan, 1.])
Can you file an issue over at scipy?
EDIT: I guess we could skip interpolating when output coordinate values are a subset of the input coordinate values.
There does not seem to be a way to force the numpy interpolator in the 1D case:
https://github.com/pydata/xarray/blob/2f8623d4d9929b61ff4ebfd5232c12474420dfae/xarray/core/missing.py#L472
(that could be a workaround)
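In the meantime, a user-side workaround could look something like this rough sketch (assuming the requested values match existing coordinate values exactly; interp_or_reindex is just an illustrative name):
import numpy as np

def interp_or_reindex(ds, **indexers):
    # Hypothetical helper: if every requested coordinate value already
    # exists on the corresponding dimension, reindex instead of
    # interpolating, so scipy's nan propagation never kicks in.
    if all(np.isin(np.asarray(vals), ds[dim].values).all()
           for dim, vals in indexers.items()):
        return ds.reindex(indexers)
    return ds.interp(indexers)
Note that this only skips interpolation when all requested values already exist; mixed cases still go through interp.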
Thanks, here is the ticket on SciPy's bug tracker: https://github.com/scipy/scipy/issues/14531
FWIW, this is what I ended up doing (which does the work of both cases, whether interpolation is needed or not):
interpolated_ds = ds.interp(time=ds.time)    # nan wherever a neighbour is nan
reindexed_ds = ds.reindex(time=ds.time)      # exact matches only, no interpolation
ds = interpolated_ds.fillna(reindexed_ds)    # take the reindexed value where interp gave nan
It is probably not the most efficient in terms of operations, but it gets the job done.
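As a quick sanity check on the toy dataset from the top of the thread (same ds as in the reproducer):
# Combine interp and reindex, then confirm the expected values come back.
result = ds.interp(time=ds.time).fillna(ds.reindex(time=ds.time))
print(result["some_variable"].values)  # [[nan,  0.,  1.]]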
Thanks for your feedback ;)
Interestingly, this was fixed upstream according to the SciPy issue linked above, but the behavior persists in Xarray:
ds.interp(time=ds.time)
# array([[nan, nan, 1.]])  (unexpected)
ds.interpolate_na(dim="time")
# array([[nan, 0., 1.]])
from scipy.interpolate import interp1d
f = interp1d([0, 1, 2], [np.nan, 0, 1])
f([0, 1, 2])
# array([nan, 0., 1.])
xarray: 2024.2.0
pandas: 2.2.1
numpy: 1.26.3
scipy: 1.12.0