Using `NearestNeighbors` with `p < 1` and floats raises an error
Describe the bug
Using NearestNeighbors with p < 1 raises an error if the array X contains floats. It does not seem to raise errors if X consists of integers.
This was originally discussed in https://github.com/scikit-learn/scikit-learn/discussions/26536
For example, this is fine:
from sklearn.neighbors import NearestNeighbors
import numpy as np
X = np.array([[1,0], [0,0], [0,1]])
neigh = NearestNeighbors(algorithm='brute',metric_params={'p':0.5})
neigh.fit(X)
neigh.radius_neighbors(X[0].reshape(1,-1), radius=4, return_distance=False)
I would expect this behavior whether X consists of floats or integers.
Steps/Code to Reproduce
from sklearn.neighbors import NearestNeighbors
import numpy as np
X = np.array([[1.0,0.0], [0.0,0.0], [0.0,1.0]])
neigh = NearestNeighbors(algorithm='brute',metric_params={'p':0.5})
neigh.fit(X)
neigh.radius_neighbors(X[0].reshape(1,-1), radius=4, return_distance=False)
Expected Results
array([array([0, 1, 2])], dtype=object)
Actual Results
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
Cell In[40], line 6
4 neigh = NearestNeighbors(algorithm='brute',metric_params={'p':0.5})
5 neigh.fit(X)
----> 6 neigh.radius_neighbors(X[0].reshape(1,-1), radius=4, return_distance=False)
File ~/miniconda3/envs/dimmer_env/lib/python3.9/site-packages/sklearn/neighbors/_base.py:1161, in RadiusNeighborsMixin.radius_neighbors(self, X, radius, return_distance, sort_results)
1153 use_pairwise_distances_reductions = (
1154 self._fit_method == "brute"
1155 and RadiusNeighbors.is_usable_for(
1156 X if X is not None else self._fit_X, self._fit_X, self.effective_metric_
1157 )
1158 )
1160 if use_pairwise_distances_reductions:
-> 1161 results = RadiusNeighbors.compute(
1162 X=X,
1163 Y=self._fit_X,
1164 radius=radius,
1165 metric=self.effective_metric_,
1166 metric_kwargs=self.effective_metric_params_,
1167 strategy="auto",
1168 return_distance=return_distance,
1169 sort_results=sort_results,
1170 )
1172 elif (
1173 self._fit_method == "brute" and self.metric == "precomputed" and issparse(X)
1174 ):
1175 results = _radius_neighbors_from_graph(
1176 X, radius=radius, return_distance=return_distance
1177 )
File ~/miniconda3/envs/dimmer_env/lib/python3.9/site-packages/sklearn/metrics/_pairwise_distances_reduction/_dispatcher.py:421, in RadiusNeighbors.compute(cls, X, Y, radius, metric, chunk_size, metric_kwargs, strategy, return_distance, sort_results)
335 """Return the results of the reduction for the given arguments.
336
337 Parameters
(...)
418 returns.
419 """
420 if X.dtype == Y.dtype == np.float64:
--> 421 return RadiusNeighbors64.compute(
422 X=X,
423 Y=Y,
424 radius=radius,
425 metric=metric,
426 chunk_size=chunk_size,
427 metric_kwargs=metric_kwargs,
428 strategy=strategy,
429 sort_results=sort_results,
430 return_distance=return_distance,
431 )
433 if X.dtype == Y.dtype == np.float32:
434 return RadiusNeighbors32.compute(
435 X=X,
436 Y=Y,
(...)
443 return_distance=return_distance,
444 )
File sklearn/metrics/_pairwise_distances_reduction/_radius_neighbors.pyx:110, in sklearn.metrics._pairwise_distances_reduction._radius_neighbors.RadiusNeighbors64.compute()
File sklearn/metrics/_pairwise_distances_reduction/_datasets_pair.pyx:87, in sklearn.metrics._pairwise_distances_reduction._datasets_pair.DatasetsPair64.get_for()
File sklearn/metrics/_dist_metrics.pyx:285, in sklearn.metrics._dist_metrics.DistanceMetric.get_metric()
File sklearn/metrics/_dist_metrics.pyx:1252, in sklearn.metrics._dist_metrics.MinkowskiDistance.__init__()
ValueError: p must be greater than 1
Versions
System:
python: 3.9.16 (main, Mar 8 2023, 14:00:05) [GCC 11.2.0]
executable: /home/XYX/miniconda3/envs/dimmer_env/bin/python
machine: Linux-5.15.0-69-generic-x86_64-with-glibc2.31
Python dependencies:
sklearn: 1.2.2
pip: 23.0.1
setuptools: 67.8.0
numpy: 1.24.3
scipy: 1.10.1
Cython: None
pandas: 2.0.1
matplotlib: 3.7.1
joblib: 1.2.0
threadpoolctl: 3.1.0
Built with OpenMP: True
threadpoolctl info:
user_api: blas
internal_api: openblas
prefix: libopenblas
filepath: /home/XYX/miniconda3/envs/dimmer_env/lib/python3.9/site-packages/numpy.libs/libopenblas64_p-r0-15028c96.3.21.so
version: 0.3.21
threading_layer: pthreads
architecture: Zen
num_threads: 48
user_api: blas
internal_api: openblas
prefix: libopenblas
filepath: /home/XYX/miniconda3/envs/dimmer_env/lib/python3.9/site-packages/scipy.libs/libopenblasp-r0-41284840.3.18.so
version: 0.3.18
threading_layer: pthreads
architecture: Zen
num_threads: 48
user_api: openmp
internal_api: openmp
prefix: libgomp
filepath: /home/XYX/miniconda3/envs/dimmer_env/lib/python3.9/site-packages/scikit_learn.libs/libgomp-a34b3233.so.1.0.0
version: None
num_threads: 48
This issue arises due to the different underlying computations when 'X' is of integer or float type.
It looks like, when X is integer type pairwise_distances_chunked is used for computation and when X is[np.float32, np.float64]pairwise_distances_reductions are used which gets the MinkowskiDistance64 metric from cython module. MinkowskiDistance64 class raises value error when P < 1 when it is constructed.
I think should allow p < 1 for all ways to compute the Minkowski-based neighbors as long as we use an exhaustive search method (such as algorithm="brute") that does not rely on the triangular equality.
Feel free to open a PR.
Out of curiosity, what is your use case for this combination of hyper-parameters?
@ogrisel I was trying to explore what neighbors look like for high-dimensional datasets, and modifying the metric for that computation. I noticed that p<1 was implemented recently in https://github.com/scikit-learn/scikit-learn/pull/24750 (understanding that these aren't actually metrics), which is why I chose algorithm="brute".
I'm happy to open or contribute to a PR on this issue.
@catanzaromj Are you working on this? If not, I will open PR.
@Shreesha3112 I am not but I just realized it may be getting fixed on #26568
@catanzaromj I think https://github.com/scikit-learn/scikit-learn/pull/26568 deals specifically with parameter validation for NearestNeighbors. This issue still requires a PR from what I can see.
P should be greater than for Minkowski distance metric and also you can use value p = 1 for Manhattan distance and p = 2 for Euclidean Distance and here is the output when I've used p = 1.2 and the result is as per your expectation.