scikit-learn icon indicating copy to clipboard operation
scikit-learn copied to clipboard

Using `NearestNeighbors` with `p < 1` and floats raises an error

Open catanzaromj opened this issue 2 years ago • 5 comments

Describe the bug

Using NearestNeighbors with p < 1 raises an error if the array X contains floats. It does not seem to raise errors if X consists of integers.

This was originally discussed in https://github.com/scikit-learn/scikit-learn/discussions/26536

For example, this is fine:

from sklearn.neighbors import NearestNeighbors
import numpy as np
X = np.array([[1,0], [0,0], [0,1]])
neigh = NearestNeighbors(algorithm='brute',metric_params={'p':0.5})
neigh.fit(X)
neigh.radius_neighbors(X[0].reshape(1,-1), radius=4, return_distance=False)

I would expect this behavior whether X consists of floats or integers.

Steps/Code to Reproduce

from sklearn.neighbors import NearestNeighbors
import numpy as np
X = np.array([[1.0,0.0], [0.0,0.0], [0.0,1.0]])
neigh = NearestNeighbors(algorithm='brute',metric_params={'p':0.5})
neigh.fit(X)
neigh.radius_neighbors(X[0].reshape(1,-1), radius=4, return_distance=False)

Expected Results

array([array([0, 1, 2])], dtype=object)

Actual Results

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[40], line 6
      4 neigh = NearestNeighbors(algorithm='brute',metric_params={'p':0.5})
      5 neigh.fit(X)
----> 6 neigh.radius_neighbors(X[0].reshape(1,-1), radius=4, return_distance=False)

File ~/miniconda3/envs/dimmer_env/lib/python3.9/site-packages/sklearn/neighbors/_base.py:1161, in RadiusNeighborsMixin.radius_neighbors(self, X, radius, return_distance, sort_results)
   1153 use_pairwise_distances_reductions = (
   1154     self._fit_method == "brute"
   1155     and RadiusNeighbors.is_usable_for(
   1156         X if X is not None else self._fit_X, self._fit_X, self.effective_metric_
   1157     )
   1158 )
   1160 if use_pairwise_distances_reductions:
-> 1161     results = RadiusNeighbors.compute(
   1162         X=X,
   1163         Y=self._fit_X,
   1164         radius=radius,
   1165         metric=self.effective_metric_,
   1166         metric_kwargs=self.effective_metric_params_,
   1167         strategy="auto",
   1168         return_distance=return_distance,
   1169         sort_results=sort_results,
   1170     )
   1172 elif (
   1173     self._fit_method == "brute" and self.metric == "precomputed" and issparse(X)
   1174 ):
   1175     results = _radius_neighbors_from_graph(
   1176         X, radius=radius, return_distance=return_distance
   1177     )

File ~/miniconda3/envs/dimmer_env/lib/python3.9/site-packages/sklearn/metrics/_pairwise_distances_reduction/_dispatcher.py:421, in RadiusNeighbors.compute(cls, X, Y, radius, metric, chunk_size, metric_kwargs, strategy, return_distance, sort_results)
    335 """Return the results of the reduction for the given arguments.
    336 
    337 Parameters
   (...)
    418 returns.
    419 """
    420 if X.dtype == Y.dtype == np.float64:
--> 421     return RadiusNeighbors64.compute(
    422         X=X,
    423         Y=Y,
    424         radius=radius,
    425         metric=metric,
    426         chunk_size=chunk_size,
    427         metric_kwargs=metric_kwargs,
    428         strategy=strategy,
    429         sort_results=sort_results,
    430         return_distance=return_distance,
    431     )
    433 if X.dtype == Y.dtype == np.float32:
    434     return RadiusNeighbors32.compute(
    435         X=X,
    436         Y=Y,
   (...)
    443         return_distance=return_distance,
    444     )

File sklearn/metrics/_pairwise_distances_reduction/_radius_neighbors.pyx:110, in sklearn.metrics._pairwise_distances_reduction._radius_neighbors.RadiusNeighbors64.compute()

File sklearn/metrics/_pairwise_distances_reduction/_datasets_pair.pyx:87, in sklearn.metrics._pairwise_distances_reduction._datasets_pair.DatasetsPair64.get_for()

File sklearn/metrics/_dist_metrics.pyx:285, in sklearn.metrics._dist_metrics.DistanceMetric.get_metric()

File sklearn/metrics/_dist_metrics.pyx:1252, in sklearn.metrics._dist_metrics.MinkowskiDistance.__init__()

ValueError: p must be greater than 1

Versions

System:
    python: 3.9.16 (main, Mar  8 2023, 14:00:05)  [GCC 11.2.0]
executable: /home/XYX/miniconda3/envs/dimmer_env/bin/python
   machine: Linux-5.15.0-69-generic-x86_64-with-glibc2.31

Python dependencies:
      sklearn: 1.2.2
          pip: 23.0.1
   setuptools: 67.8.0
        numpy: 1.24.3
        scipy: 1.10.1
       Cython: None
       pandas: 2.0.1
   matplotlib: 3.7.1
       joblib: 1.2.0
threadpoolctl: 3.1.0

Built with OpenMP: True

threadpoolctl info:
       user_api: blas
   internal_api: openblas
         prefix: libopenblas
       filepath: /home/XYX/miniconda3/envs/dimmer_env/lib/python3.9/site-packages/numpy.libs/libopenblas64_p-r0-15028c96.3.21.so
        version: 0.3.21
threading_layer: pthreads
   architecture: Zen
    num_threads: 48

       user_api: blas
   internal_api: openblas
         prefix: libopenblas
       filepath: /home/XYX/miniconda3/envs/dimmer_env/lib/python3.9/site-packages/scipy.libs/libopenblasp-r0-41284840.3.18.so
        version: 0.3.18
threading_layer: pthreads
   architecture: Zen
    num_threads: 48

       user_api: openmp
   internal_api: openmp
         prefix: libgomp
       filepath: /home/XYX/miniconda3/envs/dimmer_env/lib/python3.9/site-packages/scikit_learn.libs/libgomp-a34b3233.so.1.0.0
        version: None
    num_threads: 48

catanzaromj avatar Jun 08 '23 17:06 catanzaromj

This issue arises due to the different underlying computations when 'X' is of integer or float type. It looks like, when X is integer type pairwise_distances_chunked is used for computation and when X is[np.float32, np.float64]pairwise_distances_reductions are used which gets the MinkowskiDistance64 metric from cython module. MinkowskiDistance64 class raises value error when P < 1 when it is constructed.

Shreesha3112 avatar Jun 14 '23 13:06 Shreesha3112

I think should allow p < 1 for all ways to compute the Minkowski-based neighbors as long as we use an exhaustive search method (such as algorithm="brute") that does not rely on the triangular equality.

Feel free to open a PR.

ogrisel avatar Jun 16 '23 15:06 ogrisel

Out of curiosity, what is your use case for this combination of hyper-parameters?

ogrisel avatar Jun 16 '23 15:06 ogrisel

@ogrisel I was trying to explore what neighbors look like for high-dimensional datasets, and modifying the metric for that computation. I noticed that p<1 was implemented recently in https://github.com/scikit-learn/scikit-learn/pull/24750 (understanding that these aren't actually metrics), which is why I chose algorithm="brute".

I'm happy to open or contribute to a PR on this issue.

catanzaromj avatar Jun 16 '23 22:06 catanzaromj

@catanzaromj Are you working on this? If not, I will open PR.

Shreesha3112 avatar Jun 22 '23 06:06 Shreesha3112

@Shreesha3112 I am not but I just realized it may be getting fixed on #26568

catanzaromj avatar Jun 30 '23 14:06 catanzaromj

@catanzaromj I think https://github.com/scikit-learn/scikit-learn/pull/26568 deals specifically with parameter validation for NearestNeighbors. This issue still requires a PR from what I can see.

Shreesha3112 avatar Jul 04 '23 07:07 Shreesha3112

P should be greater than for Minkowski distance metric and also you can use value p = 1 for Manhattan distance and p = 2 for Euclidean Distance and here is the output when I've used p = 1.2 and the result is as per your expectation.

Github

ChandraPrakash-Bathula avatar Jul 06 '23 18:07 ChandraPrakash-Bathula