from polyfuzz import PolyFuzz
I was able to run some of your basic example code with PolyFuzz, but once I started experimenting with Embeddings or BERT, the entire package broke. It seems to be caused by incompatible numpy versions. Currently, if I do a basic
pip install polyfuzz
followed by
from polyfuzz import PolyFuzz
I get the following error.
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
Input In [63], in <cell line: 1>()
----> 1 from polyfuzz import PolyFuzz
File /opt/conda/envs/vespid/lib/python3.9/site-packages/polyfuzz/__init__.py:1, in <module>
----> 1 from .polyfuzz import PolyFuzz
2 __version__ = "0.3.2"
File /opt/conda/envs/vespid/lib/python3.9/site-packages/polyfuzz/polyfuzz.py:7, in <module>
5 from polyfuzz.linkage import single_linkage
6 from polyfuzz.utils import check_matches, check_grouped, create_logger
----> 7 from polyfuzz.models import TFIDF, RapidFuzz, Embeddings, BaseMatcher
8 from polyfuzz.metrics import precision_recall_curve, visualize_precision_recall
10 logger = create_logger()
File /opt/conda/envs/vespid/lib/python3.9/site-packages/polyfuzz/models/__init__.py:4, in <module>
2 from ._distance import EditDistance
3 from ._rapidfuzz import RapidFuzz
----> 4 from ._tfidf import TFIDF
5 from ._utils import cosine_similarity
7 from polyfuzz.error import NotInstalled
File /opt/conda/envs/vespid/lib/python3.9/site-packages/polyfuzz/models/_tfidf.py:7, in <module>
4 from typing import List, Tuple
5 from sklearn.feature_extraction.text import TfidfVectorizer
----> 7 from ._utils import cosine_similarity
8 from ._base import BaseMatcher
11 class TFIDF(BaseMatcher):
File /opt/conda/envs/vespid/lib/python3.9/site-packages/polyfuzz/models/_utils.py:9, in <module>
6 from sklearn.metrics.pairwise import cosine_similarity as scikit_cosine_similarity
8 try:
----> 9 from sparse_dot_topn import awesome_cossim_topn
10 _HAVE_SPARSE_DOT = True
11 except ImportError:
File /opt/conda/envs/vespid/lib/python3.9/site-packages/sparse_dot_topn/__init__.py:5, in <module>
2 import sys
4 if sys.version_info[0] >= 3:
----> 5 from sparse_dot_topn.awesome_cossim_topn import awesome_cossim_topn
6 else:
7 from awesome_cossim_topn import awesome_cossim_topn
File /opt/conda/envs/vespid/lib/python3.9/site-packages/sparse_dot_topn/awesome_cossim_topn.py:7, in <module>
4 from scipy.sparse import isspmatrix_csr
6 if sys.version_info[0] >= 3:
----> 7 from sparse_dot_topn import sparse_dot_topn as ct
8 from sparse_dot_topn import sparse_dot_topn_threaded as ct_thread
9 else:
File /opt/conda/envs/vespid/lib/python3.9/site-packages/sparse_dot_topn/sparse_dot_topn.pyx:1, in init sparse_dot_topn.sparse_dot_topn()
ValueError: numpy.ndarray size changed, may indicate binary incompatibility. Expected 96 from C header, got 88 from PyObject
Following some StackOverflow posts, I tried installing different versions of numpy, but in the end something is always unhappy, and I can no longer use PolyFuzz no matter what I do. It would be great if it worked with the latest version of numpy, or if at least one version were known to work reliably! Thanks for looking into this.
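For anyone debugging the same error, it can help to see which versions are actually installed before swapping numpy around. The snippet below is only a sketch: the package list is an assumption based on the modules that appear in the traceback plus hdbscan, and it reads installed metadata without importing anything, so it runs even when the imports themselves fail.

```python
from importlib.metadata import PackageNotFoundError, version

# Distribution names are an assumption based on the traceback above;
# adjust the list to match your own environment.
PACKAGES = ["numpy", "scipy", "scikit-learn", "sparse-dot-topn", "hdbscan", "polyfuzz"]


def installed_versions(packages=PACKAGES):
    """Return {distribution name: version string, or None if not installed}."""
    report = {}
    for name in packages:
        try:
            # Reads package metadata only; the package itself is not imported.
            report[name] = version(name)
        except PackageNotFoundError:
            report[name] = None
    return report


if __name__ == "__main__":
    for name, ver in installed_versions().items():
        print(f"{name:16} {ver or 'not installed'}")
```

Comparing this output before and after each reinstall shows whether numpy changed underneath a package that was compiled against a different version.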
I eventually got this working by reinstalling hdbscan! Very strange.
Glad to hear that it worked out! This used to be an issue with HDBSCAN versions <0.28.0, which were not built against oldest-supported-numpy and so could end up ABI-incompatible with the installed numpy. Keeping HDBSCAN up to date should prevent this in the future.
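Since the package that needs reinstalling is not always the one named in the traceback, one way to narrow it down is to try importing each compiled dependency separately and look for the same ValueError. This is a rough sketch, not part of PolyFuzz: `check_import` is a hypothetical helper, and the module names in the usage loop are assumptions taken from this thread.

```python
import importlib


def check_import(module_name):
    """Try to import a module and classify the outcome.

    Returns "ok", "import error", or "binary incompatibility".
    Any other exception is re-raised unchanged.
    """
    try:
        importlib.import_module(module_name)
    except ImportError:
        # Covers missing modules as well as imports that fail for other reasons.
        return "import error"
    except ValueError as exc:
        # The numpy ABI mismatch reported in this thread is raised as a
        # ValueError whose message contains "binary incompatibility".
        if "binary incompatibility" in str(exc):
            return "binary incompatibility"
        raise
    return "ok"


if __name__ == "__main__":
    # Module names are assumptions based on the packages mentioned in this thread.
    for name in ["numpy", "sparse_dot_topn", "hdbscan", "polyfuzz"]:
        print(f"{name:16} {check_import(name)}")
```

It is best run in a fresh interpreter, since a partially failed import can leave state behind. Any package reported as "binary incompatibility" is a candidate for a forced reinstall, for example `pip install --upgrade --force-reinstall --no-cache-dir hdbscan`.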