model.transform() throws an error when using cuML HDBSCAN with calculate_probabilities=True
Hi Maarten
Firstly, thank you for this amazing library. I'm generating topics on the 20 newsgroups data for testing, using cuML for UMAP and HDBSCAN. I set calculate_probabilities=True and ran fit_transform() on the data; it worked fine and gave good results. However, when I run transform() on new data it raises AttributeError: 'tuple' object has no attribute 'shape'. When I set calculate_probabilities=False, transform() works fine.
The libraries I am using are bertopic==0.15.0, cuml-cu11==23.4.1, cudf-cu11==23.4.1, and CUDA toolkit 11.8.
I am running on a virtual Ubuntu machine with a Tesla T4 GPU.
The code to reproduce this error:
from bertopic import BERTopic
from cuml.cluster import HDBSCAN
from cuml.manifold import UMAP
import pandas as pd
import numpy as np
from sklearn.datasets import fetch_20newsgroups
docs = fetch_20newsgroups(subset='all', remove=('headers', 'footers', 'quotes'))['data']
train = docs[:15000]
test = docs[15000:]
umap_model = UMAP(n_components=5, n_neighbors=10, min_dist=0.0)
hdbscan_model = HDBSCAN(min_samples=25, min_cluster_size=50, gen_min_span_tree=True, prediction_data=True)
topic_model = BERTopic(umap_model=umap_model, hdbscan_model=hdbscan_model, calculate_probabilities=True, verbose=True)
topics, probs = topic_model.fit_transform(train)
topics_test, probs_test = topic_model.transform(test)
This is the error I get when I run the code. Could you please guide me in resolving it?
Perhaps this if block might be able to use cuML's membership_vector function to align with the CPU hdbscan:
https://github.com/MaartenGr/BERTopic/blob/fca5a4f9df149609c7e3458d6b2c421194cea62c/bertopic/cluster/_utils.py#L47-L56
Or, it could perhaps be updated to reflect that approximate_predict returns a tuple of (labels, probabilities) (even if only the probabilities will be returned by the function).
https://github.com/MaartenGr/BERTopic/blob/fca5a4f9df149609c7e3458d6b2c421194cea62c/bertopic/cluster/_utils.py#L22-L23
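To illustrate the shape of the failure with a minimal stand-in (not cuML itself): approximate_predict returns a (labels, probabilities) tuple, so any call site that treats its return value as a single array hits exactly the reported AttributeError.

```python
import numpy as np

# Stand-in for cuML's approximate_predict: it returns a
# (labels, probabilities) tuple rather than a single array.
def approximate_predict_stub(model, embeddings):
    labels = np.zeros(len(embeddings), dtype=np.int64)
    probabilities = np.ones(len(embeddings), dtype=np.float32)
    return labels, probabilities

result = approximate_predict_stub(None, np.zeros((3, 5)))
try:
    result.shape  # what the buggy call site effectively does
except AttributeError as e:
    print(e)  # 'tuple' object has no attribute 'shape'

labels, probabilities = result  # correct: unpack the tuple
print(probabilities.shape)      # (3,)
```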
Ah, it seems indeed that the incorrect function is used there. I believe simply replacing:
from cuml.cluster.hdbscan.prediction import approximate_predict
probabilities = approximate_predict(model, embeddings)
with this should solve the issue:
from cuml.cluster.hdbscan.prediction import membership_vector
probabilities = membership_vector(model, embeddings)
I can fix this in an upcoming release. PRs are also greatly appreciated!
Thank you @MaartenGr, this change along with one more change solved the problem. Just replacing approximate_predict with membership_vector gave another error:
ValueError: batch_size should be in integer that is >= 0 and <= the number of prediction points
Looking into the membership_vector function in cuml/cluster/hdbscan/prediction.pyx, there is another parameter, batch_size, which defaults to 4096. The function is missing a check to reduce this value to the number of embeddings when fewer than 4096 points are passed. Adding that check in the function call itself solved the issue.
The final code that works for me is:
from cuml.cluster.hdbscan.prediction import membership_vector
probabilities = membership_vector(model, embeddings, batch_size=min(4096, len(embeddings)))
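The guard in that call can be expressed as a tiny helper (just an illustration of the min() clamp above; the name safe_batch_size is mine, not part of BERTopic or cuML):

```python
# Hedged sketch of the workaround's guard: batch_size must be <= the
# number of prediction points, so cap it at the input size.
def safe_batch_size(n_points, default=4096):
    return min(default, n_points)

print(safe_batch_size(100))    # 100  -> avoids the ValueError on small inputs
print(safe_batch_size(15000))  # 4096 -> keeps the memory-friendly default
```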
@slice-pranay Awesome, thanks for diving into this! If you want, it would be great if you create a PR for this. Otherwise, I can also add this in the coming weeks when I find some time. Either way, thanks for this!
Thanks for surfacing this issue. When used like this, the batch_size parameter shouldn't be necessary (and shouldn't have any effect). This parameter is designed for the scenario when there is a large amount of data and users may want to potentially slightly trade off performance and higher peak memory requirements (though the default batch size of 4096 is likely the right choice as it significantly reduces peak memory requirements with a very minor impact on performance). It should be doing this under the hood, like it is already for all_points_membership_vectors.
import cuml
X, y = cuml.make_blobs(n_samples=100, n_features=3)
clf = cuml.cluster.hdbscan.HDBSCAN(prediction_data=True).fit(X)
cuml.cluster.hdbscan.all_points_membership_vectors(clf)[:5]
array([[1.0000000e+00, 4.6776744e-40, 4.0108805e-40],
[4.9417980e-02, 5.5743980e-01, 7.2683059e-02],
[4.8842371e-02, 7.2603232e-01, 1.0369291e-01],
[7.5122565e-01, 5.8568917e-02, 5.3385083e-02],
[4.5487583e-02, 1.0042124e-01, 5.8100939e-01]], dtype=float32)
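The under-the-hood batching described above can be sketched in plain NumPy (an assumed illustration of the chunking logic, not cuML's actual implementation): slicing in steps of batch_size naturally produces a smaller final chunk, which is why callers shouldn't need to clamp batch_size themselves.

```python
import numpy as np

def batched_rows(points, batch_size=4096):
    # Slicing past the end of an array is safe in NumPy, so the last
    # chunk simply contains whatever rows remain.
    for start in range(0, len(points), batch_size):
        yield points[start:start + batch_size]

chunks = list(batched_rows(np.zeros((100, 3))))
print(len(chunks), chunks[0].shape)   # 1 (100, 3)

chunks = list(batched_rows(np.zeros((10000, 3))))
print(len(chunks), chunks[-1].shape)  # 3 (1808, 3)
```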
I've filed a cuML issue to track this bug. In the meantime, your suggested workaround makes sense!
For completeness, this membership_vector bug has now been fixed in cuML. It won't be available in the 23.06 stable release that is about to happen, but ~should be available in the 23.08 nightly packages in about 1 hour~ is now available in the 23.08 nightly packages.
Is this actually fixed in cuML 23.08? I installed cuML following the instructions at https://docs.rapids.ai/install, and from cuml import __version__ reports 23.08.00. Running the original poster's code example exactly as-is still produces the AttributeError: 'tuple' object has no attribute 'shape'. Is there something I am missing?
I'm facing the same issue with cuML 23.10.0 and BERTopic 0.16.0. Is there a workaround or fix available?
As of last week, cuML 24.04 is now available. I think it's probably fair to say that almost everyone using cuML with BERTopic is using a version that supports the membership_vector function.
If there's interest and bandwidth from the maintainers to provide reviews, I'm happy to open a PR that resolves this issue and the effectively equivalent https://github.com/MaartenGr/BERTopic/issues/1764 (essentially, an updated version of this PR).
cc @MaartenGr
@beckernick Thanks, that would be great! This has been open for way too long (which is definitely my fault!), so a PR that updates this to the membership_vector sounds good. I also intend to release a minor version of BERTopic soon with many fixes, so that would be a nice timing to have this included.
Sounds good!
Took a little longer than I'd anticipated to get hands on the keyboard, but I've opened a PR that resolves this issue.
The original example works with this PR:
from bertopic import BERTopic
from cuml.cluster import HDBSCAN
from cuml.manifold import UMAP
import pandas as pd
import numpy as np
from sklearn.datasets import fetch_20newsgroups
docs = fetch_20newsgroups(subset='all', remove=('headers', 'footers', 'quotes'))['data']
train = docs[:15000]
test = docs[15000:]
umap_model = UMAP(n_components=5, n_neighbors=10, min_dist=0.0)
hdbscan_model = HDBSCAN(min_samples=25, min_cluster_size=50, gen_min_span_tree=True, prediction_data=True)
topic_model = BERTopic(umap_model=umap_model, hdbscan_model=hdbscan_model, calculate_probabilities=True, verbose=True)
topics, probs = topic_model.fit_transform(train)
topics_test, probs_test = topic_model.transform(test)
pd.Series(topics_test).value_counts()
2024-04-30 23:29:26,528 - BERTopic - Embedding - Transforming documents to embeddings.
Batches: 100%|█████████████████████████████████████████████████████████| 469/469 [00:14<00:00, 31.43it/s]
2024-04-30 23:29:42,841 - BERTopic - Embedding - Completed ✓
2024-04-30 23:29:42,842 - BERTopic - Dimensionality - Fitting the dimensionality reduction algorithm
2024-04-30 23:29:43,006 - BERTopic - Dimensionality - Completed ✓
2024-04-30 23:29:43,008 - BERTopic - Cluster - Start clustering the reduced embeddings
2024-04-30 23:29:43,170 - BERTopic - Cluster - Completed ✓
2024-04-30 23:29:43,175 - BERTopic - Representation - Extracting topics from clusters using representation models.
2024-04-30 23:29:46,567 - BERTopic - Representation - Completed ✓
Batches: 100%|█████████████████████████████████████████████████████████| 121/121 [00:03<00:00, 30.64it/s]
2024-04-30 23:29:51,410 - BERTopic - Dimensionality - Reducing dimensionality of input embeddings.
2024-04-30 23:29:51,431 - BERTopic - Dimensionality - Completed ✓
2024-04-30 23:29:51,432 - BERTopic - Clustering - Approximating new points with `hdbscan_model`
2024-04-30 23:29:51,439 - BERTopic - Probabilities - Start calculation of probabilities with HDBSCAN
2024-04-30 23:29:51,446 - BERTopic - Probabilities - Completed ✓
2024-04-30 23:29:51,447 - BERTopic - Cluster - Completed ✓
0 1176
-1 551
1 390
2 362
4 221
3 190
5 157
6 155
7 131
8 122
9 95
10 66
11 57
12 42
13 42
14 40
15 20
17 17
16 12
Name: count, dtype: int64
@beckernick
Took a little longer than I'd anticipated to get hands on the keyboard, but I've opened a PR that resolves this issue.
That is all too familiar these days! So thanks for taking the time to create the PR. When it passes, I'll go ahead and merge it in preparation for a minor release.