model.transform() throwing error when using cuml for HDBSCAN with calculate_probabilities=True

Open slice-pranay opened this issue 2 years ago • 12 comments

Hi Maarten

Firstly, thank you for this amazing library. I'm generating topics on the newsgroups data for testing, using cuML for UMAP and HDBSCAN. I set calculate_probabilities=True and ran fit_transform() on the data; it worked fine and gave good results. When I run transform() on new data, however, it raises AttributeError: 'tuple' object has no attribute 'shape'. With calculate_probabilities=False, transform() works fine.

The libraries I am using are bertopic==0.15.0, cuml-cu11==23.4.1, and cudf-cu11==23.4.1, with CUDA toolkit 11.8.

I am running on a virtual Ubuntu machine with a Tesla T4 GPU.

The code to reproduce this error:

from bertopic import BERTopic
from cuml.cluster import HDBSCAN
from cuml.manifold import UMAP
import pandas as pd
import numpy as np
from sklearn.datasets import fetch_20newsgroups
docs = fetch_20newsgroups(subset='all',  remove=('headers', 'footers', 'quotes'))['data']

train = docs[:15000]
test = docs[15000:]

umap_model = UMAP(n_components=5, n_neighbors=10, min_dist=0.0)
hdbscan_model = HDBSCAN(min_samples=25, min_cluster_size=50, gen_min_span_tree=True, prediction_data = True)

topic_model = BERTopic(umap_model=umap_model, hdbscan_model=hdbscan_model, calculate_probabilities=True, verbose=True)
topics,probs = topic_model.fit_transform(train)

topics_test, probs_test = topic_model.transform(test)

The error that comes up when I run this is shown in the attached screenshot (Screenshot 2023-06-02 at 5 26 26 PM).

Can you please guide me in solving this error?

slice-pranay avatar Jun 02 '23 12:06 slice-pranay

Perhaps this if block might be able to use cuML's membership_vector function to align with the CPU hdbscan:

https://github.com/MaartenGr/BERTopic/blob/fca5a4f9df149609c7e3458d6b2c421194cea62c/bertopic/cluster/_utils.py#L47-L56

Or, it could perhaps be updated to reflect that approximate_predict returns a tuple of (labels, probabilities) (even if only the probabilities will be returned by the function).

https://github.com/MaartenGr/BERTopic/blob/fca5a4f9df149609c7e3458d6b2c421194cea62c/bertopic/cluster/_utils.py#L22-L23
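
To make the second option concrete, here is a minimal, CPU-only sketch of why a tuple return value triggers the reported error. It does not call cuML: fake_approximate_predict is a hypothetical stand-in that only mimics the (labels, probabilities) return shape described above, and the values it returns are made up.

```python
def fake_approximate_predict(points):
    """Hypothetical stand-in (not cuML): returns a (labels, probabilities) tuple."""
    labels = [0 for _ in points]
    probabilities = [1.0 for _ in points]
    return labels, probabilities


result = fake_approximate_predict([[0.1, 0.2], [0.3, 0.4]])

# Treating the whole tuple as an array reproduces the error from the report.
try:
    result.shape
except AttributeError as exc:
    message = str(exc)

# Unpacking the tuple first gives access to each part separately.
labels, probabilities = result
print(message)
print(labels, probabilities)
```

Unpacking avoids the AttributeError, but the second element here is only one value per point, which is presumably why the membership_vector route is suggested as the way to align with the CPU hdbscan behaviour.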

beckernick avatar Jun 02 '23 19:06 beckernick

Ah, it seems indeed that the incorrect function is used there. I believe simply replacing:

from cuml.cluster.hdbscan.prediction import approximate_predict 
probabilities = approximate_predict(model, embeddings) 

with this should solve the issue:

from cuml.cluster.hdbscan.prediction import membership_vector
probabilities = membership_vector(model, embeddings) 

I can fix this in an upcoming release. PRs are also greatly appreciated!

MaartenGr avatar Jun 03 '23 04:06 MaartenGr

Thank you @MaartenGr, this change along with another change solved the problem. Just replacing approximate_predict with membership_vector gave another error:

ValueError: batch_size should be in integer that is >= 0 and <= the number of prediction points

After looking into the membership_vector function in the cuml.cluster.hdbscan.prediction.pyx file, I found another parameter, batch_size, which has a default value of 4096. The function is missing a check that reduces this value to the number of embeddings when that number is less than 4096. Adding this check in the function call itself solved the issue.

The final code that works for me is:

from cuml.cluster.hdbscan.prediction import membership_vector
probabilities = membership_vector(model, embeddings, batch_size=min(4096, len(embeddings))) 
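
The clamping logic in that call can be pulled out and checked on its own, without a GPU. This is only a sketch: clamp_batch_size is a hypothetical helper name (it is not part of cuML or BERTopic), and 4096 is the default value mentioned above.

```python
DEFAULT_BATCH_SIZE = 4096  # default value reported above


def clamp_batch_size(n_points, default=DEFAULT_BATCH_SIZE):
    """Hypothetical helper: never request a batch larger than the data."""
    return min(default, n_points)


# Fewer points than the default: the batch shrinks to fit the data.
print(clamp_batch_size(100))
# More points than the default: the default is kept.
print(clamp_batch_size(10_000))
```

Under that assumption, the workaround is equivalent to passing batch_size=clamp_batch_size(len(embeddings)).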

slice-pranay avatar Jun 05 '23 06:06 slice-pranay

@slice-pranay Awesome, thanks for diving into this! If you want, it would be great if you could create a PR for this. Otherwise, I can also add this in the coming weeks when I find some time. Either way, thanks for this!

MaartenGr avatar Jun 05 '23 07:06 MaartenGr

Thanks for surfacing this issue. When used like this, the batch_size parameter shouldn't be necessary (and shouldn't have any effect). The parameter is designed for scenarios with a large amount of data, where users may want to trade a little performance against peak memory requirements (though the default batch size of 4096 is likely the right choice, as it significantly reduces peak memory requirements with a very minor impact on performance). cuML should be handling this adjustment under the hood, as it already does for all_points_membership_vectors:

import cuml

X, y = cuml.make_blobs(n_samples=100, n_features=3)

clf = cuml.cluster.hdbscan.HDBSCAN(prediction_data=True).fit(X)
cuml.cluster.hdbscan.all_points_membership_vectors(clf)[:5]
array([[1.0000000e+00, 4.6776744e-40, 4.0108805e-40],
       [4.9417980e-02, 5.5743980e-01, 7.2683059e-02],
       [4.8842371e-02, 7.2603232e-01, 1.0369291e-01],
       [7.5122565e-01, 5.8568917e-02, 5.3385083e-02],
       [4.5487583e-02, 1.0042124e-01, 5.8100939e-01]], dtype=float32)
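
As a generic illustration of why the batch size should not change the result, here is a small pure-Python sketch of a row-wise computation done batch by batch. It is not cuML's implementation: normalize_rows_in_batches is a hypothetical function, and row normalization is just a stand-in for any per-point calculation.

```python
def normalize_rows_in_batches(rows, batch_size=4096):
    """Generic illustration (not cuML code): normalize each row, one batch at a time."""
    # Clamp so a batch is never larger than the data and never smaller than 1.
    batch_size = max(1, min(batch_size, len(rows)))
    out = []
    for start in range(0, len(rows), batch_size):
        # Only this slice is processed at once, which bounds the working set.
        batch = rows[start:start + batch_size]
        for row in batch:
            total = sum(row)
            out.append([value / total for value in row])
    return out


rows = [[1.0, 3.0], [2.0, 2.0], [4.0, 1.0], [5.0, 5.0], [1.0, 9.0]]
small = normalize_rows_in_batches(rows, batch_size=2)
large = normalize_rows_in_batches(rows, batch_size=4096)
print(small == large)
```

Because each row is handled independently, splitting the work into smaller batches only changes how much is processed at once, not the values produced.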

I've filed a cuML issue to track this bug. In the meantime, your suggested workaround makes sense!

beckernick avatar Jun 05 '23 13:06 beckernick

For completeness, this membership_vector bug has now been fixed in cuML. The fix won't be in the 23.06 stable release that is about to happen, but it is now available in the 23.08 nightly packages.

beckernick avatar Jun 06 '23 20:06 beckernick

Is this actually fixed in cuML 23.08? I have installed cuML using the instructions at https://docs.rapids.ai/install and from cuml import __version__ reports 23.08.00. Running the original poster's code example exactly as-is still produces the AttributeError: 'tuple' object has no attribute 'shape'. Is there something I am missing here?

HeadCase avatar Sep 08 '23 14:09 HeadCase

I'm facing the same issue with cuML 23.10.0 and BERTopic 0.16.0. Is there a workaround or fix available?

nilsblessing avatar Dec 07 '23 18:12 nilsblessing

As of last week, cuML 24.04 is now available. I think it's probably fair to say that almost everyone using cuML with BERTopic is using a version that supports the membership_vector function.

If there's interest and bandwidth from the maintainers to provide reviews, I'm happy to open a PR that resolves this issue and the implicitly equivalent https://github.com/MaartenGr/BERTopic/issues/1764 (essentially, an updated version of this PR)

cc @MaartenGr

beckernick avatar Apr 15 '24 20:04 beckernick

@beckernick Thanks, that would be great! This has been open for way too long (which is definitely my fault!), so a PR that updates this to use membership_vector sounds good. I also intend to release a minor version of BERTopic soon with many fixes, so that would be nice timing to have this included.

MaartenGr avatar Apr 18 '24 14:04 MaartenGr

Sounds good!

beckernick avatar Apr 18 '24 18:04 beckernick

Took a little longer than I'd anticipated to get hands on the keyboard, but I've opened a PR that resolves this issue.

The original example works with this PR:

from bertopic import BERTopic
from cuml.cluster import HDBSCAN
from cuml.manifold import UMAP
import pandas as pd
import numpy as np
from sklearn.datasets import fetch_20newsgroups
docs = fetch_20newsgroups(subset='all',  remove=('headers', 'footers', 'quotes'))['data']

train = docs[:15000]
test = docs[15000:]

umap_model = UMAP(n_components=5, n_neighbors=10, min_dist=0.0)
hdbscan_model = HDBSCAN(min_samples=25, min_cluster_size=50, gen_min_span_tree=True, prediction_data = True)

topic_model = BERTopic(umap_model=umap_model, hdbscan_model=hdbscan_model, calculate_probabilities=True, verbose=True)
topics,probs = topic_model.fit_transform(train)

topics_test, probs_test = topic_model.transform(test)
pd.Series(topics_test).value_counts()

2024-04-30 23:29:26,528 - BERTopic - Embedding - Transforming documents to embeddings.
Batches: 100%|█████████████████████████████████████████████████████████| 469/469 [00:14<00:00, 31.43it/s]
2024-04-30 23:29:42,841 - BERTopic - Embedding - Completed ✓
2024-04-30 23:29:42,842 - BERTopic - Dimensionality - Fitting the dimensionality reduction algorithm
2024-04-30 23:29:43,006 - BERTopic - Dimensionality - Completed ✓
2024-04-30 23:29:43,008 - BERTopic - Cluster - Start clustering the reduced embeddings
2024-04-30 23:29:43,170 - BERTopic - Cluster - Completed ✓
2024-04-30 23:29:43,175 - BERTopic - Representation - Extracting topics from clusters using representation models.
2024-04-30 23:29:46,567 - BERTopic - Representation - Completed ✓
Batches: 100%|█████████████████████████████████████████████████████████| 121/121 [00:03<00:00, 30.64it/s]
2024-04-30 23:29:51,410 - BERTopic - Dimensionality - Reducing dimensionality of input embeddings.
2024-04-30 23:29:51,431 - BERTopic - Dimensionality - Completed ✓
2024-04-30 23:29:51,432 - BERTopic - Clustering - Approximating new points with `hdbscan_model`
2024-04-30 23:29:51,439 - BERTopic - Probabilities - Start calculation of probabilities with HDBSCAN
2024-04-30 23:29:51,446 - BERTopic - Probabilities - Completed ✓
2024-04-30 23:29:51,447 - BERTopic - Cluster - Completed ✓
 0     1176
-1      551
 1      390
 2      362
 4      221
 3      190
 5      157
 6      155
 7      131
 8      122
 9       95
 10      66
 11      57
 12      42
 13      42
 14      40
 15      20
 17      17
 16      12
Name: count, dtype: int64

beckernick avatar May 01 '24 03:05 beckernick

@beckernick

Took a little longer than I'd anticipated to get hands on the keyboard, but I've opened a PR that resolves this issue.

That is all too familiar these days! So thanks for taking the time to create the PR. When it passes, I'll go ahead and merge it in preparation for a minor release.

MaartenGr avatar May 07 '24 14:05 MaartenGr