
transform is generating different topic labels

Open HengruiZhang opened this issue 3 years ago • 1 comment

Hello Maarten, first of all, thank you so much for developing such an awesome project! I am currently running into an issue. When I call fit_transform on a dataset, it fits the data and generates topics. However, when I take a sample of that dataset and call transform on the sample, only around half of the labels are the same as before. Do you have any idea what is going on here? Here are my model parameters and the related code:

import pandas as pd
from sentence_transformers import SentenceTransformer
import umap
import hdbscan
from sklearn.feature_extraction import text 
from sklearn.feature_extraction.text import CountVectorizer
from bertopic import BERTopic

model = SentenceTransformer('all-mpnet-base-v2')

umap_model = umap.UMAP(
    n_neighbors=100,
    n_components=10,
    min_dist=0.0,
    metric="cosine",
    random_state=42
)

hdbscan_model = hdbscan.HDBSCAN(
    min_cluster_size=100,
    metric='euclidean',
    cluster_selection_method='eom',
    prediction_data=True,
    min_samples=1,
    cluster_selection_epsilon=0.03
)

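# additional_stop_words is a list of extra stop words defined elsewhere
# CountVectorizer expects a list here; recent scikit-learn versions reject a frozenset
stop_words = list(text.ENGLISH_STOP_WORDS.union(additional_stop_words))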
countvec_model = CountVectorizer(ngram_range=(2, 3), stop_words=stop_words, min_df=5)
topic_model = BERTopic(verbose=True, hdbscan_model=hdbscan_model, umap_model=umap_model, vectorizer_model=countvec_model, top_n_words=10, nr_topics='auto', low_memory=True)
embeddings = model.encode(data['resolution'].to_list())
topics, probs = topic_model.fit_transform(data['resolution'].to_list(), embeddings)

# sample part
data_sample = data.sample(n=10000)
embeddings_sample = model.encode(data_sample['resolution'].to_list())
topics_sample, probs_sample = topic_model.transform(data_sample['resolution'].to_list(), embeddings_sample)

When I compare the topics that fit_transform() assigned to the sampled documents with topics_sample, only about 50% are the same. The full dataset contains around 300k documents.
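
One way to measure that agreement is to look up the labels fit_transform() assigned to the sampled rows and compare them element-wise with the labels returned by transform(). The following is a minimal sketch, not part of BERTopic: the helper name label_agreement is hypothetical, the toy labels are made up for illustration, and it assumes you know each sampled row's position in the original topics list.

```python
import numpy as np


def label_agreement(fit_topics, sample_positions, sample_topics):
    """Return the fraction of sampled documents whose label is unchanged.

    fit_topics       -- labels from fit_transform() for the full dataset
    sample_positions -- position of each sampled document in fit_topics
    sample_topics    -- labels from transform() for the sampled documents
    """
    fit_topics = np.asarray(fit_topics)
    sample_positions = np.asarray(sample_positions)
    sample_topics = np.asarray(sample_topics)
    if sample_positions.shape != sample_topics.shape:
        raise ValueError("need exactly one position per sampled label")
    # Look up the original label of every sampled document, then compare.
    original = fit_topics[sample_positions]
    return float(np.mean(original == sample_topics))


# Toy labels standing in for real model output (-1 is the outlier topic).
toy_fit_topics = [0, 1, 1, -1, 2, 0]
toy_positions = [4, 0, 3, 1]
toy_sample_topics = [2, 0, 1, 1]
toy_rate = label_agreement(toy_fit_topics, toy_positions, toy_sample_topics)
print(f"{toy_rate:.0%} of sampled labels match")
```

With the variables from the code above, the call would be label_agreement(topics, data.index.get_indexer(data_sample.index), topics_sample), assuming the index of data has no duplicate labels.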

HengruiZhang, Aug 04 '22 23:08

This is likely related to the way HDBSCAN calculates predictions for new points. If you passed the exact same points in the exact same order, it would predict the same topics. However, it seems to generate different results when a different distribution of points is supplied. In other words, it uses all of these points together to make its prediction. You can find a bit more about that here.
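
To check whether predictions depend on which other points are in the batch, one can predict a subset on its own and compare the result with the same rows predicted as part of the full batch. The following is a minimal sketch using only NumPy: batch_agreement is a hypothetical helper, predict stands for any function that maps an array of points to one label per point, and the two toy predictors are illustrations only. None of these names come from HDBSCAN or BERTopic.

```python
import numpy as np


def batch_agreement(predict, points, subset_positions):
    """Fraction of subset points labelled the same alone and in the full batch."""
    points = np.asarray(points)
    subset_positions = np.asarray(subset_positions)
    # Labels for the subset when it is predicted together with all points.
    in_full_batch = np.asarray(predict(points))[subset_positions]
    # Labels for the same points when only the subset is passed in.
    alone = np.asarray(predict(points[subset_positions]))
    return float(np.mean(in_full_batch == alone))


def pointwise(batch):
    # Toy predictor: each label depends only on the point itself.
    return (batch[:, 0] > 0).astype(int)


def batch_relative(batch):
    # Toy predictor: each label depends on the mean of the whole batch.
    return (batch[:, 0] > batch[:, 0].mean()).astype(int)


toy_points = np.array([[-2.0], [-1.0], [1.0], [2.0], [3.0], [4.0]])
toy_subset = np.array([2, 3, 4, 5])
pointwise_rate = batch_agreement(pointwise, toy_points, toy_subset)
relative_rate = batch_agreement(batch_relative, toy_points, toy_subset)
print(pointwise_rate, relative_rate)
```

A rate of 1.0 means the subset's labels did not change with the rest of the batch; a lower rate means they did. To run this against a fitted model, predict would need to wrap that model's own prediction step.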

MaartenGr, Aug 05 '22 06:08

Due to inactivity, I'll be closing this for now. Let me know if you have any other questions related to this and I'll make sure to re-open the issue!

MaartenGr, Sep 27 '22 08:09