
How to merge topics automatically after getting the potential hierarchy of all topics

Open syGOAT opened this issue 1 year ago • 1 comments

I have read this part of the official documentation: https://maartengr.github.io/BERTopic/getting_started/hierarchicaltopics/hierarchicaltopics.html#visualizations:~:text=Merge%20topics,-%C2%B6 It is really a great way to create the potential hierarchical structure of topics and merge topics! I have a further question: how can I merge topics automatically after getting the potential hierarchy of all topics? For example, when I ran:

from scipy.cluster import hierarchy as sch

linkage_function = lambda x: sch.linkage(x, 'single', optimal_ordering=True)
hierarchical_topics = model.hierarchical_topics(abstracts, linkage_function=linkage_function)
model.visualize_hierarchy(hierarchical_topics=hierarchical_topics)

The figure is below: (image) I think the horizontal axis is the 'distance' between topics. The merge method in the official documentation requires specifying the indices of the topics. How can I merge topics automatically if the 'distance' between two topics is less than a certain number, such as 0.3? My model is defined like this:

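from bertopic import BERTopic
from bertopic.vectorizers import ClassTfidfTransformer
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import CountVectorizer
from umap import UMAP

umap_model = UMAP(n_neighbors=20, n_components=15, min_dist=0.0, metric='cosine', random_state=42)
cluster_model = KMeans(n_clusters=100, random_state=42)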
vectorizer_model = CountVectorizer(stop_words="english")

seed_words = [
    'materials','physical', ...
]
ctfidf_model = ClassTfidfTransformer(
    seed_words=seed_words, 
    seed_multiplier=5
)
model = BERTopic(embedding_model='./multilingual-e5-large-instruct', 
                 umap_model=umap_model,
                 min_topic_size=50,
                 ctfidf_model=ctfidf_model,        
                 hdbscan_model=cluster_model,
                 vectorizer_model=vectorizer_model, 
)

Version:

Name: bertopic
Version: 0.16.0
Summary: BERTopic performs topic Modeling with state-of-the-art transformer models.
Home-page: https://github.com/MaartenGr/BERTopic
Author: Maarten P. Grootendorst
Author-email: [email protected]

syGOAT · Mar 29 '24 05:03

When you run .hierarchical_topics, you get a dataframe that specifies the potential merging of topics and their distances, namely the hierarchical_topics variable in your code.

You can use this to select a threshold that you think works best for your use case. Based on the filtered dataframe, you can then extract the sets of topics that should be merged and merge them with .merge_topics.
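
A minimal sketch of that filtering step, not an official BERTopic recipe. It assumes each row of the hierarchy table carries the list of topics contained in a candidate merge and the distance at which that merge happens (in the versions I have seen, these are the `Topics` and `Distance` columns, but please check your own dataframe). The helper name `groups_below_threshold` and the toy data are made up for illustration. The idea is to keep every candidate merge below the threshold, then drop the ones that are already contained in a larger kept merge.

```python
from typing import Iterable, List, Sequence, Set, Tuple


def groups_below_threshold(
    rows: Iterable[Tuple[Sequence[int], float]],
    threshold: float,
) -> List[List[int]]:
    """Return the largest topic groups whose merge distance is below `threshold`.

    `rows` is an iterable of (topics, distance) pairs, one per candidate merge.
    This helper is hypothetical; it is not part of BERTopic.
    """
    # Keep only candidate merges that are close enough.
    candidates = [set(topics) for topics, distance in rows if distance < threshold]
    # Visit larger groups first so nested sub-merges can be skipped.
    candidates.sort(key=len, reverse=True)
    groups: List[Set[int]] = []
    for candidate in candidates:
        # Skip a candidate that is already covered by a kept group.
        if not any(candidate <= kept for kept in groups):
            groups.append(candidate)
    return sorted(sorted(group) for group in groups)


# Toy stand-in for the (topics, distance) pairs of a hierarchy table.
toy_rows = [
    ([0, 1], 0.10),
    ([0, 1, 2], 0.25),
    ([3, 4], 0.20),
    ([0, 1, 2, 3, 4], 0.80),
]
groups = groups_below_threshold(toy_rows, threshold=0.3)
print(groups)  # [[0, 1, 2], [3, 4]]
```

With the variables from the question, this would look roughly like `groups = groups_below_threshold(zip(hierarchical_topics["Topics"], hierarchical_topics["Distance"]), 0.3)` followed by `model.merge_topics(abstracts, groups)`. Passing all groups in a single `.merge_topics` call is deliberate: topic IDs can change after a merge, so merging group by group could end up targeting the wrong topics. The distances in the dataframe may not match the plot's axis exactly, so it is worth inspecting the `Distance` values before settling on a threshold.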

MaartenGr · Mar 29 '24 07:03