Unexpected Behavior After Merging Topics
As I run my analyses, I find that the outputs from my initial model are usually very good, but still require a certain amount of tweaking to work perfectly. I was using the `reduce_topics` method for this, but have recently switched to the `merge_topics` method for the greater degree of control it affords. However, I have run into some issues:
- When I try to get the topic mapping for each doc using `topic_model._map_predictions(topic_model.hdbscan_model.labels_)`, the topics it outputs do not match those that are summarized when I run `get_topic_info()`. In a recent example, the summary showed 13 topics, while there were only 3 unique topics in the list of mappings from `hdbscan_model.labels_`.
- Most of the visualizations look good, including `visualize_topics()`, `visualize_hierarchy(hierarchical_topics=hierarchical_topics)`, and `visualize_topics_per_class()`. However, `visualize_documents()` appears to be using the bad mappings from problem 1, giving me an output like this:

It seems like there must be accurate topic mappings stored somewhere; otherwise none of the visualizations and summaries would work. Am I missing something obvious?
Thanks!
Update: `get_representative_docs()` seems to be drawing from the original labels rather than the merged labels.
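For anyone hitting the same thing, here is a quick way to check whether the two label sources agree. The numbers below are hypothetical stand-ins for the real outputs, not actual model results:

```python
# Hypothetical stand-ins for the two label sources described above:
# what get_topic_info() reports vs. what
# _map_predictions(hdbscan_model.labels_) returns after merging.
summary_topics = set(range(-1, 12))          # 13 topics in the summary
mapped_labels = [-1, 0, 1, 0, 1, -1, 0, 1]   # per-document mappings

unique_mapped = sorted(set(mapped_labels))   # only 3 unique topics here
print(f"summary: {len(summary_topics)} topics, "
      f"mappings: {len(unique_mapped)} unique -> "
      f"consistent: {len(unique_mapped) == len(summary_topics)}")
```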
Thank you for sharing this issue. From what I can see, there might be an issue with the way `merge_topics` is currently working, but I cannot be sure. Can you share your entire code for reproducing these issues, including training and merging topics?
Thanks Maarten! Sorry for the slight mess. I copied this out of the notebook that I've been troubleshooting in.
```python
from bertopic import BERTopic
from sentence_transformers import SentenceTransformer
import copy

# (`docs`, `df_for_processing`, and `vectorizer_model` are defined earlier in the notebook)

# Calculate embeddings
sentence_model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = sentence_model.encode(docs, show_progress_bar=True)

# Initialize topic model
# Applying a custom count vectorizer in order to use our custom stopword list
topic_model = BERTopic(
    verbose=True,
    n_gram_range=(1, 2),
    min_topic_size=30,
    embedding_model=sentence_model,
    vectorizer_model=vectorizer_model,
)

# Run it!
topics, probs = topic_model.fit_transform(docs, embeddings)
topic_model.save("my_model", save_embedding_model=False)

topic_model.get_topic_info().head(1000)
topic_model.get_topic_info().to_excel('topic_summary.xlsx')

# Save out labeled topics
output_df = copy.deepcopy(df_for_processing)
output_df['Topics'] = topic_model._map_predictions(topic_model.hdbscan_model.labels_)
output_df.to_excel('Labeled_Docs.xlsx')

# Save out summary along with representative docs for each topic to help understand them
summary = topic_model.get_topic_info()
summary.drop(0, inplace=True)

# Iterate through representative docs and add to summary
def get_rep_docs_by_row(row):
    return topic_model.get_representative_docs(row)

summary["Representative Docs"] = summary["Topic"].apply(get_rep_docs_by_row)
summary.to_excel('Topic Summary.xlsx')

fig = topic_model.visualize_topics()
fig.write_html('Topic Distances.html')
topic_model.visualize_topics()

fig = topic_model.visualize_documents(docs, embeddings=embeddings, hide_document_hover=False)
fig.write_html('Embedding Space.html')
topic_model.visualize_documents(docs, embeddings=embeddings, hide_document_hover=False)

hierarchical_topics = topic_model.hierarchical_topics(docs, topics)
fig = topic_model.visualize_hierarchy(hierarchical_topics=hierarchical_topics)
fig.write_html("Topic Hierarchy.html")
topic_model.visualize_hierarchy(hierarchical_topics=hierarchical_topics)

# Run the hierarchy visualization with the original embeddings
topic_model.visualize_hierarchical_documents(docs, hierarchical_topics, embeddings=embeddings, hide_document_hover=False)

# Topics by class
topics_per_class = topic_model.topics_per_class(docs, topics, classes=df_for_processing['Classes'])
fig = topic_model.visualize_topics_per_class(topics_per_class, top_n_topics=30, normalize_frequency=True)
fig.write_html('Topics by Class.html')
topic_model.visualize_topics_per_class(topics_per_class, top_n_topics=30, normalize_frequency=True)

# Merge topics
topic_model = BERTopic.load("my_model")  # Reload model to avoid conflicts
topic_model.update_topics(docs, topics, vectorizer_model=vectorizer_model)  # For applying changes to topic labeling (e.g., stopwords)
topics_to_merge = [
    [45, 37, 20, 38, 26, 22, 10, 27],
    [19, 43],
    [23, 36, 35],
    [42, 31, 24, 33, 13, 39, 28],
    [16, 1],
    [40, 29, 18],
    [47, 34],
    [9, 0, 41, 44, 5],
    [21, 6, 2, 30, 8],
    [32, 11],
]
topic_model.merge_topics(docs, topics, topics_to_merge)
topic_model.save("my_model_merged", save_embedding_model=False)
topic_model.get_topic_info().head(1000)

# Save out labeled topics
output_df = copy.deepcopy(df_for_processing)
output_df['Topics'] = topic_model._map_predictions(topic_model.hdbscan_model.labels_)
output_df.to_excel('Labeled_Docs_Merged.xlsx')

topic_model.visualize_topics()

# Run the visualization with the original embeddings
topic_model.visualize_documents(docs, embeddings=embeddings, hide_document_hover=False)

hierarchical_topics = topic_model.hierarchical_topics(docs, topics)
fig = topic_model.visualize_hierarchy(hierarchical_topics=hierarchical_topics)
fig.write_html("Topic Hierarchy Merged.html")
topic_model.visualize_hierarchy(hierarchical_topics=hierarchical_topics)

fig = topic_model.visualize_topics_per_class(topics_per_class, top_n_topics=20, normalize_frequency=True)
fig.write_html("Topics by Class Merged.html")
topic_model.visualize_topics_per_class(topics_per_class, top_n_topics=20, normalize_frequency=True)
```
One thing that I've noticed is that the results tend to get more and more messed up with each iteration if I go through multiple rounds of merges.
I just checked the code of `merge_topics` and I believe I understand the issue here. It seems that the topics are not properly updated across some of the functions. It can definitely be fixed, but it will most likely require some work across BERTopic.
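In the meantime, the per-document labels can be rebuilt by hand from the merge specification. This is a sketch, not BERTopic's internal implementation; it assumes each merged group is relabeled to the first id in the group, which may differ from the ids `merge_topics` actually assigns:

```python
def build_merge_mapping(topics_to_merge):
    """Map each old topic id to the first id of its merge group."""
    mapping = {}
    for group in topics_to_merge:
        target = group[0]
        for old_id in group:
            mapping[old_id] = target
    return mapping

def remap(doc_topics, mapping):
    """Relabel per-document topics; outliers (-1) and unmerged ids pass through."""
    return [mapping.get(t, t) for t in doc_topics]

mapping = build_merge_mapping([[19, 43], [23, 36, 35]])
print(remap([-1, 19, 43, 23, 36, 7], mapping))  # [-1, 19, 19, 23, 23, 7]
```

For multiple rounds of merges, the same helpers can be applied round by round to the already-remapped labels, which avoids the drift described above.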
Hey Maarten! It seems as if `topic_model.merge_topics()` still does not propagate the changes internally. If I run `topic_model.get_topic_info()`, then `topic_model.merge_topics(docs, [1, 2])`, and then `topic_model.get_topic_info()` again, I get the same number of topics back both times. Is there any way of updating the `topic_model` manually?
@darebfh Which version of BERTopic are you currently using?
I'm using the latest build, 0.12.0:

Hey Maarten, sorry for bothering you; of course the error was on my side! ;) When creating the arrays of topics to be merged, I assumed that the user input would be implicitly converted to int, but it was parsed as a string, so the merge was not applied at all. You COULD add a "Received string, expected int" exception, but of course that would be the cherry on top ;)
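For reference, a guard along those lines could look roughly like this (a hypothetical helper, not part of BERTopic):

```python
def validate_topics_to_merge(topics_to_merge):
    """Reject non-int topic ids early instead of silently merging nothing."""
    for group in topics_to_merge:
        for topic_id in group:
            if not isinstance(topic_id, int):
                raise TypeError(
                    f"Received {type(topic_id).__name__} {topic_id!r}, expected int"
                )
    return topics_to_merge

validate_topics_to_merge([[1, 2]])          # passes
try:
    validate_topics_to_merge([["1", "2"]])  # what the bad input looked like
except TypeError as e:
    print(e)  # Received str '1', expected int
```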
All the best and thanks for this awesome tool!
@darebfh Thanks for the kind words and glad to hear that the issue was resolved! I'll keep it in mind :)