
Unexpected Behavior After Merging Topics

Open CoandaEffect opened this issue 3 years ago • 5 comments

As I run my analyses, I find that the outputs from my initial model are usually very good, but they still require a certain amount of tweaking to work perfectly. I was using the reduce_topics method for this, but have recently switched to the merge_topics method for the greater degree of control that it affords. However, I have run into some issues:

  1. When I try to get the topic mapping to each doc using topic_model._map_predictions(topic_model.hdbscan_model.labels_), the topics that it outputs do not match those that are summarized when I run get_topic_info(). In a recent example, the summary showed 13 topics, while there were only 3 unique topics in the list of mappings from hdbscan_model.labels_.

  2. Most of the visualizations look good, including visualize_topics(), visualize_hierarchy(hierarchical_topics=hierarchical_topics), and visualize_topics_per_class(). However, visualize_documents() appears to be using the bad mappings from problem 1, giving me an output like this:

[Screenshot (Jul 22, 2022): visualize_documents() output showing the mismatched topic assignments]

It seems like there must be accurate topic mappings stored somewhere, otherwise none of the visualizations and summaries would work. Am I missing something obvious?
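
For reference, here is a minimal sketch of how the mismatch shows up, assuming a fitted topic_model with an HDBSCAN backend and following the same _map_predictions call as above:

```python
import pandas as pd

# Compare the per-document topic assignments against the topic summary table.
doc_topics = pd.Series(topic_model._map_predictions(topic_model.hdbscan_model.labels_))
summary = topic_model.get_topic_info()

print(doc_topics.nunique(), "unique topics in the per-document mapping")
print(len(summary), "topics (including the outlier topic) reported by get_topic_info()")
```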

Thanks!

CoandaEffect avatar Jul 22 '22 17:07 CoandaEffect

Update: get_representative_docs() seems to be drawing from the original labels, rather than the merged labels.
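
A quick way to see this, assuming a fitted and merged topic_model, is to loop over the topics reported by get_topic_info() and ask for their representative documents:

```python
# For each topic in the merged summary, check which representative docs come back.
for topic_id in topic_model.get_topic_info()["Topic"]:
    if topic_id == -1:  # skip the outlier topic
        continue
    rep_docs = topic_model.get_representative_docs(topic_id)
    print(topic_id, len(rep_docs))
```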

CoandaEffect avatar Jul 22 '22 21:07 CoandaEffect

Thank you for sharing this issue. From what I can see, there might be an issue with the way merge_topics is currently working, but I cannot be sure. Could you share the entire code that produces these issues, including training and merging topics?

MaartenGr avatar Jul 23 '22 08:07 MaartenGr

Thanks Maarten! Sorry for the slight mess. I copied this out of the notebook that I've been troubleshooting in.

```python
import copy

from bertopic import BERTopic
from sentence_transformers import SentenceTransformer

# vectorizer_model (a custom count vectorizer with our stopword list), docs, and
# df_for_processing are defined earlier in the notebook.

# Calculate embeddings
sentence_model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = sentence_model.encode(docs, show_progress_bar=True)

# Initialize the topic model
# (applying a custom count vectorizer in order to use our custom stopword list)
topic_model = BERTopic(verbose=True, n_gram_range=(1, 2), min_topic_size=30,
                       embedding_model=sentence_model, vectorizer_model=vectorizer_model)

# Run it!
topics, probs = topic_model.fit_transform(docs, embeddings)
topic_model.save("my_model", save_embedding_model=False)

topic_model.get_topic_info().head(1000)
topic_model.get_topic_info().to_excel('topic_summary.xlsx')

# Save out labeled topics
output_df = copy.deepcopy(df_for_processing)
output_df['Topics'] = topic_model._map_predictions(topic_model.hdbscan_model.labels_)
output_df.to_excel('Labeled_Docs.xlsx')

# Save out summary along with representative docs for each topic to help understand them
summary = topic_model.get_topic_info()
summary.drop(0, inplace=True)  # drop the first row (the outlier topic)

# Iterate through representative docs and add them to the summary
def get_rep_docs_by_row(row):
    return topic_model.get_representative_docs(row)

summary["Representative Docs"] = summary["Topic"].apply(get_rep_docs_by_row)
summary.to_excel('Topic Summary.xlsx')

# Visualizations
fig = topic_model.visualize_topics()
fig.write_html('Topic Distances.html')
topic_model.visualize_topics()

fig = topic_model.visualize_documents(docs, embeddings=embeddings, hide_document_hover=False)
fig.write_html('Embedding Space.html')
topic_model.visualize_documents(docs, embeddings=embeddings, hide_document_hover=False)

hierarchical_topics = topic_model.hierarchical_topics(docs, topics)
fig = topic_model.visualize_hierarchy(hierarchical_topics=hierarchical_topics)
fig.write_html("Topic Heirarchy.html")
topic_model.visualize_hierarchy(hierarchical_topics=hierarchical_topics)

# Run the hierarchy visualization with the original embeddings
topic_model.visualize_hierarchical_documents(docs, hierarchical_topics,
                                             embeddings=embeddings, hide_document_hover=False)

# Topics by class
topics_per_class = topic_model.topics_per_class(docs, topics, classes=df_for_processing['Classes'])
fig = topic_model.visualize_topics_per_class(topics_per_class, top_n_topics=30, normalize_frequency=True)
fig.write_html('Topics by Class.html')
topic_model.visualize_topics_per_class(topics_per_class, top_n_topics=30, normalize_frequency=True)

# Merge topics
topic_model = BERTopic.load("my_model")  # Reload model to avoid conflicts
topic_model.update_topics(docs, topics, vectorizer_model=vectorizer_model)  # Re-apply changes to topic labeling (e.g., stopwords)

topics_to_merge = [[45, 37, 20, 38, 26, 22, 10, 27], [19, 43], [23, 36, 35],
                   [42, 31, 24, 33, 13, 39, 28], [16, 1], [40, 29, 18], [47, 34],
                   [9, 0, 41, 44, 5], [21, 6, 2, 30, 8], [32, 11]]
topic_model.merge_topics(docs, topics, topics_to_merge)
topic_model.save("my_model_merged", save_embedding_model=False)

topic_model.get_topic_info().head(1000)

# Save out labeled topics
output_df = copy.deepcopy(df_for_processing)
output_df['Topics'] = topic_model._map_predictions(topic_model.hdbscan_model.labels_)
output_df.to_excel('Labeled_Docs_Merged.xlsx')

topic_model.visualize_topics()

# Run the visualization with the original embeddings
topic_model.visualize_documents(docs, embeddings=embeddings, hide_document_hover=False)

hierarchical_topics = topic_model.hierarchical_topics(docs, topics)
fig = topic_model.visualize_hierarchy(hierarchical_topics=hierarchical_topics)
fig.write_html("Topic Heirarchy Merged.html")
topic_model.visualize_hierarchy(hierarchical_topics=hierarchical_topics)

fig = topic_model.visualize_topics_per_class(topics_per_class, top_n_topics=20, normalize_frequency=True)
fig.write_html("Topics by Class Merged.html")
topic_model.visualize_topics_per_class(topics_per_class, top_n_topics=20, normalize_frequency=True)
```

CoandaEffect avatar Jul 25 '22 15:07 CoandaEffect

One thing I've noticed is that it tends to get more and more messed up with each iteration if I go through multiple rounds of merges.
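
For what it's worth, the pattern I'm describing is roughly this (the topic IDs here are made up for illustration):

```python
# Sketch of multiple merge rounds; the topics list passed in is the one returned
# by fit_transform, which becomes stale after the first round of merging.
for round_of_merges in ([[4, 7], [2, 9]], [[1, 3]]):
    topic_model.merge_topics(docs, topics, round_of_merges)
    print(len(topic_model.get_topic_info()), "topics after this round of merging")
```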

CoandaEffect avatar Jul 25 '22 17:07 CoandaEffect

I just checked the code of merge_topics and I believe I understand the issue here. It seems that the topics are not properly updated across some of the functions. It is something that definitely can be fixed but most likely will require some work across BERTopic.

MaartenGr avatar Jul 25 '22 18:07 MaartenGr

Hey Maarten! It seems as if topic_model.merge_topics() still does not propagate the changes internally:

If I run topic_model.get_topic_info(), then topic_model.merge_topics(docs, [1,2]), and then topic_model.get_topic_info() again, I get the same number of topics back. Is there any way to update the topic_model manually?
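
For clarity, a minimal version of what I am running (assuming a fitted topic_model and the original docs):

```python
print(len(topic_model.get_topic_info()), "topics before merging")
topic_model.merge_topics(docs, [1, 2])  # intended to merge topic 2 into topic 1
print(len(topic_model.get_topic_info()), "topics after merging")
```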

darebfh avatar Nov 17 '22 08:11 darebfh

@darebfh Which version of BERTopic are you currently using?

MaartenGr avatar Nov 18 '22 06:11 MaartenGr

I'm using the latest build, 0.12.0.

darebfh avatar Nov 18 '22 07:11 darebfh

Hey Maarten, sorry for bothering you, of course the error was on my side! ;) When creating the arrays of topics to be merged, I assumed that the user input would be implicitly converted to int, but it was parsed as a string, so the merge was not applied at all. You COULD add a "Received string, expected int" exception, but of course this would be the cherry on top ;)
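
In case anyone else runs into this, the fix on my side was simply to cast the collected topic IDs to int before calling merge_topics:

```python
# Topic IDs collected from user input arrive as strings ("1", "2"), which never
# match the integer topic labels, so the merge silently does nothing.
raw_ids = ["1", "2"]                         # hypothetical user input
topics_to_merge = [int(t) for t in raw_ids]  # cast explicitly before merging
topic_model.merge_topics(docs, topics_to_merge)
```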

All the best and thanks for this awesome tool!

darebfh avatar Nov 18 '22 08:11 darebfh

@darebfh Thanks for the kind words and glad to hear that the issue was resolved! I'll keep it in mind :)

MaartenGr avatar Nov 19 '22 07:11 MaartenGr