Merge words in topic representation
Hi developers,
Thank you for your outstanding work. When examining the bar chart visualization of the generated topics, I noticed that some words in a representation are really the same word, like "lab" and "labs", or "bio" and "biology". My questions are as follows.
- Is there a convenient way to specify the words to be merged and update the visualization?
- If I have to update it manually, should I sum up the TF-IDF scores of the merged words as the new TF-IDF score?
- How do I bring in new words and their TF-IDF scores to fill the gap left by the merged words? I saw that the get_topic() method returns at most 10 words.
Thank you for your kind words!
Typically, if words are very similar to one another, it helps to set the diversity parameter when training BERTopic. It is a value between 0 and 1: the higher you set it, the more diverse the words within a single topic representation will be. For example, "lab" and "labs" are very similar, so typically only one of them will be kept if you set a higher diversity. Starting at around 0.2 is a good way to get a feeling for its effect.
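For intuition, BERTopic applies diversity through Maximal Marginal Relevance (MMR): each new word is scored by its relevance to the topic minus a penalty for similarity to the words already picked. Here is a toy, self-contained sketch of that idea — the made-up 2-D "embeddings" and the simplified scoring are illustrative assumptions, not BERTopic's actual implementation:

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def mmr(word_vectors, topic_vector, diversity, top_n=3):
    # Maximal Marginal Relevance: balance relevance to the topic against
    # similarity to the words already selected.
    candidates = dict(word_vectors)
    first = max(candidates, key=lambda w: cosine(candidates[w], topic_vector))
    selected = [first]
    del candidates[first]
    while candidates and len(selected) < top_n:
        def score(w):
            relevance = cosine(candidates[w], topic_vector)
            redundancy = max(cosine(candidates[w], word_vectors[s]) for s in selected)
            return (1 - diversity) * relevance - diversity * redundancy
        best = max(candidates, key=score)
        selected.append(best)
        del candidates[best]
    return selected

# Made-up 2-D vectors: "lab" and "labs" point in almost the same direction
words = {"lab": (1.0, 0.0), "labs": (0.99, 0.1), "gene": (0.0, 1.0)}
topic = (1.0, 0.5)
print(mmr(words, topic, diversity=0.0, top_n=2))  # ['labs', 'lab']
print(mmr(words, topic, diversity=0.5, top_n=2))  # ['labs', 'gene']
```

With diversity at 0 the two near-duplicates are both kept; at 0.5 the redundancy penalty pushes out "lab" in favor of the unrelated "gene".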
> Is there a convenient way to specify the words to be merged and update the visualization?
You might be able to do it as follows:
```python
topic_model.diversity = 0.2
topic_model.update_topics(docs, topics)
```
That way, the topic representations get recalculated, this time with the updated diversity parameter.
> If I have to update it manually, should I sum up the TF-IDF scores of the merged words as the new TF-IDF score?
If you were to do it manually, then that depends on the assumptions behind merging them. The diversity metric does not merge words; it only treats them as separate, independent instances.
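If summing is the assumption you settle on, the bookkeeping itself is simple. A minimal sketch on a plain list of (word, score) pairs, shaped like what get_topic() returns — the merge rule and the numbers here are illustrative, not something BERTopic does for you:

```python
def merge_word_scores(word_scores, merges):
    """Merge (word, score) pairs by summing the scores of words that map
    to the same canonical form, then re-sort by score.

    `merges` maps a variant to its canonical form, e.g. {"labs": "lab"}.
    """
    totals = {}
    for word, score in word_scores:
        canonical = merges.get(word, word)
        totals[canonical] = totals.get(canonical, 0.0) + score
    return sorted(totals.items(), key=lambda ws: ws[1], reverse=True)

# Hypothetical topic representation
topic_words = [("lab", 0.31), ("bio", 0.25), ("labs", 0.12), ("gene", 0.10)]
print(merge_word_scores(topic_words, {"labs": "lab"}))
# "lab" now carries the combined score (~0.43) and the list stays sorted
```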
> How to draw new words and their TF-IDF scores to fill up the blank left by the merged words? I saw the get_topic() method only returns at most 10 words.
This becomes rather tricky, as you would have to update the bag-of-words first and remove those words from its vocabulary. What you could do is find all duplicate words, add them to the stop_words of a custom CountVectorizer model, and pass that model to .update_topics.
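One way to assemble that stop-word list is to group the vocabulary by some canonical form and keep only the first variant of each group; every other variant becomes a stop word. A stdlib-only sketch — the strip-a-trailing-"s" rule is a deliberately naive stand-in for a real lemmatizer, and the resulting list would be an assumption you pass via CountVectorizer(stop_words=...):

```python
def duplicate_stop_words(vocabulary, canonical):
    """Group words by their canonical form; for each group keep the first
    variant seen and mark the rest as stop words."""
    keep = {}
    stops = []
    for word in vocabulary:
        form = canonical(word)
        if form in keep:
            stops.append(word)  # a variant of an already-kept word
        else:
            keep[form] = word
    return stops

# Toy canonicalisation: strip a trailing plural "s" (a lemmatizer would be
# far more robust in practice)
strip_s = lambda w: w[:-1] if w.endswith("s") and len(w) > 3 else w

vocab = ["lab", "labs", "gene", "genes", "bio"]
print(duplicate_stop_words(vocab, strip_s))  # ['labs', 'genes']
```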
Hi Maarten,
Thank you for your illuminating reply; I will try out different diversity values. For reference, I also searched the existing issues a bit more and found a solution that works pretty well for me (https://github.com/MaartenGr/BERTopic/issues/286).
Building on that, to specify the words to merge, I created a spreadsheet with their mapping relationships (bio - biology) and force each word in the text to be replaced by its standard form (a rather dumb method!). My code looks like this:
```python
import pandas as pd
from bertopic import BERTopic
from sklearn.feature_extraction.text import CountVectorizer
from nltk import word_tokenize
from nltk.stem import WordNetLemmatizer


class LemmaTokenizer:
    def __init__(self):
        self.wnl = WordNetLemmatizer()
        # Load the mapping once instead of re-reading it for every document
        mapping = pd.read_excel(r"word_replacement.xlsx")
        self.replacements = dict(zip(mapping["before"], mapping["after"]))

    def __call__(self, doc):
        res = []
        for t in word_tokenize(doc):
            if str(t).isnumeric():  # skip purely numeric tokens
                continue
            t = self.wnl.lemmatize(t)
            # Replace the word with its standard form if a mapping exists
            res.append(self.replacements.get(t, t))
        return res


vectorizer_model = CountVectorizer(
    tokenizer=LemmaTokenizer(), ngram_range=(1, 2), stop_words="english", min_df=10
)
topic_model.update_topics(docs, topics, vectorizer_model=vectorizer_model)
```
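For reference, the replacement step itself only needs a plain dict if you would rather skip the spreadsheet round-trip. A stdlib-only sketch of the same idea, with a naive regex tokenizer standing in for NLTK and a hypothetical mapping:

```python
import re

# Hypothetical mapping from variant to standard form
REPLACEMENTS = {"bio": "biology", "lab": "laboratory"}

def replace_tokens(doc, replacements):
    """Tokenize naively, drop numeric tokens, and map each token to its
    standard form when a replacement is defined."""
    tokens = re.findall(r"[A-Za-z0-9]+", doc.lower())
    return [replacements.get(t, t) for t in tokens if not t.isnumeric()]

print(replace_tokens("Bio lab 101 results", REPLACEMENTS))
# ['biology', 'laboratory', 'results']
```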
Glad to hear that that solution also works for you! The CountVectorizer definitely is a great method to further process the documents and get the topic representations that you are looking for. If you have any other questions and/or suggestions, please let me know :)