Can we have NounPhrases/KeyPhrases as n-grams (maybe using KeyBERT) instead of CountVectorizer n-grams?
Hi Maarten,
This is similar to #139 but slightly different: can we have more meaningful n-grams as top_n_words in topics? Instead of using CountVectorizer, could we use KeyBERT to generate noun phrases/key phrases and use those as n-grams to represent the topics?
Thank you in advance.
--hubgitadi
The main difficulty here is that KeyBERT uses quite a different procedure from BERTopic, and merging them would require significant changes to both.
Using KeyBERT directly, in place of c-TF-IDF, will likely result in overestimating representations, since it then imitates, to a certain extent, centroid-based extraction, whereas BERTopic uses a density-based clustering technique.
One thing that is in line with how both BERTopic and KeyBERT work is to use KeyBERT to generate candidate words/phrases for each document, and then use c-TF-IDF to calculate the topic representations. This avoids centroid-based techniques at the topic level but does use them at the document level, which might be a bit more suitable.
In other words, you would have to perform the following:
- Use KeyBERT to extract the keywords for each document
- Convert all keywords to a single, flat list of candidate words that can be used in BERTopic
- This list will serve as the vocabulary of the CountVectorizer
- Pass that list to a custom CountVectorizer:
CountVectorizer(vocabulary=my_vocab)
I have not tested it out, but it would be something like this (note that `docs` should be a list of documents, so that `extract_keywords` returns one list of (keyword, score) tuples per document):
from keybert import KeyBERT
from sklearn.feature_extraction.text import CountVectorizer
from bertopic import BERTopic

# Extract candidate keywords for each document
kw_model = KeyBERT()
keywords = kw_model.extract_keywords(docs)

# Flatten to a single, deduplicated list of candidate words
flat_keywords = [keyword for doc_keywords in keywords for keyword, _ in doc_keywords]
flat_keywords = list(set(flat_keywords))

# Pass the vocabulary to BERTopic through a custom CountVectorizer
vectorizer_model = CountVectorizer(vocabulary=flat_keywords)
topic_model = BERTopic(vectorizer_model=vectorizer_model)
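For illustration, the flatten-and-deduplicate step can be run without KeyBERT installed, on mock output of the same shape (one list of (keyword, score) tuples per document); the keywords and scores below are made up:

```python
# Mock output shaped like KeyBERT's extract_keywords(docs):
# one list of (keyword, score) tuples per document (values are invented).
mock_keywords = [
    [("topic modeling", 0.71), ("clustering", 0.55)],
    [("clustering", 0.64), ("embeddings", 0.49)],
]

# Flatten to one list of keyword strings, then deduplicate so the
# CountVectorizer vocabulary contains each candidate term only once.
flat_keywords = [kw for doc_kws in mock_keywords for kw, _ in doc_kws]
vocab = sorted(set(flat_keywords))
print(vocab)  # ['clustering', 'embeddings', 'topic modeling']
```

Sorting the deduplicated set is optional; it just makes the vocabulary order reproducible across runs.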
Lovely, this would be an amazing thing to try... let me try this and get back to you. Thank you for your prompt response.
Thank you Maarten, this is working. I am still analyzing the results, will keep you posted on this.