Can we have NounPhrases/KeyPhrases as n-grams (maybe using KeyBERT) instead of CountVectorizer n-grams?
Hi Maarten,
This is similar to #139 but slightly different: can we have more meaningful n-grams as top_n_words in topics? Instead of using CountVectorizer, could we use KeyBERT to generate noun phrases/key phrases and use those as n-grams to represent the topics?
Thank you in advance.
--hubgitadi
The main difficulty here is that KeyBERT uses quite a different procedure from BERTopic, and merging them would require significant changes to both.
Using KeyBERT directly, in place of c-TF-IDF, will likely result in overestimating representations, since it then imitates, to a certain extent, centroid-based extraction, whereas BERTopic uses a density-based clustering technique.
One thing that is in line with how both BERTopic and KeyBERT work is to use KeyBERT to generate candidate words/phrases for each document, and then use c-TF-IDF to calculate the topic representations. This avoids centroid-based techniques at the topic level but does use them at the document level, which might be a bit more suitable.
In other words, you would have to perform the following:
- Use KeyBERT to extract the keywords for each document
- Convert all keywords to a single, flat list of candidate words that can be used in BERTopic
- This list will serve as the vocabulary of the CountVectorizer
- Pass that list to a custom CountVectorizer:
CountVectorizer(vocabulary=my_vocab)
I have not tested it out, but it would be something like this (note that `docs` should be a list of documents, so that `extract_keywords` returns one list of (keyword, score) tuples per document):
from keybert import KeyBERT
from sklearn.feature_extraction.text import CountVectorizer
from bertopic import BERTopic

# Extract candidate keywords for each document
kw_model = KeyBERT()
keywords = kw_model.extract_keywords(docs)

# Flatten to a single, deduplicated list of candidate words
flat_keywords = [keyword for doc_keywords in keywords for keyword, _ in doc_keywords]
flat_keywords = list(set(flat_keywords))

# Pass the vocabulary to BERTopic through a custom CountVectorizer
vectorizer_model = CountVectorizer(vocabulary=flat_keywords)
topic_model = BERTopic(vectorizer_model=vectorizer_model)
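For illustration, the flatten-and-deduplicate step can be run without KeyBERT installed, on mock output of the same shape (one list of (keyword, score) tuples per document); the keywords and scores below are made up:

```python
# Mock output shaped like KeyBERT's extract_keywords(docs):
# one list of (keyword, score) tuples per document (values are invented).
mock_keywords = [
    [("topic modeling", 0.71), ("clustering", 0.55)],
    [("clustering", 0.64), ("embeddings", 0.49)],
]

# Flatten to one list of keyword strings, then deduplicate so the
# CountVectorizer vocabulary contains each candidate term only once.
flat_keywords = [kw for doc_kws in mock_keywords for kw, _ in doc_kws]
vocab = sorted(set(flat_keywords))
print(vocab)  # ['clustering', 'embeddings', 'topic modeling']
```

Sorting the deduplicated set is optional; it just makes the vocabulary order reproducible across runs.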
Lovely, this would be an amazing thing to try... let me try this and get back to you. Thank you for your prompt response.
Thank you Maarten, this is working. I am still analyzing the results, will keep you posted on this.