BERTopic

Can we have noun phrases/keyphrases as n-grams (maybe using KeyBERT) instead of CountVectorizer n-grams?

Open hubgitadi opened this issue 3 years ago • 2 comments

Hi Maarten,

This is similar to #139 but slightly different. Can we have more meaningful n-grams as top_n_words in topics? Instead of using CountVectorizer, could we use KeyBERT to generate noun phrases/keyphrases and use them as n-grams to represent the topics?

Thank you in advance.

--hubgitadi

hubgitadi • Jul 25 '22 06:07

The main difficulty here is that KeyBERT uses quite a different procedure from BERTopic and merging them would require some significant changes to both procedures.

Using KeyBERT directly, in place of c-TF-IDF, will likely result in overestimating representations since it then imitates, to a certain extent, centroid-based extraction whilst we are using a density-based clustering technique.

One approach that is in line with how both BERTopic and KeyBERT work is to use KeyBERT to generate candidate words/phrases for each document. Then, we could use c-TF-IDF to calculate the topic representations from those candidates. This avoids centroid-based extraction at the topic level but still uses it at the document level, which might be preferable.

In other words, you would have to perform the following:

  • Use KeyBERT to extract the keywords for each document
  • Convert all keywords to a single, flat list of candidate words that can be used in BERTopic
    • This list will serve as the vocabulary of the CountVectorizer
  • Pass that list to a custom CountVectorizer: CountVectorizer(vocabulary=my_vocab)

I have not tested it out but it would be something like this:

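from bertopic import BERTopic
from keybert import KeyBERT
from sklearn.feature_extraction.text import CountVectorizer

# Extract candidates; docs is your list of document strings,
# so keywords holds one list of (keyword, score) tuples per document
kw_model = KeyBERT()
keywords = kw_model.extract_keywords(docs)

# Flatten to a deduplicated list of keyword strings
flat_keywords = sorted({k[0] for keyword in keywords for k in keyword})

# Pass to BERTopic
vectorizer_model = CountVectorizer(vocabulary=flat_keywords)
topic_model = BERTopic(vectorizer_model=vectorizer_model)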
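One detail worth checking with this approach: if KeyBERT is asked for multi-word phrases (for example through its keyphrase_ngram_range parameter), the CountVectorizer also needs an ngram_range wide enough to produce them, since with the default of (1, 1) multi-word vocabulary entries are never counted. Below is a minimal, dependency-free sketch of the flattening step that also derives a suitable range. The sample keywords and scores are made up for illustration, and build_vocabulary and ngram_bounds are hypothetical helper names, not part of BERTopic or KeyBERT.

```python
from typing import List, Tuple

# Hypothetical sample shaped like KeyBERT's output for a list of documents:
# one list of (phrase, score) tuples per document. Phrases and scores are
# invented for illustration only.
keywords_per_doc: List[List[Tuple[str, float]]] = [
    [("topic modeling", 0.71), ("clustering", 0.55)],
    [("clustering", 0.62), ("sentence embeddings", 0.58)],
]


def build_vocabulary(keywords: List[List[Tuple[str, float]]]) -> List[str]:
    """Flatten per-document keywords into a deduplicated, order-stable list."""
    seen = {}
    for doc_keywords in keywords:
        for phrase, _score in doc_keywords:
            # Lowercase to match CountVectorizer's default lowercase=True.
            seen.setdefault(phrase.lower(), None)
    return list(seen)


def ngram_bounds(vocabulary: List[str]) -> Tuple[int, int]:
    """Return the (min, max) number of whitespace-separated tokens per phrase."""
    lengths = [len(phrase.split()) for phrase in vocabulary]
    return min(lengths), max(lengths)


vocabulary = build_vocabulary(keywords_per_doc)
ngram_range = ngram_bounds(vocabulary)
```

Under these assumptions, the two values could then be passed along as CountVectorizer(vocabulary=vocabulary, ngram_range=ngram_range) before handing the vectorizer to BERTopic.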

MaartenGr • Jul 25 '22 09:07

Lovely, this would be an amazing thing to try. Let me try this and get back to you. Thank you for your prompt response.

hubgitadi • Jul 25 '22 09:07

Thank you, Maarten, this is working. I am still analyzing the results and will keep you posted.

hubgitadi • Aug 12 '22 07:08