Is lemmatization required?

Open sreemoyk opened this issue 3 years ago • 1 comments

After running the model, I observed that some of my topics contain words with the same meaning, such as 'gap' and 'gaps', or 'improvement' and 'improvements'. Lemmatization could fix this, but BERTopic recommends no preprocessing. Is it feasible to lemmatize anyway? If yes, how can we do that?

sreemoyk • Jul 26 '22 19:07

Apologies for the late reply, life has been unexpectedly hectic lately!

Indeed, preprocessing is typically not necessary, and it can change the embeddings that are created from the documents. However, you can pass a custom CountVectorizer (through the `vectorizer_model` argument) to preprocess the documents after the embeddings have been created and before the topic representations are built. There, you can define the procedure for lemmatization, for example in a custom tokenizer.
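
A minimal sketch of that idea, under a few assumptions: `naive_singularize` is a crude, hypothetical stand-in for a real lemmatizer (in practice you would plug in something like NLTK's `WordNetLemmatizer` or spaCy), and `LemmaTokenizer` and `build_topic_model` are illustrative names, not part of BERTopic. The library imports sit inside the function so the tokenizer can be tried on its own.

```python
import re


def naive_singularize(token):
    """Crude placeholder for a real lemmatizer: strips a plural 's'.

    Assumption for illustration only; swap in a proper lemmatizer.
    """
    if len(token) > 3 and token.endswith("s") and not token.endswith(("ss", "us", "is")):
        return token[:-1]
    return token


class LemmaTokenizer:
    """Callable tokenizer that lowercases, tokenizes, and normalizes each token."""

    def __init__(self, lemmatize=naive_singularize):
        self.lemmatize = lemmatize

    def __call__(self, doc):
        # Same token shape as scikit-learn's default pattern: 2+ word characters.
        tokens = re.findall(r"(?u)\b\w\w+\b", doc.lower())
        return [self.lemmatize(tok) for tok in tokens]


def build_topic_model():
    """Wire the tokenizer into BERTopic via a custom CountVectorizer."""
    from sklearn.feature_extraction.text import CountVectorizer
    from bertopic import BERTopic

    # token_pattern=None avoids scikit-learn's warning that the pattern is
    # unused when a custom tokenizer is supplied.
    vectorizer_model = CountVectorizer(tokenizer=LemmaTokenizer(), token_pattern=None)
    return BERTopic(vectorizer_model=vectorizer_model)


print(LemmaTokenizer()("Gaps and improvements"))
```

Because the vectorizer is only used for the topic representations, the embeddings are still computed on the raw, unprocessed documents when you call `fit_transform` on the returned model.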

A second thing you can try is setting the `diversity` parameter to a value larger than 0, for example 0.1. It diversifies the topic representations and typically removes words that are very similar to ones already selected.
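
To give an intuition for why this helps, here is a toy, pure-Python sketch of maximal marginal relevance (MMR), the technique behind the diversity setting. It is not the library's implementation: `mmr_select` is a hypothetical helper, and the relevance and similarity scores are made-up numbers chosen so the effect is visible with only two words selected. Also note that, depending on your BERTopic version, this option may be exposed through a representation model rather than a `diversity` argument, so check the documentation for your release.

```python
def mmr_select(relevance, pair_similarity, top_n, diversity):
    """Greedy MMR: trade off relevance to the topic against redundancy
    with the words that were already selected."""
    if top_n <= 0 or not relevance:
        return []
    candidates = sorted(relevance, key=relevance.get, reverse=True)
    selected = [candidates.pop(0)]  # start with the most relevant word
    while candidates and len(selected) < top_n:

        def score(word):
            redundancy = max(pair_similarity(word, chosen) for chosen in selected)
            return (1 - diversity) * relevance[word] - diversity * redundancy

        best = max(candidates, key=score)
        candidates.remove(best)
        selected.append(best)
    return selected


# Made-up scores: similarity of each candidate word to the topic.
relevance = {"gap": 0.90, "gaps": 0.85, "improvement": 0.80, "cost": 0.70}

# Made-up word-to-word similarity: 'gap' and 'gaps' are near duplicates,
# every other pair is only loosely related.
near_duplicates = {frozenset(("gap", "gaps")): 0.95}


def pair_similarity(a, b):
    return near_duplicates.get(frozenset((a, b)), 0.2)


print(mmr_select(relevance, pair_similarity, top_n=2, diversity=0.0))
print(mmr_select(relevance, pair_similarity, top_n=2, diversity=0.1))
```

With these toy numbers, a diversity of 0 keeps both 'gap' and 'gaps', while 0.1 is enough to let 'improvement' overtake 'gaps'. Diversity only reorders candidates, so near duplicates can still appear further down the list; lemmatization in the vectorizer is the more direct fix.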

MaartenGr • Jul 31 '22 05:07

Due to inactivity, I'll be closing this for now. Let me know if you have any other questions related to this and I'll make sure to re-open the issue!

MaartenGr • Sep 27 '22 08:09