
multiple topics containing same words only in a different order

Sudo-Truman opened this issue · 3 comments

Hi Maarten,

I am experiencing the model outputting multiple topics that contain very similar keywords, differing perhaps only in word order or in a couple of words. For example, topics 0 and 1 of a recent run are named 0_manage_quote_bingo_games and 1_quote_manage_bingo_games.

The model parameters are as follows:

```python
embedding_model=SentenceTransformer('all-MiniLM-L6-v2'),
vectorizer_model=CountVectorizer(ngram_range=(1, 1), stop_words=stop_words),
calculate_probabilities=True,
nr_topics=200,
top_n_words=10,
verbose=True,
min_topic_size=20,
```

The dataset contains ~75k documents (each doc is about 2-3 sentences on average) and the model produced 75 topics. Is this a result of requesting 200 topics when the model only "finds" 75? The idea behind requesting a high number of topics is to let the model produce as many as it wants so that we can reduce this number later to create more generalized topics if wanted. Any insight you have on this, and on methodologies for producing quality topics given any amount of data, would be great. Thanks for all the great work.

Sudo-Truman · Jul 04 '22 03:07

Each run of BERTopic will create slightly different outputs because some of the underlying algorithms are stochastic. There are also differences in which documents get assigned to which topics, but in my experience these differences are very small. You can 'freeze' models by ensuring they use the same random seed from run to run:

```python
MyModel = BERTopic()
MyModel.umap_model.random_state = 42
```

It doesn't matter which value you use for random_state; any two models created with the same seed should produce identical results.
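An equivalent and perhaps more explicit approach is to pass a seeded UMAP instance when constructing the model. A minimal sketch; the UMAP parameter values below are my understanding of BERTopic's defaults, shown only for illustration:

```python
from bertopic import BERTopic
from umap import UMAP

# Seed the dimensionality-reduction step, which is the main source
# of run-to-run variation in BERTopic.
umap_model = UMAP(n_neighbors=15, n_components=5, min_dist=0.0,
                  metric="cosine", random_state=42)
topic_model = BERTopic(umap_model=umap_model)
```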

As for the production of 75 topics: this is a function of the HDBSCAN parameters and your dataset. The default parameters you are using happen to produce 75 clusters. You can tweak the parameters and the number of topics will change; see the FAQ for more information on this, and refer to the HDBSCAN tuning documentation for a detailed explanation of how HDBSCAN parameters affect cluster formation.

As you point out, you can always try to create more topics (in your case by tweaking the HDBSCAN parameters to get more than 75) and then reduce the topics within BERTopic using nr_topics, as sketched below.
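A rough sketch of that two-step approach; the HDBSCAN values here are illustrative starting points, not recommendations:

```python
from bertopic import BERTopic
from hdbscan import HDBSCAN

# A smaller min_cluster_size tends to yield more, finer-grained clusters;
# prediction_data=True is needed if you use calculate_probabilities=True.
hdbscan_model = HDBSCAN(min_cluster_size=10, min_samples=5,
                        metric="euclidean",
                        cluster_selection_method="eom",
                        prediction_data=True)

# Generate many topics first, then let BERTopic reduce them.
topic_model = BERTopic(hdbscan_model=hdbscan_model, nr_topics=100)
```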

I've been working on HDBSCAN tuning, which is currently best done outside of BERTopic proper. You can see a discussion of this at #582, and you can also refer to this repository, which is a tutorial of sorts on tuning HDBSCAN for BERTopic. This method is a bit in the weeds, but if you are interested I'm happy to provide input and help - just start a thread in Discussions.

Hope this helps.

drob-xx · Jul 04 '22 04:07

> I am experiencing the model outputting multiple topics that contain very similar keywords, differing perhaps only in word order or in a couple of words. For example, topics 0 and 1 of a recent run are named 0_manage_quote_bingo_games and 1_quote_manage_bingo_games.

Those are interesting results! It might happen that similar topics are split because, although they use similar vocabulary, the contexts in which that vocabulary appears are quite different. It would be worthwhile to run topic_model.get_topic(0) and topic_model.get_topic(1) to see more words in each topic. Also, topic_model.get_representative_docs(1) would give you an idea of the documents that typically make up that topic. With topic modeling, human evaluation is extremely important, which is why there are many forms of visualization and ways to inspect the topics.
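A minimal sketch of that inspection, assuming a fitted model named topic_model:

```python
# Show the top (word, c-TF-IDF score) pairs for the two similar topics.
print(topic_model.get_topic(0))
print(topic_model.get_topic(1))

# Show documents that are most representative of topic 1.
for doc in topic_model.get_representative_docs(1):
    print(doc)
```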

> The dataset contains ~75k documents (each doc is about 2-3 sentences on average) and the model produced 75 topics. Is this a result of requesting 200 topics when the model only "finds" 75?

The nr_topics parameter is a bit tricky: it caps the number of topics but never raises it. For example, if the model generates 100 topics and you have set nr_topics=75, it will reduce the number of topics from 100 to 75. However, if your model generates 50 topics and you have set nr_topics=60, it will produce 50 topics, not 60. In other words, nr_topics is used to reduce the number of topics, never to increase it.
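Put as a sketch, with hypothetical topic counts in the comments just to illustrate the cap:

```python
from bertopic import BERTopic

# nr_topics only reduces the topic count; it never increases it.
topic_model = BERTopic(nr_topics=75)  # if HDBSCAN finds 100 -> reduced to 75
topic_model = BERTopic(nr_topics=60)  # if HDBSCAN finds 50  -> stays at 50
```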

> The idea behind requesting a high number of topics is to let the model produce as many as it wants so that we can reduce this number later to create more generalized topics if wanted. Any insight you have on this, and on methodologies for producing quality topics given any amount of data, would be great. Thanks for all the great work.

You can use the min_topic_size parameter to change the number of topics generated: lower it and you will get more topics, raise it and you will get fewer. Techniques like topic_model.visualize_hierarchy() and topic_model.visualize_heatmap() give you a rough idea of which topics are very similar to one another.
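For instance, a rough sketch of that workflow, where docs is assumed to be your list of documents:

```python
from bertopic import BERTopic

# A lower min_topic_size yields more, finer-grained topics.
topic_model = BERTopic(min_topic_size=10)
topics, probs = topic_model.fit_transform(docs)

# Visual checks for near-duplicate topics.
topic_model.visualize_hierarchy()  # dendrogram of topic similarity
topic_model.visualize_heatmap()    # pairwise topic similarity matrix
```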

There is a new version in the works that implements a more extensive hierarchical topic model, with the option to manually merge topics; see here. If I read your use case correctly, this might be interesting to you, as it would allow you to generate many topics and then manually merge them until you get those generalized topics.
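For later readers: manual merging is available in recent BERTopic releases via merge_topics. A minimal sketch; the topic ids are hypothetical, and in some versions the signature also requires the topics output of fit_transform:

```python
# Fold topics 1 and 2 into a single topic; exact signature varies by version.
topic_model.merge_topics(docs, topics_to_merge=[1, 2])
```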

As a final note, 75k documents is already quite a good size for training, so no worries there!

MaartenGr · Jul 04 '22 16:07

My bad for misreading the original post; I missed the point completely 🙁

drob-xx · Jul 04 '22 19:07