
Show the distribution of top k documents in each topic

Open syGOAT opened this issue 1 year ago • 8 comments

Referring to https://github.com/MaartenGr/BERTopic/issues/93#:~:text=So%2C%20only%20set%20this%20to%20True%20if%20you%20have%20less%20than%20100.000%20documents., does this mean probabilities will not be computed for a large corpus (I have 40,000+ documents), even though I set calculate_probabilities=True?

My BERTopic model is:

from umap import UMAP
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import CountVectorizer
from bertopic import BERTopic
from bertopic.vectorizers import ClassTfidfTransformer

umap_model = UMAP(n_neighbors=20, n_components=15, min_dist=0.0, metric='cosine', random_state=42)
cluster_model = KMeans(n_clusters=100, random_state=42)  # 100
vectorizer_model = CountVectorizer(stop_words="english")

ctfidf_model = ClassTfidfTransformer(
    seed_words=seed_words, 
    seed_multiplier=5
)

model = BERTopic(embedding_model='./paraphrase-MiniLM-L6-v2', 
                 umap_model=umap_model,
                 min_topic_size=50,
                 ctfidf_model=ctfidf_model,        
                 hdbscan_model=cluster_model,
                 vectorizer_model=vectorizer_model, 
                 calculate_probabilities=True
)

Then run this code:

topics, probabilities = model.fit_transform(abstracts)
print(probabilities)

The output is None. Version:

# pip show bertopic
Name: bertopic
Version: 0.16.0
Summary: BERTopic performs topic Modeling with state-of-the-art transformer models.
Home-page: https://github.com/MaartenGr/BERTopic

I know how computationally expensive it would be to compute all the probabilities. If the answer to the question at the beginning of this issue is yes, could you please add a parameter that lets users set a top k, to show the probabilities of the k most similar documents for each topic?

syGOAT avatar Mar 20 '24 03:03 syGOAT

I have read this FAQ: https://maartengr.github.io/BERTopic/faq.html#how-do-i-calculate-the-probabilities-of-all-topics-in-a-document and now understand why my probabilities is None. Thank you so much for making this useful library! In addition, .approximate_distribution models the distribution of topics in the documents; it goes from documents to topics, not from topics to documents. I would therefore keep the request above: showing the distribution of the top k documents in each topic.

syGOAT avatar Mar 20 '24 06:03 syGOAT

Referring to https://github.com/MaartenGr/BERTopic/issues/93, does this mean probabilities will not be computed for a large corpus (I have 40,000+ documents), even though I set calculate_probabilities=True?

No, the reason it does not compute probabilities is that you are using k-Means, which does not generate probabilities. Instead, if you save the model as either pytorch or safetensors, then load it and use .transform, it will generate probabilities without k-Means.

MaartenGr avatar Mar 20 '24 09:03 MaartenGr
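As an aside, the distinction can be sketched outside BERTopic: k-Means only produces hard cluster labels, but soft topic "probabilities" can be derived from each document's distance to the cluster centroids. Below is a minimal scikit-learn sketch on toy data; the softmax-over-negative-distances heuristic is purely an illustration of the idea, not what BERTopic does internally.

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy data standing in for document embeddings (hypothetical).
rng = np.random.default_rng(42)
X = np.vstack([rng.normal(0, 0.2, (50, 5)), rng.normal(3, 0.2, (50, 5))])

km = KMeans(n_clusters=2, random_state=42, n_init=10).fit(X)

# k-Means only yields hard labels (km.labels_); derive soft
# "probabilities" from distances to the centroids via a softmax.
dists = km.transform(X)                      # (n_docs, n_clusters) distances
probs = np.exp(-dists)
probs /= probs.sum(axis=1, keepdims=True)    # rows now sum to 1
```

Each row of `probs` sums to 1 and can be read as a soft assignment over clusters, which is the kind of output true probabilistic clusterers (such as HDBSCAN with `calculate_probabilities=True`) produce natively.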

@MaartenGr Thanks for your reply! Would you mind answering another question of mine?

could you please add a parameter that lets users set a top k, to show the probabilities of the k most similar documents for each topic?

In addition, .approximate_distribution models the distribution of topics in the documents; it goes from documents to topics, not from topics to documents. I would therefore keep the request above: showing the distribution of the top k documents in each topic.

syGOAT avatar Mar 20 '24 10:03 syGOAT

could you please add a parameter that lets users set a top k, to show the probabilities of the k most similar documents for each topic?

Ah, you mean the most representative documents per topic, right? If so, you currently get the top 3 most representative documents automatically. If you want more, you might need to use ._extract_representative_docs, but it would be nicer if this were a public function rather than a private one.

If someone is interested in working on this, it could be added as an additional parameter to the get_representative_docs function as a way to recalculate the most representative documents.

MaartenGr avatar Mar 20 '24 10:03 MaartenGr
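Until such a public parameter exists, the requested behaviour can be approximated outside the library. A hedged sketch, assuming you already have document embeddings and topic assignments: for each topic, pick the k documents whose embeddings are closest (by cosine similarity) to the topic's centroid. The function name and approach here are illustrative, not BERTopic's internal selection method.

```python
import numpy as np

def top_k_docs_per_topic(embeddings, topics, k=5):
    """For each topic, return indices of the k documents closest
    (by cosine similarity) to that topic's centroid embedding."""
    embeddings = np.asarray(embeddings, dtype=float)
    topics = np.asarray(topics)
    # Normalize rows so the dot product equals cosine similarity.
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    unit = embeddings / np.clip(norms, 1e-12, None)
    result = {}
    for t in sorted(set(topics.tolist())):
        idx = np.flatnonzero(topics == t)
        centroid = unit[idx].mean(axis=0)
        centroid /= max(np.linalg.norm(centroid), 1e-12)
        sims = unit[idx] @ centroid
        result[t] = idx[np.argsort(-sims)[:k]].tolist()
    return result
```

With a fitted BERTopic model you could feed this the embeddings you passed to `.fit_transform` together with `model.topics_` to rank all documents per topic, not just the top 3.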

If someone is interested in working on this, it could be added as an additional parameter to the get_representative_docs function as a way to recalculate the most representative documents.

@MaartenGr Great! I'm modifying the source code to do this. Hope I get a chance to post a PR. But I have one small question. After calling .fit_transform (my code is above), I ran this code:

model.get_topic_info().loc[0, 'Representative_Docs']

Output: (screenshot)

But when running this code:

import pandas as pd

doc_topic = pd.DataFrame({
    'Topic': model.topics_,
    'ID': range(len(model.topics_)),
    'Document': abstracts
})

repr_docs, _, _, _ = model._extract_representative_docs(
    c_tf_idf=model.c_tf_idf_,
    documents=doc_topic,
    topics=model.topic_representations_,
    nr_samples=1000,
    nr_repr_docs=5
)
print(repr_docs)

The output for topic 0 was different from the first method: (screenshot)

Could you please tell me what was going on?

syGOAT avatar Mar 21 '24 10:03 syGOAT

Well, you changed a parameter, namely nr_samples, from the original 500 to 1000, so it makes sense that the output would change. Also, make sure to include "Image": images, in the dataframe, as it might lead to errors otherwise.

MaartenGr avatar Mar 21 '24 12:03 MaartenGr
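To see why nr_samples matters: if representative documents are chosen from a random subsample of candidates, a different sample size changes the candidate pool and can change which documents survive the selection. A toy simulation of that effect (the `pick_repr` helper is hypothetical, not BERTopic's actual code):

```python
import numpy as np

rng = np.random.default_rng(0)
scores = rng.random(5000)  # hypothetical doc-to-topic similarity scores

def pick_repr(scores, nr_samples, nr_repr, seed=0):
    # Subsample nr_samples candidate docs, then keep the top nr_repr of them.
    rng = np.random.default_rng(seed)
    cand = rng.choice(len(scores), size=nr_samples, replace=False)
    return set(cand[np.argsort(-scores[cand])[:nr_repr]])

a = pick_repr(scores, nr_samples=500, nr_repr=5)
b = pick_repr(scores, nr_samples=1000, nr_repr=5)
# Different candidate pools can yield different "representative" picks.
```

So comparing runs with nr_samples=500 and nr_samples=1000 is not an apples-to-apples comparison, which matches the explanation above.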

Thank you for your reply and for your contribution to the application of clustering!

syGOAT avatar Mar 29 '24 06:03 syGOAT

No problem, glad I could be of help. Feel free to reach out if you have any other questions.

MaartenGr avatar Mar 29 '24 07:03 MaartenGr